bootstrapworld / curriculum

move "outliers: should they stay or should they go" to threats to validity #2252

Open retabak opened 3 weeks ago

retabak commented 3 weeks ago

While we definitely need instruction on what outliers are / how to identify them in the histograms lesson, this conversation about "keep the outlier or ditch it?" belongs elsewhere...

@lesson-point{ Outliers... do they stay or do they go? }

@right{@image{images/height-outlier.png, 300}}Suppose we survey the heights of 12-year-olds, and almost all values are clustered between 50-70in. There's a very low outlier, however, at 6in.

@QandA{ @Q{Is there really a 12-year-old who is 6 inches tall?} @A{Probably not! This could very well be a typo (maybe someone meant to type "60" instead of "6"?).} }

"Junk" data is harmful, because it can drastically change your results!
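To make that concrete, here is a minimal Python sketch of how a single mistyped value shifts a summary statistic. The heights are invented for illustration and are not taken from the lesson's dataset; the variable names are likewise hypothetical.

```python
from statistics import mean, median

# Hypothetical heights in inches (made-up values); the final 6 stands in
# for a typo where someone meant to enter 60.
heights = [52, 55, 58, 60, 61, 63, 64, 66, 68, 6]

# The same list with the suspicious value removed.
cleaned = [h for h in heights if h != 6]

mean_with_outlier = mean(heights)
mean_without_outlier = mean(cleaned)
median_with_outlier = median(heights)
median_without_outlier = median(cleaned)

print("mean with outlier:     ", round(mean_with_outlier, 1))
print("mean without outlier:  ", round(mean_without_outlier, 1))
print("median with outlier:   ", median_with_outlier)
print("median without outlier:", median_without_outlier)
```

In this made-up sample the mean moves by several inches when the 6 is included, while the median barely changes, which is one reason a single junk value is worth investigating before computing results.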

@slidebreak

@right{@image{images/stadium-outlier.png, 300}}Suppose we survey the number of minutes it takes for fans to find their seats at a stadium, and almost all values are clustered between 4-16 minutes. There's a very high outlier, however, at 35 minutes.

@QandA{ @Q{Did it really take someone 35 minutes to find their seat?} @A{It's very possible! Maybe it's someone who takes a long time getting up stairs, or someone who had to go far out of their way to use the wheelchair ramp!} }

An outlier could also be a really important part of your analysis!

@slidebreak

For a data scientist, an outlier is always a reason to look closer. Whether you decide to keep it or remove it from your dataset, make sure you explain your reasons in your write-up!

@lesson-instruction{ With your partner, complete @printable-exercise{outliers-discussion.adoc}.}

These points are called unusual observations. Unusual observations in a scatter plot are like outliers in a histogram, but more complicated because it’s the combination of x and y values that makes them stand apart from the rest of the cloud.

@slidebreak

@lesson-point{ Unusual observations are always worth thinking about! }

  • Sometimes unusual observations are just random. Felix seems to have been adopted quickly, considering how much he weighs. Maybe he just met the right family early, or maybe we find out that he lives nearby, got lost, and his family came to get him. In that case, we might need to do some deep thinking about whether or not it’s appropriate to remove him from our dataset.

@slidebreak

  • Sometimes unusual observations can give you a deeper insight into your data. Maybe Felix is a special, popular (and heavy!) breed of cat, and we discover that our dataset is missing an important column for breed!

@slidebreak

  • Sometimes unusual observations are the points we are looking for! What if we wanted to know which restaurants are a good value, and which are rip-offs? We could make a scatter plot of restaurant reviews vs. prices, and look for an observation that’s high above the rest of the points. That would be a restaurant whose reviews are unusually good for the price. An observation way below the cloud would be a really bad deal.
  • [ ] Generally, make sure that we are actually providing examples and practice for each of the kinds of threats to validity that are mentioned. While making the assessments in summer 2024, Rachel flagged that the work we're asking students to do on the pages in this lesson doesn't align with instruction.

schanzer commented 3 weeks ago

Just to record the history, Rachel and I spoke about this by phone and the end product is https://github.com/bootstrapworld/curriculum/issues/2252

flannery-denny commented 1 week ago

I'm looking at the scatter plots lesson and the third section "looking for trends" is super scattered and needs a rewrite. (I've started working on this lesson on the alg2-split branch.) It also assumes that outliers are being discussed in histograms.

These points are called unusual observations. Unusual observations in a scatter plot are like outliers in a histogram, but more complicated because it’s the combination of x and y values that makes them stand apart from the rest of the cloud.

I'm wondering if I've misunderstood this issue and @retabak is planning to discuss outliers in the histograms lesson, just not with this page, or whether we need to decide between moving all discussion of outliers to threats to validity and discussing them in both histograms and scatter plots. Adding it to tomorrow's curriculum meeting agenda in case we need to discuss.

If we're going to continue to discuss outliers in both lessons, I'm going to move this page to scatter plots.

retabak commented 6 days ago

Students will still learn what an outlier is during the dot plots lesson, and they will think about how outliers affect shape. That said, I believe that the conversation about "should they stay or should they go" is a distraction from other foundational content in these early lessons.