beanumber / oiLabs-mosaic

Source files for OpenIntro Statistics labs

Inference for categorical data lab #24

Closed rudeboybert closed 8 years ago

rudeboybert commented 9 years ago

OYO Q1

(Also noted upstream in the dplyr/ggplot lab): In the On Your Own section, Q1.a) you ask students to "Form confidence intervals for the true proportion of atheists in both years, and determine whether they overlap." I think a better approach is via a single confidence interval on the difference in proportions.

Even though two individual confidence intervals may overlap, suggesting the proportions are not significantly different, the confidence interval for the difference might still exclude zero, indicating that they are in fact different. (If you need an example of this, let me know.) This is a common misinterpretation of bar plots with error bars (i.e., dynamite plots).
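To make the point concrete, here is a minimal sketch with made-up counts (not data from the lab): the two individual 95% intervals overlap slightly, yet the interval for the difference excludes zero.

```r
# Hypothetical counts chosen to illustrate the point
n  <- 1000
x1 <- 500   # year 1: 500/1000 atheists (made-up)
x2 <- 560   # year 2: 560/1000 atheists (made-up)

# 95% CI for a single proportion
ci <- function(x, n) {
  p <- x / n
  p + c(-1, 1) * 1.96 * sqrt(p * (1 - p) / n)
}

ci(x1, n)  # roughly (0.469, 0.531)
ci(x2, n)  # roughly (0.529, 0.591) -- overlaps the first interval

# 95% CI for the difference in proportions
p1 <- x1 / n; p2 <- x2 / n
se_diff <- sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
(p2 - p1) + c(-1, 1) * 1.96 * se_diff  # roughly (0.016, 0.104), excludes 0
```

So "the intervals overlap" and "the difference is significant" can both be true at once, which is exactly why the single interval on the difference is the better question to ask.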

Success-Failure Condition Section

I get that you are trying to avoid using a for loop in the following line of code:

p_hats <- do(5000) * tally(~resample(responses, size = n, prob = c(p, 1-p)), format = 'proportion')

but this is a lot of nested parentheses/arguments for the students to digest. The dplyr/ggplot lab uses a for loop and thus allows for a breakdown of the nested tally/resample. Perhaps show one instance of tally/resample on its own and then do the repeating?
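Something like the following breakdown is what I have in mind (a sketch, assuming the mosaic package is loaded and that responses is the two-level vector from the lab; the values of p and n here are placeholders):

```r
library(mosaic)

# Placeholder values standing in for the lab's setup
p <- 0.1
n <- 87
responses <- c("atheist", "non_atheist")

# Step 1: draw ONE resample of size n, with the given probabilities
one_sample <- resample(responses, size = n, prob = c(p, 1 - p))

# Step 2: tally the proportion of each response in that one sample
tally(~one_sample, format = "proportion")

# Step 3: only now wrap the whole thing in do() to repeat it 5000 times
p_hats <- do(5000) * tally(~resample(responses, size = n, prob = c(p, 1 - p)),
                           format = "proportion")
```

Students then meet resample() and tally() one at a time before seeing them composed inside do().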

Exercise 10

IMO, the fact that histogram() doesn't play nicely with par(mfrow = c(2, 2)) defeats the purpose of this exercise. Having simultaneous comparisons of the sampling distribution for different (n, p) pairs really brings home the necessity of the success/failure condition. A trellis plot, like in the Sampling Distributions lab, would be ideal. FWIW, I showed students a Shiny app whose title emphasizes np and n(1 - p) to compensate.

beanumber commented 9 years ago

See (https://github.com/ProjectMOSAIC/mosaic/issues/551#issuecomment-157389718) on the success-failure condition.

I see your point about the code. I'm trying to come up with a piped solution that works.

I don't see how the for loop is a better solution, though, since now you have to introduce the concept of a for loop in a situation where there is no logical need for it. The better solution would be (as you suggest) to break the do loop down into smaller pieces.

beanumber commented 9 years ago

Regarding Exercise 10: histogram() and par are fundamentally incompatible, I believe.

Here, it may be worth emphasizing the tidy data solution, since this is exactly where I think lattice graphics shine.

The hard part may be getting students to not think about four separate data frames, but rather one long data frame that has columns for n and p.
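A sketch of what that one long data frame could look like (the (n, p) pairs and the base-R simulation standing in for resample/tally are illustrative, not from the lab):

```r
library(lattice)

# Hypothetical (n, p) pairs
params <- expand.grid(n = c(40, 400), p = c(0.1, 0.02))

# One long data frame of simulated p-hats, with columns for n and p,
# instead of four separate data frames
sim <- do.call(rbind, lapply(seq_len(nrow(params)), function(i) {
  n <- params$n[i]
  p <- params$p[i]
  # base-R stand-in for the lab's resample/tally simulation
  p_hat <- replicate(5000, mean(runif(n) < p))
  data.frame(n = n, p = p, p_hat = p_hat)
}))

# lattice then gives the 2x2 trellis of sampling distributions directly
histogram(~p_hat | factor(n) + factor(p), data = sim)
```

With the data in long form, the conditioning formula does all the layout work that par(mfrow = ...) was supposed to do.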

rudeboybert commented 9 years ago

Yep, similar to what was done in the Sampling Distributions lab. This is an argument for having a "tidy data" section in the Intro to Data Lab @andrewpbray mentioned.

andrewpbray commented 9 years ago

I think it's telling that the Intro to Data lab is already the longest lab, and yet there are still more concepts and skills that should be added, such as tidy data. It seems like the two solutions would be to a) open up more space early in the course for data wrangling, or b) figure out what the first steps in data wrangling are, put those in the Intro to Data lab, and then follow up with the remainder in a later lab (grouped operations, perhaps).

AmeliaMN commented 9 years ago

I think some of you recently talked in a big-picture way about how the labs are going to be managed going forward (it sounds like a distributed approach?), but we should probably also talk as a large group about the list of things we really want students to learn and consider how to divvy them up.

Ideally, there would be interesting questions to solve in each lab and you would learn some data wrangling skills that would help you with that particular problem and also be useful for the next problems.

As an aside, maybe someone who was on that recent call could summarize it on the wiki of one of these repos, so all the more tangential contributors can see what the new best practices are?

rudeboybert commented 9 years ago

Amelia's idea is an interesting one. Along the lines of our discussions of the twin goals of the inference() function (a tool to teach the textbook material and a tool for actual analysis), each lab concurrently covers topics from two prongs:

  1. Demonstrate the material covered in the textbook: simulations, sampling distributions, probability. The original purpose of the labs.
  2. Teach them data wrangling/tidying/visualization skills/concepts: what we've been doing the last few months.

This would take the pressure off having all of prong 2 being in a single "Intro to Data Lab". As Amelia said, we'd need to make an explicit list of data wrangling/tidying/visualization skills/concepts we want students to acquire, divvy the tasks up, and then spread them over all the labs in explicitly labeled sections with the same header.

beanumber commented 9 years ago

Yes, but I think we should take some care to make most of the labs fairly self-contained, since people teach them in different orders, and it's not necessarily OK to assume that someone doing one lab has done the previous labs.

For example, in our class (with @AmeliaMN) we did Intro to R and then Intro to Data, but then jumped straight to Simple Linear Regression. @nicholasjhorton also does it this way.

beanumber commented 8 years ago
p_hats <- 
  do(50) * 
  responses %>%
  resample(size = n, prob = c(p, 1-p)) %>%
  tally(format = "proportion")