OpenIntroStat / oilabs-tidy

👩🏿‍💻 OpenIntro Labs in R using the tidyverse design philosophy, grammar, and data structures
http://openintrostat.github.io/oilabs-tidy/
Creative Commons Attribution Share Alike 4.0 International

Convert everything to data frames and remove square bracket notation? #2

Closed andrewpbray closed 9 years ago

andrewpbray commented 9 years ago

This would be a major rewrite, but (without looking through all the labs) it seems like it'd be possible to remove all references to the vector structure of R and just use a lot of select(). An alternative would be to not go whole hog in the data frame direction and leave some vectors in. But the arguments for the full rewrite:

Pros:

Cons:

One thing that I think we would need to add if we did this is a lab that did focus on vectors, constructing data frames, and manipulating them using tidyr.

beanumber commented 9 years ago

+1 +1 +1 +1 !!!!

Also, I've already removed the for loops from the mosaic versions of the labs, so that work is already done.

mine-cetinkaya-rundel commented 9 years ago

The one thing I like about for loops is how they parallel the explanation of the sampling distribution. Do we think that's worth giving up? @beanumber what has been your experience in tying the computation to concepts when avoiding for loops?

beanumber commented 9 years ago

To me, mosaic::do() is even more intuitive for sampling distributions than for. Why should students have to worry about array indexes? The idea is just to do the same thing over and over again, right?

andrewpbray commented 9 years ago

I do (ha) worry that do suppresses the idea of iteration a bit. In the for loop, it's (maybe?) clear that the index is changing on every iteration. With do, you just run it once and get the sampling distribution. Just spitballing here, but would it be useful to print some output to the console to indicate that it is iterating? Not necessarily a progress bar, but maybe an asterisk for each iteration? Or if they go to 1e6 iterations, there could be a little key at the bottom that every * represents 100 iterations.
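As a rough illustration of the "asterisk per iteration" idea, here is a minimal base R sketch. It is not code from mosaic or from the labs; the simulated `population` and the names `n_iter`, `every`, and `sample_means` are hypothetical stand-ins chosen for the example.

```r
# Hypothetical stand-in for the lab data: 2,000 simulated house areas.
set.seed(1)
population <- rnorm(2000, mean = 1500, sd = 500)

n_iter <- 500                      # number of samples to draw
every  <- 100                      # print one asterisk per `every` iterations
sample_means <- numeric(n_iter)    # storage for the sampling distribution

for (i in seq_len(n_iter)) {
  # Draw one sample of size 50 and record its mean.
  sample_means[i] <- mean(sample(population, size = 50))
  # Visible sign that the loop is iterating.
  if (i %% every == 0) cat("*")
}
cat("\n(each * represents", every, "iterations)\n")
```

With `every <- 1` this prints one asterisk per sample, which is only practical for small `n_iter`.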

mine-cetinkaya-rundel commented 9 years ago

The calculations are so quick that I'm not sure the progress bar will be useful. Also, wouldn't this require using a custom function? I'd prefer to avoid that for this task.

andrewpbray commented 9 years ago

Well if this were a good enough idea, Randy might be interested in incorporating it into the mosaic package, so it wouldn't be a custom function. Course, it's probably not a good enough idea =)

In any event, if we used do() in place of for loops, we'd have to either load mosaic or reimplement do() in the openintro package. Hey, maybe that's one compelling reason to keep the oilabs package separate: it could/would depend on dplyr, ggplot2, and mosaic, which aren't lightweight dependencies. Might be poor form to bloat the dependency list of the openintro package for people that aren't doing the labs.

beanumber commented 9 years ago

Check out the presentation of the do function here:

https://github.com/beanumber/oiLabs-mosaic/blob/master/sampling_distributions/sampling_distributions.Rmd

and compare with the original:

https://github.com/andrewpbray/oiLabs/blob/master/sampling_distributions/sampling_distributions.Rmd

Is there enough there about iteration? Also, I was under the impression that for is generally to be avoided in R, since there is almost always a cleaner way to do whatever you are trying to do (e.g. mosaic::do(), dplyr::do(), replicate, lapply, etc.).
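For comparison, a minimal base R sketch of the same repeated-sampling task written three ways: an explicit for loop, replicate(), and vapply(). mosaic::do() is left out so that the sketch runs without extra packages; the simulated `population` and all variable names are assumptions made for the example, not taken from either lab.

```r
set.seed(42)
# Hypothetical skewed population standing in for a variable like house area.
population <- rexp(5000, rate = 1 / 1500)
n_rep <- 1000

# 1. Explicit for loop: the index and the storage vector are both visible.
means_for <- numeric(n_rep)
for (i in seq_len(n_rep)) {
  means_for[i] <- mean(sample(population, 50))
}

# 2. replicate(): evaluate the same expression n_rep times and collect the results.
means_rep <- replicate(n_rep, mean(sample(population, 50)))

# 3. vapply(): apply a function over 1..n_rep, declaring that each result is one number.
means_vap <- vapply(
  seq_len(n_rep),
  function(i) mean(sample(population, 50)),
  numeric(1)
)
```

Each version draws its own random samples, so the three vectors differ in their values but describe the same sampling distribution.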

I suspect that @rpruim would be OK with you copying the source code for mosaic::do() directly into openintro if that is the only mosaic function that you wanted.

rudeboybert commented 9 years ago

github is down, so I can't comment on the do vs for comparison.

I will say there is something to be said about leaving in the for loop since this is a universal programming concept. Students with no programming experience would benefit from exposure. As Mine said, repeated sampling is a good example use. I don't feel strongly otherwise.

rpruim commented 9 years ago

A few comments:

  1. It looks like you don't have the most recent version of mosaic. The do() function now detects when you are using mean() and labels things mean instead of result:
do(3) * mean( ~length, data = resample(KidsFeet))
##       mean
## 1 24.41026
## 2 24.79744
## 3 25.04615
  2. There is no need to copy the do() code (which would require more than just copying the do() function, by the way) since you can selectively import and export (if there were a reason to do so). Alternatively, one could simply require(mosaic) in places where it is used.
  3. do() is going to receive another upgrade soon. The new version will make it easier for users to create and use custom culling functions.
  4. for() must be used with care if performance matters. Little things, like preallocating memory with sample_means50 <- rep(NA, 5000) make a big difference (but are probably not what you want people in intro stats focusing on). If I were to do this without do(), I would use replicate(). It's designed for the task and easier to write.
  5. Using do() comes at a small cost since it is doing more. The advantage of do() is the extra data extraction it performs in more complicated situations. That data extraction costs some time. But do() itself is about as fast as it can be given the task that it is doing. (If you load the parallel package first, that will more than make up for the culling overhead and make do() faster.)
  6. The real time bottleneck at the moment is commands like mean( ~ Gr.Liv.Area, data=ames). This is much slower than mean(ames$Gr.Liv.Area), and I don't know if there is a way to speed it up. Basically, I need to see if there is a way to rewrite maggregate(). (There is an open issue about this in the mosaic package, but I haven't thought about it in a while. Perhaps I can take another look before we go 1.0 and see if we can improve this.)
microbenchmark( times = 1000, 
  mean(ames$Gr.Liv.Area),
  mean(~Gr.Liv.Area, data=ames)
)
## Unit: microseconds
##                             expr     min       lq     mean   median      uq      max neval cld
##           mean(ames$Gr.Liv.Area)  63.656  70.6805  76.4298  74.5945  78.114  169.841  1000  a 
##  mean(~Gr.Liv.Area, data = ames) 713.592 740.2755 784.6416 755.8905 784.328 3108.859  1000   b

rpruim commented 9 years ago

I think there is no compelling argument that for() is the natural way for anyone to think/talk about a sampling distribution or bootstrap distribution unless they have already seen for loops, which most intro stats students have not (and probably not even then). I challenge you to find an existing intro stats book that is not linked to technology/software that describes them that way. I certainly have never used that sort of language in my teaching -- even when I teach engineering and computer science students who already know about for loops. I will certainly talk about doing things repeatedly ("now imagine that we take lots of samples of the same size..."), but I don't mention initialization or indexing (two important things in the for-loop framework) or use any words that sound like informal for loops.

As for googling to find code examples, this is a dangerous thing no matter what you choose because (a) coding standards in the R community are not that high globally, and (b) there are multiple programming patterns being used.

andrewpbray commented 9 years ago

Thanks for the updates on what's going on with the do() function, @rpruim.

With these labs I'm not worried much at all about performance. I think it'll be a while before these students are in a scenario where they need to take performance into consideration, and by then I'd guess they'll be more computationally mature, so it'll be no big deal to learn the more efficient formulation.
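For reference, a minimal base R sketch of what the "more efficient formulation" might look like: a loop that grows its result vector, the same loop with preallocated storage (the rep(NA, 5000) pattern mentioned earlier in the thread), and replicate(). The simulated `population` and the variable names are hypothetical, and no timings are claimed here.

```r
set.seed(1)
# Hypothetical population standing in for a variable like house area.
population <- runif(10000, min = 500, max = 3000)
n_rep <- 5000

# Growing the result one element at a time: simple, but the vector is copied repeatedly.
set.seed(7)
grown <- c()
for (i in seq_len(n_rep)) {
  grown <- c(grown, mean(sample(population, 50)))
}

# Preallocating the storage, then filling it in by index.
set.seed(7)
prealloc <- rep(NA, n_rep)
for (i in seq_len(n_rep)) {
  prealloc[i] <- mean(sample(population, 50))
}

# replicate(): no index or storage to manage.
set.seed(7)
repl <- replicate(n_rep, mean(sample(population, 50)))
```

Resetting the seed before each version means the same samples are drawn each time, so any difference between the approaches is in bookkeeping and speed, not in the resulting sampling distribution. Wrapping each version in system.time() is one way to compare them on a given machine.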

I agree that it's hard to see a fundamental link between a sampling distribution and the syntax of a for-loop. My main concern is to avoid sampling distributions that just appear at the click of a button or the run of a single line of code, at least at the beginning (this is a problem with both for() and do()). To get the idea of (re)sampling variability across, I feel like we can't get any better than shuffling cards by hand in class. It feels like a big jump to go from that to the super-easy computation in R.
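One possible bridge from the card-shuffling activity to R is to make a single shuffle its own visible step before repeating it. The sketch below is a base R illustration under assumed data: the 20 simulated outcomes, the two groups "A" and "B", and the helper `one_shuffle()` are all hypothetical.

```r
set.seed(2015)
# Hypothetical "deck": 20 cards, each with an outcome written on it,
# and a group label recording which pile each card was dealt to.
outcome <- c(rnorm(10, mean = 10), rnorm(10, mean = 11))
group   <- rep(c("A", "B"), each = 10)

# Observed difference in group means (B minus A).
observed <- mean(outcome[group == "B"]) - mean(outcome[group == "A"])

# One shuffle of the deck: reassign the group labels at random
# and recompute the difference in means.
one_shuffle <- function() {
  shuffled <- sample(group)
  mean(outcome[shuffled == "B"]) - mean(outcome[shuffled == "A"])
}

# A single shuffle, as done by hand in class.
one_shuffle()

# Many shuffles, once the single step is understood.
shuffle_diffs <- replicate(1000, one_shuffle())
```

Running one_shuffle() a few times by itself mirrors dealing the cards by hand; replicate() then automates the repetition.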

Many people bridge this jump with the intermediate step of software with animations - Fathom, applets, StatKey. This would be doable in these labs with R Markdown + Shiny, at least for the first time that they see it. Course it would require that they run the lab within their own R session, which would definitely raise the complexity of the suite of fancy tools that we're using in these labs.

beanumber commented 9 years ago

@andrewpbray Oohh, some Shiny apps to complement some of the things in the labs would be really nice.