andrewpbray closed this issue 9 years ago
+1 +1 +1 +1 !!!!
Also, I've already removed the `for` loops from the mosaic versions of the labs, so that work is already done.
The one thing I like about `for` loops is how they parallel the explanation of the sampling distribution. Do we think that's worth giving up? @beanumber, what has been your experience in tying the computation to concepts when avoiding `for` loops?
To me, `mosaic::do()` is even more intuitive for sampling distributions than `for`. Why should students have to worry about array indexes? The idea is just to do the same thing over and over again, right?
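For reference, a minimal sketch of the two approaches being compared, using the `ames` data and the `sample_means50` name that already appear in the sampling distributions lab (mosaic's data-frame method for `sample()` is assumed for the `do()` line):

```r
library(mosaic)  # provides do() and a data-frame method for sample()

# for-loop version: preallocate, index, fill
sample_means50 <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(ames$Gr.Liv.Area, 50)
  sample_means50[i] <- mean(samp)
}

# do() version: just "do the same thing over and over again"
sample_means50 <- do(5000) * mean(~Gr.Liv.Area, data = sample(ames, size = 50))
```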
I do (ha) worry that do suppresses the idea of iteration a bit. In the for loop, it's (maybe?) clear that the index is changing on every iteration. With do, you just run it once and get the sampling distribution. Just spitballing here, but would it be useful to print some output to the console to indicate that it is iterating? Not necessarily a progress bar, but maybe an asterisk for each iteration? Or if they go to 1e6 iterations, there could be a little key at the bottom that every * represents 100 iterations.
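Just to make that spitball concrete, a hypothetical sketch of what such console feedback could look like inside a plain `for` loop (nothing like this exists in the labs or in mosaic, and the 100-iteration interval is arbitrary):

```r
sample_means50 <- rep(NA, 5000)
for (i in 1:5000) {
  sample_means50[i] <- mean(sample(ames$Gr.Liv.Area, 50))
  # print an asterisk every 100 iterations so students can see the loop working
  if (i %% 100 == 0) cat("*")
}
cat("\n")  # each * above represents 100 iterations
```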
The calculations are so quick that I'm not sure the progress bar would be useful. Also, wouldn't this require a custom function? I'd prefer to avoid that for this task.
Well if this were a good enough idea, Randy might be interested in incorporating it into the mosaic package, so it wouldn't be a custom function. Course, it's probably not a good enough idea =)
In any event, if we used `do()` in place of `for` loops, we'd have to either read in mosaic or reimplement it in the openintro package. Hey, maybe that's one compelling reason to keep the oilabs package separate: it could/would depend on dplyr, ggplot2, and mosaic, which aren't lightweight dependencies. It might be poor form to bloat the dependency list of the openintro package for people who aren't doing the labs.
Check out the presentation of the `do()` function here, and compare with the original: https://github.com/andrewpbray/oiLabs/blob/master/sampling_distributions/sampling_distributions.Rmd
Is there enough there about iteration? Also, I was under the impression that `for` is generally to be avoided in R, since there is almost always a cleaner way to do whatever you are trying to do (e.g. `mosaic::do()`, `dplyr::do()`, `replicate()`, `lapply()`, etc.).
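A hedged illustration of two of those cleaner alternatives for the same repeated-sampling task (again assuming the `ames` data used elsewhere in the lab):

```r
# replicate(): evaluate one expression n times and collect the results
sample_means50 <- replicate(5000, mean(sample(ames$Gr.Liv.Area, 50)))

# sapply(): apply a function over an index, with no manual bookkeeping
sample_means50 <- sapply(1:5000, function(i) mean(sample(ames$Gr.Liv.Area, 50)))
```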
I suspect that @rpruim would be OK with you copying the source code for `mosaic::do()` directly into openintro if that is the only `mosaic` function that you wanted.
GitHub is down, so I can't comment on the `do` vs. `for` comparison.
I will say there is something to be said for leaving in the `for` loop since it is a universal programming concept. Students with no programming experience would benefit from the exposure. As Mine said, repeated sampling is a good example use. I don't feel strongly otherwise.
A few comments:

- The `do()` function in `mosaic` now detects when you are using `mean()` and labels things `mean` instead of `result`:

  ```r
  do(3) * mean(~length, data = resample(KidsFeet))
  ##       mean
  ## 1 24.41026
  ## 2 24.79744
  ## 3 25.04615
  ```

- There is no need to copy the `do()` code (which would require more than just copying the `do()` function, by the way) since you can selectively import and export (if there were a reason to do so). Alternatively, one could simply `require(mosaic)` in places where it is used.

- `do()` is going to receive another upgrade soon. The new version will make it easier for users to create and use custom culling functions.

- `for()` must be used with care if performance matters. Little things, like preallocating memory with `sample_means50 <- rep(NA, 5000)`, make a big difference (but are probably not what you want people in intro stats focusing on); see the sketch after this comment. If I were to do this without `do()`, I would use `replicate()`. It's designed for the task and easier to write.

- `do()` comes at a small cost since it is doing more. The advantage of `do()` is the extra data extraction it performs in more complicated situations. That data extraction costs some time. But `do()` itself is about as fast as it can be given the task that it is doing. (If you load the `parallel` package first, that will more than make up for the culling overhead and make `do()` faster.)

- The bigger performance cost is in the formula interface, as in `mean(~Gr.Liv.Area, data = ames)`. This is much slower than `mean(ames$Gr.Liv.Area)`, and I don't know if there is a way to speed it up. Basically, I need to see if there is a way to rewrite `maggregate()`. (There is an open issue about this in the `mosaic` package, but I haven't thought about it in a while. Perhaps I can take another look before we go 1.0 and see if we can improve this.)

  ```r
  microbenchmark(times = 1000,
    mean(ames$Gr.Liv.Area),
    mean(~Gr.Liv.Area, data = ames)
  )
  ## Unit: microseconds
  ##                            expr     min       lq     mean   median      uq      max neval cld
  ##          mean(ames$Gr.Liv.Area)  63.656  70.6805  76.4298  74.5945  78.114  169.841  1000  a
  ## mean(~Gr.Liv.Area, data = ames) 713.592 740.2755 784.6416 755.8905 784.328 3108.859  1000   b
  ```
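To make the preallocation point above concrete, a small sketch contrasting the two `for`-loop styles (the performance claim is rpruim's; the code is only illustrative and assumes the lab's `ames` data):

```r
# grows the result vector on every iteration, forcing repeated copies
sample_means50 <- c()
for (i in 1:5000) {
  sample_means50 <- c(sample_means50, mean(sample(ames$Gr.Liv.Area, 50)))
}

# preallocates once, then fills in place
sample_means50 <- rep(NA, 5000)
for (i in 1:5000) {
  sample_means50[i] <- mean(sample(ames$Gr.Liv.Area, 50))
}
```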
I think there is no compelling argument that `for()` is the natural way for anyone to think/talk about a sampling distribution or bootstrap distribution unless they have already seen for loops, which most intro stats students have not (and probably not even then). I challenge you to find an existing intro stats book that is not linked to technology/software that describes them that way. I certainly have never used that sort of language in my teaching -- even when I teach engineering and computer science students who already know about for loops. I will certainly talk about doing things repeatedly ("now imagine that we take lots of samples of the same size..."), but I don't mention initialization or indexing (two important things in the for-loop framework) or use any words that sound like informal for loops.
As for googling to find code examples, this is a dangerous thing no matter what you choose because (a) coding standards in the R community are not that high globally, and (b) there are multiple programming patterns being used.
Thanks for the updates on what's going on with the `do()` function, @rpruim.
With these labs I'm not worried much at all about performance. I think it'll be a while before these students are in a scenario where they need to take performance into consideration, and by then I'd guess they'll be more computationally mature, so it'll be no big deal to learn the more efficient formulation.
I agree that it's hard to see a fundamental link between a sampling distribution and the syntax of a for-loop. My main concern is to avoid sampling distributions that just appear at the click of a button or the run of a single line of code, at least at the beginning (this is a problem with both `for()` and `do()`). To get the idea of (re)sampling variability across, I feel like we can't get any better than shuffling cards by hand in class. It feels like a big jump to go from that to the super-easy computation in R.
Many people bridge this jump with the intermediate step of software with animations (Fathom, applets, StatKey). This would be doable in these labs with R Markdown + Shiny, at least for the first time that they see it. Course, it would require that they run the lab within their own R session, which would definitely raise the complexity of the suite of fancy tools that we're using in these labs.
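To make that idea concrete, a hypothetical sketch of the kind of embedded app this might be (none of this exists in the labs; the `ames` data and `Gr.Liv.Area` column are borrowed from the sampling distributions lab, and the fixed seed is just one way to make the display build up as the slider moves):

```r
library(shiny)
# assumes the ames data from the sampling distributions lab is already loaded

ui <- fluidPage(
  sliderInput("n_samples", "Number of samples taken so far:",
              min = 1, max = 1000, value = 1),
  plotOutput("samp_dist")
)

server <- function(input, output) {
  output$samp_dist <- renderPlot({
    # fixing the seed means moving the slider adds new sample means
    # instead of redrawing all of them from scratch
    set.seed(42)
    means <- replicate(input$n_samples, mean(sample(ames$Gr.Liv.Area, 50)))
    hist(means, breaks = 25,
         main = "Sampling distribution of the sample mean",
         xlab = "Mean of Gr.Liv.Area (n = 50)")
  })
}

shinyApp(ui = ui, server = server)
```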
@andrewpbray Oohh, some Shiny apps to complement some of the things in the labs would be really nice.
This would be a major rewrite, but (without looking through all the labs) it seems like it'd be possible to remove all references to the vector structure of R and just use a lot of `select()`. An alternative would be to not go whole hog in the data frame direction and leave some vectors in. But the arguments for the full rewrite:

Pros: instead of the `subset()` function, there'd only be `filter()` and `select()`.

Cons: it leans on the `do()` function in mosaic.

One thing that I think we would need to add if we did this is a lab that did focus on vectors, constructing data frames, and manipulating them using tidyr.
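For concreteness, a hedged sketch of the kind of substitution the full rewrite implies (using the `ames` data as a stand-in; the actual lab code may differ):

```r
library(dplyr)

# vector/base-R style currently in the labs
area      <- ames$Gr.Liv.Area
big_homes <- subset(ames, Gr.Liv.Area > 3000)

# data-frame-only style after the rewrite: no $ extraction, no subset()
area      <- ames %>% select(Gr.Liv.Area)
big_homes <- ames %>% filter(Gr.Liv.Area > 3000)
```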