carpentries-incubator / targets-workshop

Pre-alpha {targets} workshop
https://carpentries-incubator.github.io/targets-workshop/

Crew-based execution #11

Closed: multimeric closed this issue 10 months ago

multimeric commented 1 year ago

Hi, thanks for writing this workshop. I'm interested in presenting a version of it (which I will probably fork). One thing I want to add for my own version is the new-style batch execution via crew, which is demonstrated in the user guide. Would you accept a PR to switch the future-based execution to crew-based multiprocessing? This would then allow me to add a further section on HPC execution, which also uses crew (but is probably out of scope for what you want).

joelnitta commented 1 year ago

Thanks @multimeric for the idea! I had heard from Will about crew but did not have the chance to make the switch before teaching the first workshop. I think refactoring the workshop to use crew instead of future is good, since the user guide clearly shows it works as a drop-in replacement for local parallelization. I think an additional episode on HPC execution would be OK, as long as it is clear that it is an advanced topic that should only be taught if the participants need that sort of thing. BTW I am considering splitting branching and parallel processing into two separate episodes (https://github.com/joelnitta/targets-workshop/issues/8), so please keep that in mind when you submit your PR (it could resolve #8 in addition to this).
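
For readers following the thread, here is a minimal sketch of what crew-based local parallelization might look like in a `_targets.R` file. It assumes the targets, crew, and palmerpenguins packages are installed; the target names and model formulas are illustrative and not taken from the workshop itself.

```r
# _targets.R (illustrative sketch, not the workshop's actual pipeline)
library(targets)
library(crew)

# Assumption: two local worker processes are enough for a demo.
# The controller tells targets to dispatch targets to crew workers.
tar_option_set(
  controller = crew_controller_local(workers = 2)
)

# Hypothetical targets: one data target and two independent models.
# The two models depend only on the data, so they can run in parallel.
list(
  tar_target(penguins_data, palmerpenguins::penguins),
  tar_target(
    combined_model,
    lm(bill_depth_mm ~ bill_length_mm, data = penguins_data)
  ),
  tar_target(
    species_model,
    lm(bill_depth_mm ~ bill_length_mm + species, data = penguins_data)
  )
)
```

With this file in the project root, calling `targets::tar_make()` should run the pipeline using the crew workers.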

joelnitta commented 1 year ago

Also, BTW I submitted this lesson to the Carpentries Incubator, so at some point soon the organization for the repo may change. Just a heads-up.

multimeric commented 1 year ago

That all sounds good! I actually agree that branching could be a separate topic because it's a bit orthogonal to batch execution. I might wait a little for you to do that refactoring, and then I'll put in a PR for crew and then subsequently for HPC if that's okay.

joelnitta commented 1 year ago

@multimeric Branching and parallel processing have been split up, and the parallel processing episode now uses a palmerpenguins example to minimize context-switching for participants.

https://github.com/joelnitta/targets-workshop/pull/12

multimeric commented 1 year ago

I'm working on the HPC episode. I'd like to find a technique we can apply to the palmerpenguins data that takes a bit longer to run without using sleeps, so it's more of a real-life example. I was thinking of training a classification forest to predict species, maybe using a grid search to slow it down further. What do you think of that idea?

joelnitta commented 1 year ago

I agree that we should make the examples as close to "real-life" as practically possible, while balancing the additional mental load placed on learners. In this case the classification model sounds to me like a pretty advanced topic if you aren't already familiar with that kind of thing. I would prefer to keep it to lm() for this lesson (which still may be a bit of a stretch for people who haven't used it before).

multimeric commented 1 year ago

Yeah, I understand why you would want that. The issue is that lm() is so fast that we will always have to artificially slow it down. Even lasso regression, which I just tried, is no slower.

joelnitta commented 1 year ago

I suppose you could add an instructor note with code for an alternative. If both the instructor and participants are sufficiently familiar with classification models in R they should be able to use it.

multimeric commented 1 year ago

Do they need to understand the details of the model? Could I e.g. use a regression tree with the same goal as lm but that is slower?

joelnitta commented 1 year ago

> Do they need to understand the details of the model?

I don't think so. The "typical learner" here is somebody doing data analysis, but not necessarily with linear models (or models at all for that matter). I just included a model to try and make things realistic and interesting. The end result (the sign of the slope differs depending on whether you take species into account or not) is interesting and works well with linear models.
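
The slope result mentioned above can be reproduced with a short sketch like the following. It assumes the palmerpenguins package is installed and uses bill depth regressed on bill length as the model; the variable names are illustrative and may not match the lesson's code exactly.

```r
library(palmerpenguins)

# Pooled model: ignores species entirely.
combined_model <- lm(bill_depth_mm ~ bill_length_mm, data = penguins)

# Species-adjusted model: one intercept per species, shared slope.
species_model <- lm(bill_depth_mm ~ bill_length_mm + species, data = penguins)

# Extract the bill-length slope from each fit.
slope_pooled <- coef(combined_model)[["bill_length_mm"]]
slope_by_species <- coef(species_model)[["bill_length_mm"]]

# The pooled slope is negative, while the species-adjusted slope is positive.
print(sign(slope_pooled))
print(sign(slope_by_species))
```

Rows with missing measurements are dropped by `lm()` by default, so no extra cleaning is needed for this comparison.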

> Could I e.g. use a regression tree with the same goal as lm but that is slower?

If the implementation is straightforward, possibly. I personally don't have experience with this topic, so it's hard for me to say. But if you can run it with one function, and obtain the output needed for plotting with a small number of functions (the current lesson uses broom::tidy() and broom::glance()), I think it could work.

multimeric commented 1 year ago

This is still something I think would be useful, but I've realised that the current linear modelling "problem" is so easy that it is very difficult to find a natural extension to it that is slow enough to demonstrate performance considerations. I think it might make sense to use an ML problem throughout the workshop instead. Anyway, in the short term I'll keep the current problem for the sake of a minimal PR.