Open smangham opened 1 month ago
Thanks for the feedback @smangham. I've mulled it over for a while, and the TL;DR of my thoughts is: yep, I think you're right.
I think the episode could start at the "Realistic scenario" section, without losing much.
I went ahead and did a very quick version of this for the workshop we're delivering this week (currently deployed here; hopefully I'll improve it), and I wouldn't say anything was lost except for a nicer polynomial-shaped dataset (Anscombe II)!
It's a bit better, but I'm not quite happy with it yet.
I think the "goal" of the regression lesson isn't clear, and it's probably trying to do too much heavy lifting at once. Off the top of my head, it's trying to do several things simultaneously.
It could probably do with a step back and a rewrite from a clearer "data science/hands-on" lens, and that will likely bleed into the classification episode too.
We just ran the workshop for the University of Southampton Astronomy & Astrophysics group, and I've got a few comments.
The regression section is a bit of a hit to the pacing of the workshop. It covers material that, realistically, most people will already be very familiar with, yet it takes almost an hour and involves writing a lot of code. Most of that code is boilerplate: it's definitely useful for the exercises at the end, but its necessity isn't clear during the taught section. Whilst emphasising that understanding the stats of your dataset, and using functions as building blocks, is important, I think it sort of loses the audience - especially a more statistically/computationally literate one. If the workshop requires a baseline of statistical literacy, that would be better served as a separate, explicit prerequisite workshop.
Plus, the idea of setting up a framework to show how, with `sklearn`, you can easily train and compare different models is good, but it doesn't quite do that, as it creates a lot of bespoke functions for each model. I think the episode could start at the "Realistic scenario" section without losing much. You could then potentially illustrate how the same basic structure can fit a few different model types, or alternatively focus on errors a bit more (e.g. for a fit, which points are 1-2-3σ off, or what's the 1σ range of fits) to add depth. That would mean the first two episodes use the same dataset, instead of switching.
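To sketch what I mean by "the same basic structure can fit a few different model types": because every `sklearn` estimator shares the `fit`/`predict` interface, one loop can train and compare models without a bespoke function per model, and the same residuals can flag points more than a few σ off the fit. This is a minimal illustration with made-up data and names, not code from the lesson:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up dataset: a noisy straight line (stand-in for the lesson's data).
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0 + rng.normal(scale=0.5, size=50)

# Any estimator with fit/predict slots into the same structure.
models = {
    "linear": LinearRegression(),
    "poly2": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
}

results = {}
for name, model in models.items():
    model.fit(x, y)                 # identical interface for every model
    pred = model.predict(x)
    rmse = np.sqrt(mean_squared_error(y, pred))
    # Residuals in units of sigma: which points are >2 sigma off the fit?
    residuals = y - pred
    sigma = np.std(residuals)
    n_outliers = int(np.sum(np.abs(residuals) > 2 * sigma))
    results[name] = (rmse, n_outliers)
    print(f"{name}: RMSE={rmse:.3f}, points >2 sigma off: {n_outliers}")
```

The point is that swapping in another regressor is one extra dictionary entry, not another bespoke fitting function.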
As a side note, I don't think it's necessary for all the figures to include both the fit in green and the fit's predictions at the data's X-values as red crosses. Everyone should be familiar enough with the concept of a fit that simply plotting the fit in red would be sufficient.