carpentries-lab / python-aos-lesson

Python for Atmosphere and Ocean Scientists
https://carpentries-lab.github.io/python-aos-lesson/

Finish large data lesson #33

Closed DamienIrving closed 3 years ago

DamienIrving commented 3 years ago

For now I've put the new large data lesson in a new development directory (development/10-large-data.md) so it doesn't appear on the website until we've sorted a few issues.

By using OPeNDAP, when we time how long it takes to calculate the daily maximum we conflate the time taken to access the data over the network with the time taken to actually process it. On my laptop with the data stored locally, it takes 2 minutes to calculate the daily maximum on 1 core and 8 minutes on 4 cores (see this notebook). (And for some reason I'm still getting NaNs...) Accessing the data via OPeNDAP means the calculation can take anywhere from 20 minutes to over an hour, depending on internet traffic in my neighborhood (at least I think that's the cause of the variability).
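For reference, the timing comparison I'm doing looks roughly like the sketch below, using a synthetic array in place of the real CMIP precipitation file (the chunk size and data shape here are illustrative, not the values from the notebook):

```python
import time

import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the real precipitation file: hourly values for 10 days.
times = pd.date_range("2000-01-01", periods=240, freq="h")
pr = xr.DataArray(
    np.random.rand(240, 4, 4),
    dims=("time", "lat", "lon"),
    coords={"time": times},
    name="pr",
)

# Chunk along the time axis so dask can split the work across cores.
pr = pr.chunk({"time": 48})

start = time.perf_counter()
daily_max = pr.resample(time="1D").max().compute()  # triggers the dask computation
print(f"elapsed: {time.perf_counter() - start:.2f} s, days: {daily_max.sizes['time']}")
```

Swapping the local file for an OPeNDAP URL in `xr.open_dataset` is the only change needed for the remote case, which is why the network time ends up folded into the measurement.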

One solution would be for the instructor to have the data stored locally on their laptop. We could add a note in the lesson telling the instructor to download the required data from their closest ESGF node. We already expect that students will watch the instructor live code an example rather than run the code themselves, so we can restrict the exercises to questions that don't require downloading the data. For a given instructor on a given computer the processing time would then be repeatable, and we could hopefully construct a data processing task that is faster in parallel than in serial (and short enough to run live).

The problem with that solution is that a data processing task that is faster in parallel than in serial on one computer (and that runs quickly enough to do live) might not be on another. I guess we could tell the instructor to show the lesson notes on the screen rather than running the commands themselves if they aren't able to download/store the data on their laptop or process it in an acceptable time live.

Thoughts, @hot007?

hot007 commented 3 years ago

I think there's a lot of confounding factors with how we run this. As it is, you're relying on

If we store a copy of the data locally, then we can avoid the network issues, and conveniently we can then let dask choose its own chunk size, as we'd no longer be limited by keeping under TDS request limits. However, as you say, the instructor needs advance warning to download the data; it's not exactly small. Minimizing background tasks could also make a substantial difference here, but an instructor would usually only have their browser, video chat and coding environment open, so that should be okay (those background processes presumably consume minimal resources).

I tend to agree, for this lesson we may just need some fairly substantial notes to instructors to run through options themselves in preparation and determine a method that works best for them, and it's possible that the solution is just going through the notes without executing code themselves if they can't get a setup that works fairly quickly. I think that would be the least preferred option though as it means the students don't see the code executing successfully, which potentially diminishes trust/understanding.

I realise, looking at the proposed timeline of the training, that this might not work, but one possibility, if the remote TDS option does work for an instructor but is slow, is to start the task and leave the dask dashboard on the screen to show execution status while the instructor deals with other things for ~15 min, e.g. a tea/stretch break, targeted discussion (formative assessment), prep for the post-workshop survey, etc.

In general though I think it would be highly advisable for the instructor to have a local copy of the dataset, and if taught outside Australia/Oceania, definitely use a dataset available on the nearest ESGF node!

As for your NaN issue, I'm not sure what to say, other than "it worked for me"! :(