carpentries-incubator / geospatial-python

Introduction to Geospatial Raster and Vector Data with Python
https://carpentries-incubator.github.io/geospatial-python/

Episode on parallel raster computations #90

Closed fnattino closed 2 years ago

fnattino commented 2 years ago

This episode will address the second part of #82, including the initial feedback from @rbavery on #86 .

fnattino commented 2 years ago

Hi @rbavery, sorry for taking so long with this episode. Would you be willing to give it a look and let me know your general impression?

I struggled a bit to find an example that would be fast enough to run locally (especially with the data downloading), but also intensive enough to show some effects of parallelisation. Ultimately, I settled on a comparison between a serial calculation and its parallel version, where one sees how parallelisation can help in some parts while leaving timings unaffected in others. I think this is still a relevant element to teach.
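For concreteness, the kind of comparison I have in mind could look roughly like the sketch below (synthetic data, made-up shapes and chunk sizes, not the episode's actual example):

```python
# Minimal sketch: a serial NumPy computation vs. its Dask-parallel counterpart.
import numpy as np
import dask.array as da
from timeit import default_timer as timer

shape = (4, 4096, 4096)  # hypothetical raster stack: 4 bands of 4096 x 4096 pixels

# Serial: the whole array is processed in memory by a single thread.
serial = np.random.random(shape)
start = timer()
serial_mean = serial.mean(axis=0)
print(f"serial: {timer() - start:.2f} s")

# Parallel: Dask splits the array into chunks and processes them concurrently.
parallel = da.random.random(shape, chunks=(1, 1024, 1024))
start = timer()
parallel_mean = parallel.mean(axis=0).compute()  # .compute() triggers execution
print(f"parallel: {timer() - start:.2f} s")
```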

I have also dropped the memory profiling, as it seemed a bit tricky to monitor and I thought the episode was already quite dense in content.

I might still add a final exercise on stackstac for participants to apply what they have learned with chunking/lazy execution in a slightly different context.
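Such an exercise could start from something along these lines (a minimal sketch; the STAC endpoint, collection, and search parameters are placeholders for illustration, not the episode's actual data):

```python
# Minimal sketch: stackstac builds a lazy, chunked xarray.DataArray from STAC items.
import pystac_client
import stackstac

# Hypothetical search against a public STAC API.
catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v1")
items = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[4.3, 52.0, 4.6, 52.2],
    datetime="2020-06-01/2020-06-30",
).item_collection()

# Nothing is downloaded here: the array is chunked and evaluated lazily,
# just like the rasters opened with chunks elsewhere in the episode.
stack = stackstac.stack(items, assets=["red", "nir"], chunksize=2048)
print(stack)
```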

rbavery commented 2 years ago

Thanks @fnattino , reviewing this now.

> I might still add a final exercise on stackstac for participants to apply what they have learned with chunking/lazy execution in a slightly different context.

A colleague of mine, @srmsoumya is working on a stackstac lesson #102 so maybe we can introduce stackstac there instead of including it in this episode? And then that stackstac episode could build off of this episode?

fnattino commented 2 years ago

Thanks a lot for the feedback @rbavery !

> A colleague of mine, @srmsoumya is working on a stackstac lesson https://github.com/carpentries-incubator/geospatial-python/pull/102 so maybe we can introduce stackstac there instead of including it in this episode? And then that stackstac episode could build off of this episode?

Great, sounds like a plan then!

fnattino commented 2 years ago

Thanks a lot @SarahAlidoost for the detailed feedback, I have included basically all your suggestions. Just let me know if you think the header questions are now clearer.

Also @rbavery, do you want to have a second look, or do you feel your points have been addressed? No rush at all, I just want to make sure I do not merge before you have time to have a final check!

rbavery commented 2 years ago

@fnattino I'm in favor of moving all the following info identified by @SarahAlidoost to a callout. I find it useful and important but agree it could distract from the main objectives of the lesson if included in the main text.

> we can skip the line which leads to chunks 72 MB large: (1 x 6144 x 6144) elements, 2 bytes per element (the data type is unsigned integer uint16), i.e., 6144 x 6144 x 2 / 2^20 = 72 MB

In particular, I found this calculation useful to show folks how array size substantially affects memory use.
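For reference, the quoted arithmetic can be checked directly (a quick sketch):

```python
# Verify the quoted chunk size: (1, 6144, 6144) uint16 elements, 2 bytes each.
import numpy as np

chunk_shape = (1, 6144, 6144)
itemsize = np.dtype("uint16").itemsize  # 2 bytes per element
size_mb = np.prod(chunk_shape) * itemsize / 2**20
print(f"{size_mb:.0f} MB")  # prints: 72 MB
```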

> we can skip this line By default, Dask parallelizes operations on the CPUs that are available on the same machine, but it can be configured to dispatch tasks on large compute clusters.

I think it's useful to highlight this in a callout. More campuses and other orgs are providing access to Dask on clusters, so it'd be good to let people know that Dask is an option that can stay with them even for jobs requiring clusters.
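To illustrate: the same Dask code runs locally or on a cluster, depending only on how the client is created (a minimal sketch; the scheduler address below is a placeholder, not a real endpoint):

```python
# Minimal sketch: switching Dask from local CPUs to a remote cluster.
import dask.array as da
from dask.distributed import Client

client = Client()  # local: one worker per CPU core on this machine
# client = Client("tcp://scheduler-address:8786")  # remote: placeholder cluster address

result = da.random.random((8192, 8192), chunks=(2048, 2048)).mean().compute()
client.close()
```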

> perhaps we can skip the explanations about Dask graph or move it to a callout as a piece of additional information.

I'm in favor of the callout. Looking at the complexity of the Dask task graph can be a good, quick proxy for understanding why Dask doesn't lead to performance boosts.
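For instance (a minimal sketch; .visualize() requires the graphviz package and writes the graph to an image file):

```python
# Minimal sketch: rendering the task graph of a lazy Dask computation.
import dask.array as da

arr = da.ones((4096, 4096), chunks=(1024, 1024))
result = (arr + 1).mean()  # lazy: nothing is computed yet
result.visualize(filename="task-graph.png")  # a denser graph hints at scheduling overhead
```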

rbavery commented 2 years ago

I think these items identified by @SarahAlidoost should all be moved to a callout, as they contain useful info:

> we can skip the line which leads to chunks 72 MB large: (1 x 6144 x 6144) elements, 2 bytes per element (the data type is unsigned integer uint16), i.e., 6144 x 6144 x 2 / 2^20 = 72 MB

This shows folks how array size affects memory usage and gives them a point of reference that can help when working with large rasters/arrays in memory.

> we can skip this line By default, Dask parallelizes operations on the CPUs that are available on the same machine, but it can be configured to dispatch tasks on large compute clusters.

This shows folks that Dask can still help them for cluster-sized jobs. More universities and orgs are offering Dask clusters via JupyterHubs or other services, so I think this is good to highlight.

> perhaps we can skip the explanations about Dask graph or move it to a callout as a piece of additional information.

I think the task graph can be a good, quick proxy for whether a set of tasks is too complex for Dask to provide a benefit, and it helps illustrate how Dask organizes lazy computation. I'm in favor of the callout.

In any case @fnattino, feel free to merge. Looks great!

fnattino commented 2 years ago

Thanks again for your feedback @SarahAlidoost and @rbavery !

rbavery commented 2 years ago

Whoops, looks like I provided duplicate feedback here; I thought the first comment didn't make it. Great work!