carpentries-incubator / geospatial-python

Introduction to Geospatial Raster and Vector Data with Python
https://carpentries-incubator.github.io/geospatial-python/

Episode on data access and parallelization #86

Closed fnattino closed 2 years ago

fnattino commented 2 years ago

This is a work in progress to address #82.

@rbavery: I have added a notebook with a first sketch of what the episode on data access/parallelization could look like. Any feedback is more than welcome!

rbavery commented 2 years ago

This is looking awesome!

Accessing Data Episode

I like the Objectives of this lesson. I think we could split the objective 'Process satellite images in "chunk" to take advantage of parallelization' into its own lesson. This would give us a lesson focused on Data Access, which ends around this cell:

import rioxarray

# ... or we can open them directly (and stream content only when necessary)
blue_band_href = assets["B02"].href
blue_band = rioxarray.open_rasterio(blue_band_href)
blue_band

and a separate Parallelizing Raster Computation with Dask lesson.

I think the final cell for the Access Data Episode could save the raster to disk with rioxarray. This would involve reassigning the CRS to the mosaicked xarray DataArray we produced with stackstac and then using the .rio.to_raster method. We can borrow from this example my colleague @alexmandel worked on: https://github.com/PacificCommunity/DigitalEarthPacific/blob/demo/writeraster/notebooks/demo/cloudless-mosaic-sentinel2.ipynb

Parallelizing Raster Computation with Dask

I love that you already cover guidelines on how to set the chunk size! An additional topic to cover here could be how to tell if your code is running faster with dask or without dask. For this we could cover using time, the dask profiler, or some other easily accessible profiling tool in jupyter notebooks. I think we should also have a section describing dask's lazy computation mode and how to take advantage of that to inspect metadata prior to downloading the actual scene data.
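A minimal sketch of the simplest of these options, timing with the standard library. The helper `best_time` and the two placeholder functions are hypothetical names, not part of the notebook; in the episode the two callables would wrap the same raster computation run with and without Dask.

```python
import time


def best_time(func, repeats=3):
    """Return the fastest wall-clock time (in seconds) over several runs of func()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Hypothetical placeholders for the two variants being compared. In a real
# comparison, each would run the same raster computation, once eagerly and
# once through Dask (ending in .compute() so the work is actually executed).
def variant_a():
    return sum(i * i for i in range(200_000))


def variant_b():
    return sum(i * i for i in range(200_000))


baseline = best_time(variant_a)
candidate = best_time(variant_b)
print(f"variant A: {baseline:.4f} s, variant B: {candidate:.4f} s")
```

Taking the minimum over a few repeats reduces noise from other processes. With lazy computation, the timed function has to trigger the actual work, otherwise only the construction of the task graph is measured.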

For the Raster calculations portion, instead of Raster calculations using stackstac, I suggest we show how to mosaic a collection of scenes. There's stackstac's internal method, which just flattens: https://stackstac.readthedocs.io/en/latest/api/main/stackstac.mosaic.html#stackstac.mosaic

I think it would be valuable to show both that solution and a median composite.
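To make the difference between the two approaches concrete, here is a small sketch in plain NumPy rather than stackstac. It assumes that "flattening" means keeping the first valid (non-NaN) pixel along the time axis; the array values are made up for illustration.

```python
import numpy as np

# Hypothetical stack of three 2x2 scenes with shape (time, y, x);
# NaN marks missing pixels (e.g. masked clouds or no coverage).
stack = np.array([
    [[np.nan, 2.0], [3.0, np.nan]],
    [[1.0, np.nan], [5.0, np.nan]],
    [[7.0, 8.0], [9.0, 4.0]],
])

# "Flattening" mosaic: for each pixel, keep the first non-NaN value along time.
valid = ~np.isnan(stack)
first_valid = valid.argmax(axis=0)  # index of the first valid scene per pixel
# Pixels with no valid scene get index 0 and therefore stay NaN.
mosaic = np.take_along_axis(stack, first_valid[np.newaxis], axis=0)[0]

# Median composite: per-pixel median over time, ignoring NaN.
median = np.nanmedian(stack, axis=0)

print(mosaic)  # [[1. 2.] [3. 4.]]
print(median)  # [[4. 5.] [5. 4.]]
```

The mosaic depends on the order of the scenes, while the median composite does not, which is one reason to show both.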

Setup instructions will also need to be updated with new dependencies. I've seen the most success with not pinning specific versions to allow a more flexible solve for different machines: https://carpentries-incubator.github.io/geospatial-python/setup.html

A third episode focused on working with a cool looking mosaic could focus on xarray-spatial's raster calc funcs. One idea: computing spectral indices, thresholding them, and polygonizing the result (maybe areas with especially high NDVI): https://github.com/makepath/xarray-spatial
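A rough sketch of the spectral index and thresholding steps, written in plain NumPy rather than xarray-spatial. The reflectance values and the 0.6 threshold are arbitrary illustrative choices, and the polygonizing step is left out.

```python
import numpy as np

# Hypothetical red and near-infrared reflectance values for a 2x2 tile.
red = np.array([[0.10, 0.30], [0.05, 0.20]])
nir = np.array([[0.50, 0.30], [0.45, 0.25]])

# NDVI = (NIR - red) / (NIR + red), ranging from -1 to 1.
ndvi = (nir - red) / (nir + red)

# Boolean mask of pixels with especially high NDVI (threshold is an assumption).
high_ndvi = ndvi > 0.6

print(high_ndvi)  # [[ True False] [ True False]]
```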

rbavery commented 2 years ago

I also like the inclusion of the Dask task graph image. Other images of intermediate results, such as plots of the blue band, would be good to include before the final challenge. Also, when this gets formatted to the lesson markdown, I think we can create a set of tooltips that point to other sources where folks can read up on COG, STAC, and Dask, while also briefly summarizing their utility for geospatial work.

fnattino commented 2 years ago

Hi @rbavery, I have created a first version of a full data access episode. Basically, I have converted the Jupyter notebook that you already had a look at into a .md file and added some explanatory text in between the code blocks. Whenever you have time to review it, I would be happy to have any kind of feedback - thanks in advance!

I have also added a first exercise following up on your idea of having participants explore a STAC catalog even before the search tool is introduced - what do you think about formulating it this way?

Still working on the second episode (on parallel raster computation with Dask).

rbavery commented 2 years ago

@fnattino thanks, I'll give this a review this evening.

rbavery commented 2 years ago

@fnattino thanks for addressing these reviews! Once this data access episode is finished, can we merge that PR and finish the parallelization episode in a separate PR? Feel free to merge this as is now; I or somebody else could add a challenge later, unless you are already working on it.

fnattino commented 2 years ago

Hi @rbavery - thanks a lot for already having a look. I am finishing up the last challenge, I'll ping you as soon as I have pushed it!

fnattino commented 2 years ago

Hi @rbavery, this is it - I have added the final challenge.

I have also updated the setup instructions and the environment.yaml file, adding pystac_client to the dependencies.

Merging this first and opening a second PR for the parallelisation episode sounds good - I have removed the corresponding notebook from this branch.

One last thing: should this become episode 19? I could set the number and merge if this is alright with you. Thanks a lot, really, for all the feedback and suggestions!

rbavery commented 2 years ago

Fantastic!!! Yes, let's make this episode 19 for now. Really looking forward to teaching this! LGTM, feel free to merge.