Closed: jrbourbeau closed this pull request 4 months ago
Playing now. Neat. Some thoughts!
It might make sense to arrange the data to make spatial access cheap.
I think the most common situation I've heard from people is "My satellite pumps out one file every day/hour, so it's organized by time, but I want it organized spatially, so that I can pick out a timeseries for a lat/lon pair really easily."
Maybe at the end we can open up the data with just zarr/xarray without Dask, and show that it's really cheap to get these timeseries, for example from a web application (which is what they all seem to want to do). I'm actually a little curious about sub-chunk access times. It may be that we want to store the zarr array with far finer chunking than Dask would want, so that we're not accessing a bunch of neighboring lat/lon pairs at once. Maybe Xarray does this by default, but maybe not. My hope is that we could show ~100ms access times for tiny timeseries.
Thoughts on combining this into the geospatial notebook? I can imagine that in many cases it'll be nice to go from one example to the next, and I wouldn't mind consolidating example notebooks a little.
Oh, I guess the rechunking isn't very impressive though, because it's mostly chunked in this way already ...
Maybe we stick with time-optimized then, but maybe some of the other feedback still holds?
@mrocklin you made some changes offline to this notebook -- want to push up those changes here, or to a different PR (whichever is easiest)?
I've merged your rechunk example to the xarray example.
Thanks @mrocklin -- I pushed up one minor update in https://github.com/coiled/examples/pull/49
This example reads in 1 TB worth of NVM data, rechunks it to be optimized for time selections, and then writes the rechunked dataset to S3 (in oss-scratch-space in us-east-1). cc @mrocklin. Happy to keep iterating