Closed: jrbourbeau closed this pull request 4 months ago
Playing now. Neat. Some thoughts!
It might make sense to arrange the data to make spatial access cheap.
I think the most common situation I've heard from people is "My satellite pumps out one file every day/hour, so it's organized by time, but I want it organized spatially, so that I can pick out a timeseries for a lat/lon pair really easily."
Maybe at the end we can open up the data with just zarr/xarray without Dask, and show that it's really cheap to get these timeseries, for example from a web application (which is what they all seem to want to do). I'm actually a little curious about sub-chunk access times. It may be that we want to store the zarr array with far finer chunking than Dask would want, so that we're not accessing a bunch of neighboring lat/lon pairs at once. Maybe Xarray does this by default, but maybe not. My hope is that we could show ~100ms access times for tiny timeseries.
Thoughts on combining this into the geospatial notebook? I can imagine that in many cases it'll be nice to go from one example to the next, and I wouldn't mind consolidating example notebooks a little.
Oh, I guess the rechunking isn't very impressive though, because it's mostly chunked in this way already ...
Maybe we stick with time-optimized then, but maybe some of the other feedback still holds?
@mrocklin you made some changes offline to this notebook -- want to push up those changes here, or to a different PR (whichever is easiest)?
I've merged your rechunk example to the xarray example.
Thanks @mrocklin -- I pushed up one minor update in https://github.com/coiled/examples/pull/49
This example reads in 1 TB worth of NVM data, rechunks it to be optimized for time selections, and then writes the rechunked dataset to S3 (in oss-scratch-space in us-east-1). cc @mrocklin. Happy to keep iterating