asascience-open / xarray-subset-grid

Subset Xarray datasets in space
BSD 3-Clause "New" or "Revised" License

CERA approach for chunking STOFS data #29

Open AtiehAlipour-NOAA opened 3 weeks ago

AtiehAlipour-NOAA commented 3 weeks ago

CERA uses a script to rechunk STOFS .nc files before visualization, which makes the visualization more efficient. Perhaps we could apply the same approach before subsetting STOFS data. The code is in a private repository, but I have access to it and permission to share it exclusively with the STOFS Subsetting Tool development team.

ChrisBarker-NOAA commented 3 weeks ago

It's on our "future" list to look into performance and chunking, so this is great.

The challenge, IIUC, is that to rechunk the data, you need to make a copy of it -- and that can be pretty expensive.

Potentially, the goal could be for STOFS (and other OFSs!) to be re-chunked before being uploaded to the NODD (or even in the original output).

The challenge with that is that an optimum chunking strategy is different depending on the use case, so there may not be a consensus on one "best" way to chunk the data.
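To make the trade-off concrete, here is a minimal sketch that counts how many chunks a read has to touch under two chunk layouts. The variable shape (240 time steps by 1,000,000 nodes), the chunk sizes, and the `chunks_touched` helper are all hypothetical and chosen only for illustration; they are not taken from STOFS or from the CERA code.

```python
def chunks_touched(chunks, selection):
    """Count the chunks overlapped by a rectangular read.

    chunks: chunk length along each dimension.
    selection: (start, stop) index range along each dimension, stop exclusive.
    """
    n = 1
    for chunk_len, (start, stop) in zip(chunks, selection):
        first = start // chunk_len
        last = (stop - 1) // chunk_len
        n *= last - first + 1
    return n


# Hypothetical (time, node) variable: 240 time steps, 1,000,000 nodes.
layouts = {
    "time-major": (1, 1_000_000),  # one time step, all nodes per chunk
    "node-major": (240, 10_000),   # all time steps, 10,000 nodes per chunk
}
reads = {
    "map": ((0, 1), (0, 1_000_000)),        # one time step, every node
    "series": ((0, 240), (123_456, 123_457)),  # every time step, one node
}

counts = {
    (layout, read): chunks_touched(chunks, selection)
    for layout, chunks in layouts.items()
    for read, selection in reads.items()
}

for (layout, read), n in counts.items():
    print(f"{layout:>10} | {read:<6} | {n} chunk(s)")
```

In this toy setup the time-major layout serves a full map from 1 chunk but needs 240 chunks for a single-node time series, while the node-major layout needs 100 chunks for the map and 1 for the series. Neither layout is best for both access patterns.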

Also -- for an unstructured grid, the ordering of the nodes can have a big impact -- does the CERA code reorder the nodes, in addition to re-chunking?
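The effect of node ordering can be sketched with a toy example. It assumes a 100 x 100 lattice of points standing in for an unstructured mesh, a chunk size of 500 nodes, and a hypothetical `chunks_hit` helper; none of this reflects the actual STOFS mesh or the CERA code. The same bounding-box subset is applied to a spatially coherent node order and to a shuffled one.

```python
import random


def chunks_hit(node_xy, bbox, chunk_size):
    """Number of distinct node-dimension chunks holding a node inside bbox."""
    xmin, ymin, xmax, ymax = bbox
    hit = set()
    for idx, (x, y) in enumerate(node_xy):
        if xmin <= x <= xmax and ymin <= y <= ymax:
            hit.add(idx // chunk_size)
    return len(hit)


# Row-major order: nodes that are close in y are also close in index.
nodes_sorted = [(x, y) for y in range(100) for x in range(100)]

# Same nodes, arbitrary order (seeded so the run is repeatable).
nodes_shuffled = nodes_sorted[:]
random.Random(0).shuffle(nodes_shuffled)

bbox = (0, 0, 99, 9)  # a band of 10 rows, i.e. 1,000 of the 10,000 nodes
chunk_size = 500      # 20 chunks along the node dimension

hit_sorted = chunks_hit(nodes_sorted, bbox, chunk_size)
hit_shuffled = chunks_hit(nodes_shuffled, bbox, chunk_size)
print("ordered nodes: ", hit_sorted, "chunks")
print("shuffled nodes:", hit_shuffled, "chunks")
```

With the ordered nodes the subset falls in 2 chunks, while the shuffled order scatters the same 1,000 nodes across many more. Row-major order happens to suit this particular box; a real mesh would more likely use something like a space-filling-curve ordering, which is an assumption here and not something discussed in the thread.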

AtiehAlipour-NOAA commented 3 weeks ago

I agree that copying the file might not be a good idea, but I thought it could be an option if working with STOFS data turns out to be slow. I also heard in that meeting that they transpose the dimensions of the files before chunking the data, but I couldn't confirm that from the code. I do not think they reorder the nodes. We might find some relevant material in the JRC code: https://github.com/asascience-open/xarray-subset-grid/issues/19

AtiehAlipour-NOAA commented 3 weeks ago

Here is a post about Rechunker, a relevant library that @SorooshMani-NOAA shared: https://medium.com/pangeo/rechunker-the-missing-link-for-chunked-array-analytics-5b2359e9dc11