rhugonnet opened 1 year ago
I think this could work! 🙌 🙂
For context, here is a reference to our efforts exploring the Gaussian Process regression landscape during the 2023 GeoSMART Hackweek. Future directions and possible development of the GTSA framework remain open for discussion.
...
What you describe with gtsa.open_rasterstack is essentially what the existing create_stack utility does, so generalizing it into a function is very feasible. Once the Zarr stack is created and chunked in a way that is optimal for time series regression, parallel out-of-memory computations are seamless and fast.
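Roughly, the kind of chunking I mean looks like this. This is just a sketch (build_stack and the variable name "z" are placeholders, not the GTSA API), but it shows the key layout choice: keep the whole time axis in one chunk so each Dask task sees a full per-pixel time series.

```python
import numpy as np
import pandas as pd
import xarray as xr

def build_stack(arrays, times):
    """Stack 2-D rasters (y, x) along a new time dimension and rechunk so
    each Dask task holds a full time series for a small spatial tile --
    the layout that makes per-pixel temporal regression cheap."""
    da = xr.concat(
        [xr.DataArray(a, dims=("y", "x")) for a in arrays],
        dim=pd.Index(times, name="time"),
    )
    ds = da.to_dataset(name="z")
    # time: -1 keeps the entire temporal axis in a single chunk per tile
    return ds.chunk({"time": -1, "y": 256, "x": 256})
```

The result could then be persisted once with ds.to_zarr(path) and reopened lazily with xr.open_zarr(path) for all downstream computations.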
Adding a tiling scheme to handle larger areas that span different projections would be a great enhancement! A few options come to mind:
- Kerchunk might be of some assistance here.
- If there is a way to compress nodata regions in a Zarr stack, and perhaps reproject on the fly without incurring significant cost in computation time, it might be possible to create a single Zarr "file". I am skeptical that this won't be costly, but it might be worth testing.
- A single Zarr file could store the CRS-specific data regions as different variables / coordinates / dimensions within the xarray Dataset container object... but that could feel messy and should ideally be handled under the hood somehow. The advantage is that Dask can efficiently map out its task graph when pointed to a single collection of data objects.
- Or we just save the tiles as separate Zarr files and go from there 🙂
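For the simplest option, a minimal sketch of grouping tiles by projection, one dataset (and eventually one Zarr store) per CRS; the EPSG codes and the "z" variable are illustrative only:

```python
import numpy as np
import xarray as xr

# One dataset per projection; each would become its own Zarr store.
tiles = {
    "EPSG:32606": xr.Dataset({"z": (("y", "x"), np.zeros((2, 2)))}),
    "EPSG:32607": xr.Dataset({"z": (("y", "x"), np.ones((2, 2)))}),
}
for crs, ds in tiles.items():
    ds.attrs["crs"] = crs  # record the projection on each store
    # ds.to_zarr(f"stack_{crs.replace(':', '_')}.zarr")  # one store per CRS
```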
We can definitely store and access metadata and additional variables within the xarray Dataset container structure. We just need to decide what we want to preserve / add / define. Maybe we can look to the Climate and Forecast (CF) Conventions or other references for best practices.
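For instance, CF-style attributes attach naturally to xarray objects; which fields GTSA should standardize is exactly what we'd need to decide (the names below follow CF conventions, but the selection is just an example):

```python
import numpy as np
import xarray as xr

# Toy stack standing in for the real elevation time series.
ds = xr.Dataset(
    {"z": (("time", "y", "x"), np.zeros((2, 2, 2)))},
    coords={"time": np.array(["2000-01-01", "2001-01-01"], dtype="datetime64[ns]")},
)
# Variable-level metadata, using CF attribute names.
ds["z"].attrs.update({
    "standard_name": "surface_altitude",
    "units": "m",
})
# Dataset-level metadata.
ds.attrs.update({"Conventions": "CF-1.8", "title": "GTSA elevation stack"})
```

Attributes written this way survive a round trip through to_zarr / open_zarr, so whatever we define once would be preserved in the stack.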
In terms of running different computations or predictions along the temporal axis, this is currently done by the gtsa.py utility. Some of this could be generalized into a function like what you describe with ds.gtsa.predict. Computation won't happen until the result is called upon to display values, generate a plot, write to disk, etc., for example here.
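That deferred-execution behavior is just Dask laziness; a tiny illustration (the "z" variable and the temporal mean are stand-ins for a real prediction):

```python
import numpy as np
import xarray as xr

# Dask-backed toy stack: chunking turns every operation into a task graph.
ds = xr.Dataset(
    {"z": (("time", "y", "x"), np.arange(24.0).reshape(2, 3, 4))}
).chunk({"time": -1})

trend = ds["z"].mean(dim="time")  # lazy: no arithmetic has run yet
result = trend.compute()          # the graph only executes here
```

A .plot() call or ds.to_zarr(...) would trigger the same computation implicitly.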
Finally, I think we can leverage the existing xr.apply_ufunc to apply any function along the desired dimensions, but perhaps GeoWombat does this differently and more efficiently? The xr.apply_ufunc API can certainly be a bit finicky, so I'm happy to explore alternatives!
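To make the finicky parts concrete, here is roughly what applying a per-pixel temporal fit via xr.apply_ufunc looks like (the linear_slope function is just an example statistic, not anything GTSA ships):

```python
import numpy as np
import xarray as xr

def linear_slope(y, t):
    """Least-squares slope of y against t along the last axis (vectorized)."""
    t = t - t.mean()
    return (y * t).sum(axis=-1) / (t ** 2).sum()

# Toy stack: z grows by 4 per time step at every pixel.
ds = xr.Dataset(
    {"z": (("time", "y", "x"), np.arange(16.0).reshape(4, 2, 2))},
    coords={"time": np.arange(4.0)},
).chunk({"time": -1})

# input_core_dims moves "time" to the last axis of each input;
# dask="parallelized" maps the function chunk-by-chunk, out of memory.
slope = xr.apply_ufunc(
    linear_slope, ds["z"], ds["time"],
    input_core_dims=[["time"], ["time"]],
    dask="parallelized",
    output_dtypes=[float],
)
```

The input_core_dims / output_dtypes bookkeeping is exactly the part that gets finicky for more complex signatures (e.g. multiple outputs or new dimensions).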
Below are existing efforts I found that could be useful for discussing and defining a clear objective for GTSA, and for building its core structure during the Hackweek:
The obvious dependencies that are now more stable:
Apart from GeoWombat's Time Series section, I don't see anything that does what GTSA currently does (scalable spatiotemporal prediction). GeoWombat is also the only one providing an interface to ingest raster data, chunk it, and process it. The limitation is that they have to maintain all of these aspects at once in a single package, while GTSA can leave the ingestion, chunking, and vector operations to Rioxarray + Geocube for the most part, and focus on making it easier to apply scalable methods on the processing side. I really like their approach of allowing any PyTorch or other algorithm to be passed; we should probably aim for something similar.
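In that spirit, the entry point could accept any user-supplied estimator as a callable. A rough sketch (fit_predict and its signature are hypothetical, not an existing GTSA or GeoWombat API; a real version would dispatch per chunk rather than loop over pixels):

```python
import numpy as np

def fit_predict(stack, t_obs, t_pred, model):
    """Apply `model(t_obs, series, t_pred)` to every pixel of a
    (time, y, x) stack; returns an array of shape (len(t_pred), y, x)."""
    _, ny, nx = stack.shape
    out = np.empty((len(t_pred), ny, nx))
    for i in range(ny):
        for j in range(nx):
            out[:, i, j] = model(t_obs, stack[:, i, j], t_pred)
    return out

def linear_model(t, y, t_new):
    """Example estimator: ordinary least-squares line fit."""
    slope, intercept = np.polyfit(t, y, 1)
    return slope * np.asarray(t_new) + intercept
```

The same slot could take a scikit-learn regressor, a GP, or a PyTorch model wrapped in the expected (t_obs, series, t_pred) signature.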
So, in terms of package objectives, I see two core aspects:
SatelliteImage class in GeoUtils: https://geoutils.readthedocs.io/en/latest/satimg_class.html, but it'll take a while).

In terms of ideal code structure: I'm not sure what is best... Definitely not a class-based object. I feel that an Xarray accessor could maybe work quite nicely? But we'd need to grasp all the implications for out-of-memory ops. For instance:
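Something like this, roughly (the "gtsa" accessor name follows the ds.gtsa.predict idea above; the body is a placeholder, here just a temporal mean standing in for a real regression):

```python
import numpy as np
import xarray as xr

@xr.register_dataset_accessor("gtsa")
class GTSAAccessor:
    """Namespace attached to every Dataset as `ds.gtsa`."""

    def __init__(self, ds):
        self._ds = ds

    def predict(self, dim="time"):
        # Placeholder for a real regression: a temporal mean. On a
        # Dask-backed dataset this stays lazy, so the accessor itself
        # never triggers computation.
        return self._ds.mean(dim=dim)
```

Since the accessor only builds on existing xarray operations, ds.gtsa.predict() on a chunked dataset returns a lazy result that computes only when asked to.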
Do you think that would work (even out-of-memory)?
That's all I've got for now 😛!