friedrichknuth / gtsa

Methods to stack geospatial rasters and run memory-efficient computations.
MIT License

Taking stock of existing efforts and reflecting on directions #1

Open rhugonnet opened 1 year ago

rhugonnet commented 1 year ago

Below are existing efforts I found that could be useful for discussing and defining a clear objective for GTSA, and for building its core structure during the Hackweek:

The obvious dependencies that are now more stable:

Apart from GeoWombat's Time Series section, I don't see anything that does what GTSA currently does (scalable spatiotemporal prediction). GeoWombat is also the only package providing an interface to ingest raster data, chunk it, and process it. Its limitation is having to maintain all of these aspects at once in a single package, whereas GTSA can leave ingestion, chunking, and vector operations to Rioxarray and Geocube for the most part, and focus on making it easier to apply scalable methods on the processing side. I really like their approach of allowing any PyTorch or other algorithm to be passed; we should probably aim for something similar.

So, in terms of package objectives, I see two core aspects:

  1. Provide routines built on top of Rioxarray to create temporal raster stacks, possibly with multiple variables, in an out-of-memory fashion from a list of rasters with various extents/projections/dates (most of the heavy work is done by Rioxarray). One main issue I see is that rasters don't natively have dates in their metadata, so GTSA would need a generic interface for that (an Xarray accessor that reads dates from most auxiliary files/filenames would be super useful; we want to copy the functionality from the SatelliteImage class in GeoUtils: https://geoutils.readthedocs.io/en/latest/satimg_class.html, but that will take a while). A sketch of such a filename-date interface follows this list.
  2. Provide routines to perform spatiotemporal, error-aware prediction in a scalable manner, using already-implemented methods from SciPy, PyTorch, etc. wherever possible. Here again, creating routines that specifically support scalable algorithms can be a challenge. For fitting GPs, methods exist for this, such as Batch GPs: https://docs.gpytorch.ai/en/stable/examples/08_Advanced_Usage/index.html#batch-gps. For applying GPs, one only needs chunks at the scale of the covariance; beyond that, points are independent! But this is not natively supported in most packages, so we'd have to write it. Another challenge would be to provide "error-aware" methods wherever possible, i.e. mostly methods that understand and propagate uncertainties in the prediction (see the second sketch after this list). That would later allow relying on ObsArray (or similar) to use the predicted datasets at different scales! (The GP, OLS, and WLS in pyddem all have this.)
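To make item 1 concrete, here is a minimal sketch of a filename-date interface plus lazy stacking, assuming an ISO date is embedded in each filename and that the rasters already share a common grid (names like `parse_date_from_filename` are hypothetical, not an existing API):

```python
import re
from datetime import datetime

import rioxarray
import xarray as xr

def parse_date_from_filename(path):
    # Assumes a YYYY-MM-DD date is embedded in the filename,
    # e.g. "dem_2015-06-01.tif".
    match = re.search(r"\d{4}-\d{2}-\d{2}", path)
    return datetime.strptime(match.group(), "%Y-%m-%d")

def open_rasterstack(list_raster_files, chunks={"x": 512, "y": 512}):
    # Open each raster lazily and concatenate along a new "time" dimension.
    # Reprojection/alignment of heterogeneous extents would be handled by
    # Rioxarray beforehand; this sketch assumes a shared grid.
    arrays = []
    for f in list_raster_files:
        da = rioxarray.open_rasterio(f, chunks=chunks).squeeze("band", drop=True)
        da = da.expand_dims(time=[parse_date_from_filename(f)])
        arrays.append(da)
    return xr.concat(arrays, dim="time").sortby("time")
```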
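And for the "error-aware" part of item 2, a single-pixel sketch with scikit-learn (illustrative only; pyddem and GPyTorch do this differently and at scale). Per-observation measurement variances enter through `alpha`, so they propagate into the predictive uncertainty:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical elevation time series for one pixel.
t_obs = np.array([[2013.5], [2015.1], [2017.8], [2020.2]])  # decimal years
z_obs = np.array([102.0, 99.5, 97.2, 93.8])                 # elevation (m)
z_err = np.array([1.5, 0.8, 0.8, 2.0])                      # 1-sigma error (m)

kernel = RBF(length_scale=3.0) + WhiteKernel(noise_level=0.25)
# alpha accepts per-observation noise variances, making the fit error-aware.
gp = GaussianProcessRegressor(kernel=kernel, alpha=z_err**2, normalize_y=True)
gp.fit(t_obs, z_obs)

t_pred = np.linspace(2013, 2021, 50).reshape(-1, 1)
z_pred, z_std = gp.predict(t_pred, return_std=True)  # mean and 1-sigma
```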

In terms of ideal code structure, I'm not sure what is best... Definitely not a class-based object. I feel an Xarray accessor could work quite nicely, but we'd need to grasp all the implications for out-of-memory ops. For instance:

```python
import gtsa

# The package itself would only be called to open the list of files
# and stack them out-of-memory at a certain disk location
ds = gtsa.open_rasterstack(list_raster_files=..., zarr_file=...)
# (or this could be several functions if needed: define different tiling
# types? areas with different projections?)

# Then the Xarray accessor would do everything else.
# For example, define additional Xarray attributes to ensure the time/space
# units are known, or to store the covariance of the data in space and time
# (based on ObsArray, maybe, if it takes off)
ds.gtsa.time_unit
ds.gtsa.space_unit

# For prediction: have a fit/apply function that returns predicted values
# at new spatiotemporal locations
ds_pred = ds.gtsa.predict(method=..., time_pred=..., x_pred=...)
ds_pred.to_zarr(zarr_file=...)
```
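As a proof of concept, here is a minimal sketch of how such an accessor could be registered (the attribute and method names follow the example above and are assumptions, not an existing API):

```python
import xarray as xr

@xr.register_dataset_accessor("gtsa")
class GTSAAccessor:
    """Hypothetical accessor exposing GTSA functionality as ds.gtsa.*"""

    def __init__(self, xarray_obj):
        self._obj = xarray_obj  # works on lazy, dask-backed datasets too

    @property
    def time_unit(self):
        # Stored as a plain dataset attribute; the key name is an assumption.
        return self._obj.attrs.get("time_unit")

    def predict(self, method=None, time_pred=None, x_pred=None):
        # Would dispatch to a scalable fit/apply routine (SciPy, PyTorch, ...).
        raise NotImplementedError
```

Since the accessor only wraps the Dataset object, it should stay compatible with out-of-memory (dask-backed) data as long as the methods themselves operate lazily.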

Do you think that would work (even out-of-memory)?

That's all I've got for now 😛!

friedrichknuth commented 7 months ago

I think this could work! 🙌 🙂

For context, here is a reference to our efforts exploring the Gaussian Process regression landscape during the 2023 GeoSMART Hackweek. Future directions and possible development of the GTSA framework remain open for discussion.

...

What you describe with gtsa.open_rasterstack is essentially what is done with this create_stack utility, so further generalizing this to a function is very feasible. Once the zarr stack is created and chunked in a way that is optimal for time series regression, parallel out-of-memory computations are seamless and fast.
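For reference, the chunking idea in code form (a sketch, reusing the hypothetical `open_rasterstack` from above; names and chunk sizes are illustrative, not the actual `create_stack` implementation):

```python
# Chunk with the full time axis in each chunk and tiles in space, so a
# per-pixel time series regression reads exactly one chunk.
stack = open_rasterstack(list_raster_files)             # lazy, dask-backed
stack = stack.chunk({"time": -1, "y": 512, "x": 512})   # -1 = whole dimension
stack.to_dataset(name="dem").to_zarr("stack.zarr", mode="w")
```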

Adding a tiling scheme to handle larger areas that span different projections would be a great enhancement! Perhaps Kerchunk might be of some assistance here. Alternatively, if there were a way to compress nodata regions in a zarr stack and perhaps reproject on the fly without incurring significant computational cost, it might be possible to create a single Zarr "file". I suspect this will be costly, but it might be worth testing. Another alternative might be creating a single Zarr file that stores the CRS-specific data regions as different variables/coordinates/dimensions within the xarray dataset container object... but that could feel messy and should ideally be handled under the hood somehow. The advantage is that dask can efficiently map out its task graph when pointed to a single collection of data objects... Or we just save the tiles as separate zarr files and go from there 🙂

We can definitely store and access metadata and additional variables within the xarray dataset container structure. We just need to decide what we want to preserve/add/define. Maybe we can look to the Climate and Forecast (CF) Conventions or other references for best practices.
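For example, a few CF-style attributes could look like this (the variable name "dem" and the chosen values are illustrative; standard_name comes from the CF standard-name table):

```python
ds = stack.to_dataset(name="dem")
ds.attrs["Conventions"] = "CF-1.10"
ds["dem"].attrs.update({"units": "m", "standard_name": "surface_altitude"})
ds["time"].attrs["long_name"] = "acquisition date"
```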

In terms of running different computations or predictions along the temporal axis, this is currently done by the gtsa.py utility. Some of this could be generalized into a function like what you describe with ds.gtsa.predict. Computation won't happen until the result is called upon to display values, generate a plot, write to disk, etc., for example here.

Finally, I think we can leverage the existing xr.apply_ufunc to apply any function along desired dimensions, but perhaps GeoWombat does this differently and more efficiently? The xr.apply_ufunc API can certainly be a bit finicky, so I'm happy to explore alternatives!
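For illustration, here is the xr.apply_ufunc pattern for a per-pixel linear trend along time (a sketch; the variable name "dem" is an assumption, and the stack must hold the full time axis in a single chunk for core-dimension operations):

```python
import numpy as np
import xarray as xr

def linear_trend(y, t):
    # Fit y = a*t + b for one pixel's time series; return the slope a.
    mask = np.isfinite(y)
    if mask.sum() < 2:
        return np.nan
    a, _ = np.polyfit(t[mask], y[mask], 1)
    return a

# Decimal years from the datetime64 "time" coordinate.
t = ds["time"].dt.year + ds["time"].dt.dayofyear / 365.25

trend = xr.apply_ufunc(
    linear_trend,
    ds["dem"],
    t,
    input_core_dims=[["time"], ["time"]],  # consume the time axis
    vectorize=True,           # loop the 1-D function over all pixels
    dask="parallelized",      # one task per spatial chunk, fully lazy
    output_dtypes=[float],
)
trend.compute()  # nothing runs until here (or .plot(), .to_zarr(), ...)
```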