ZGIS / semantique

Semantic Querying in Earth Observation Data Cubes
https://zgis.github.io/semantique/
Apache License 2.0

feat: Add STACCube :gift: #38

Closed fkroeber closed 5 months ago

fkroeber commented 7 months ago

Description

This PR adds the STACCube class to datacube.py. It targets ad-hoc data cubes built from the results of a STAC metadata search. Unlike the Opendatacube and the Geotiffarchive, it doesn't require the data to be organised in advance (e.g. ingested into a database, as for the Opendatacube, or stored as a temporally stacked GeoTIFF, as for the Geotiffarchive). Instead, the STACCube contains a retriever that knows how to fetch the assets linked in STAC search results into a data cube. Usage is demonstrated in the datacube.ipynb notebook.
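
To make the idea of "a retriever that fetches STAC assets into a cube" concrete, here is a minimal, self-contained sketch. It is not the STACCube implementation: the item dictionaries, the `fetch_asset` stub and the `build_cube` helper are hypothetical names, and pixel values are faked rather than read from the asset URLs.

```python
from collections import defaultdict

# Hypothetical, simplified stand-ins for STAC search results: each item has a
# timestamp and a mapping from asset (band) name to an href.
items = [
    {"datetime": "2020-07-06T10:20:29Z",
     "assets": {"B04": "https://example.com/b_B04.tif",
                "B08": "https://example.com/b_B08.tif"}},
    {"datetime": "2020-07-01T10:20:31Z",
     "assets": {"B04": "https://example.com/a_B04.tif",
                "B08": "https://example.com/a_B08.tif"}},
]


def fetch_asset(href):
    """Stub retriever: a real one would read raster data from href."""
    # Fake a 2x2 raster whose values depend on the href length.
    return [[len(href), 0], [0, len(href)]]


def build_cube(items, bands):
    """Arrange fetched assets as cube[band][datetime] -> 2D raster."""
    cube = defaultdict(dict)
    # Sort by timestamp so the time dimension is in chronological order.
    for item in sorted(items, key=lambda it: it["datetime"]):
        for band in bands:
            cube[band][item["datetime"]] = fetch_asset(item["assets"][band])
    return dict(cube)


cube = build_cube(items, bands=["B04", "B08"])
print(sorted(cube))       # band names in the cube
print(list(cube["B04"]))  # time steps available for band B04
```

The point of the sketch is only that nothing has to be ingested or stacked beforehand: the search results alone determine what ends up in the cube.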

luukvdmeer commented 7 months ago

This looks like a very useful addition, thanks! @loreabad6 can I add you as a reviewer?

fkroeber commented 7 months ago

The modifications are performance and functionality improvements, realised via:

  1. code refactoring: pulling the data from the STAC-linked data source first, prior to daily resampling (group_by_solar_day)
  2. introducing dtype parameters to allow data retrieval as integers instead of floats
  3. adding dask_params to enable control over data retrieval (handled internally via stackstac library)
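
As a rough illustration of the grouping step in point 1 only, the sketch below assigns each UTC timestamp to a solar day by shifting it by a longitude-derived offset (15 degrees of longitude per hour). The function names and sample timestamps are assumptions, and the actual group_by_solar_day logic in this PR may differ.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def solar_day(utc_time, longitude):
    """Return the local solar date for a UTC timestamp at a given longitude."""
    # The sun covers 360 degrees in 24 hours, i.e. 240 seconds per degree.
    offset = timedelta(seconds=longitude * 240)
    return (utc_time + offset).date()


def group_by_solar_day(timestamps, longitude):
    """Map each solar day to the list of acquisitions that fall on it."""
    groups = defaultdict(list)
    for ts in timestamps:
        groups[solar_day(ts, longitude)].append(ts)
    return dict(groups)


# Hypothetical acquisition times; the first two straddle UTC midnight.
acquisitions = [
    datetime(2020, 7, 1, 23, 50, tzinfo=timezone.utc),
    datetime(2020, 7, 2, 0, 5, tzinfo=timezone.utc),
    datetime(2020, 7, 3, 0, 1, tzinfo=timezone.utc),
]
groups = group_by_solar_day(acquisitions, longitude=150.0)
for day, members in sorted(groups.items()):
    print(day, len(members))
```

Grouping by UTC date would split the first two acquisitions across two days; grouping by solar day keeps them together.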

loreabad6 commented 7 months ago

> This looks like a very useful addition, thanks! @loreabad6 can I add you as a reviewer?

Sure, I started checking and will continue tomorrow

fkroeber commented 7 months ago

Thanks @loreabad6 for taking the time to review the changes! To respond to some of the comments and open questions:

> 1. introducing dtype parameters to allow data retrieval as integers instead of floats
>
> The integer handling is useful for using the SLC, but, from my tests, when working with the reflectance values we need to load as float, otherwise it does not work. I am a bit confused about that since the documented parameter says it will change it to float... in any case, does it then make sense to support int types?

Yes, when working with reflectance values, floats are necessary, and not just for the recipe evaluation but already when fetching the data via STACCube. So in this case, the default parameter (float) should not be changed to int (maybe I should make this clearer in the documentation). The integer handling is really just there to speed up data fetching for layers that are natively in integer format.
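
A small NumPy sketch of the trade-off, using made-up arrays rather than anything fetched through STACCube. The first half shows why keeping a natively integer layer as integers is cheaper. The second half shows one generic reason fractional data cannot be held as integers; whether that is the exact cause of the failure reported above is an assumption.

```python
import numpy as np

# Hypothetical categorical layer, natively stored as 8-bit integers.
classes = np.array([[4, 4], [5, 9]], dtype="uint8")
as_float = classes.astype("float64")
print(classes.nbytes, as_float.nbytes)  # memory needed for each dtype

# Hypothetical reflectance values with NaN marking missing data.
reflectance = np.array([0.0312, 0.4870, np.nan])

# Integers can hold neither NaN nor fractions: after replacing NaN by 0 and
# casting, all information in these values is lost.
truncated = np.nan_to_num(reflectance).astype("int16")
print(truncated)
```

So an integer dtype only makes sense for layers whose values are whole numbers to begin with, such as classification layers.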

> 1. adding dask_params to enable control over data retrieval (handled internally via stackstac library)
>
> Out of curiosity, xarray handles dask internally, right? Have you experimented with how semantique handles the chunking and data loading? Is this just passed on to xarray, or does semantique request all the data anyway, making the chunking in dask irrelevant?

Currently, semantique doesn't support dask parallelisation. The dask_params introduced for the STACCube really only refer to the data fetching part. Everything else (i.e. the main part of recipe execution) is not parallelised yet.

As discussed, dask offers two main ways of parallelisation: low-level xarray parallelisation (what you are referring to) and high-level wrapper functionality (which allows parallelising any function, similar to how map functions work). Low-level parallelisation requires the xarray functions to support dask, which is currently not consistently ensured in semantique. I am currently working on a solution that provides at least a workaround via high-level parallelisation, to enhance the scalability of semantique. But more on this in a different PR ;)
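
To illustrate what is meant by the high-level, map-style approach, here is a minimal sketch that uses the standard library's `concurrent.futures` as a stand-in for dask. It is not how semantique does (or will do) it: `split_into_tiles` and `process_tile` are hypothetical helpers, and the per-tile work is a placeholder for running a recipe on a spatial subset.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_tiles(bounds, n):
    """Split (xmin, ymin, xmax, ymax) into n vertical strips."""
    xmin, ymin, xmax, ymax = bounds
    width = (xmax - xmin) / n
    return [(xmin + i * width, ymin, xmin + (i + 1) * width, ymax)
            for i in range(n)]


def process_tile(tile):
    """Placeholder for executing a full recipe on one tile."""
    xmin, ymin, xmax, ymax = tile
    return (xmax - xmin) * (ymax - ymin)  # here: just the tile area


tiles = split_into_tiles((0.0, 0.0, 100.0, 50.0), n=4)

# Map the same function over all tiles in parallel; results keep tile order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_tile, tiles))

total = sum(results)
print(len(tiles), total)
```

The appeal of this pattern is that the function being mapped does not need to be dask-aware itself; the cost is that the split and the merge of results have to be handled explicitly.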

fkroeber commented 7 months ago

The minor fix regarding pixel alignment addresses an issue with stackstac. With the modified parameters, the retrieved data should now have exactly the same bounds as those provided by the SpatialExtent object.
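
For readers unfamiliar with this kind of mismatch, the sketch below shows one generic way in which the bounds of retrieved data can differ from the requested ones: snapping the bounds outward to whole multiples of the pixel resolution. That this is the mechanism behind the stackstac issue is an assumption, and `snap_bounds` is a hypothetical helper written for this example, not a call into stackstac.

```python
import math


def snap_bounds(bounds, resolution):
    """Expand (xmin, ymin, xmax, ymax) outward to multiples of resolution."""
    xmin, ymin, xmax, ymax = bounds
    return (
        math.floor(xmin / resolution) * resolution,
        math.floor(ymin / resolution) * resolution,
        math.ceil(xmax / resolution) * resolution,
        math.ceil(ymax / resolution) * resolution,
    )


# Hypothetical extent in projected coordinates, not aligned to a 10 m grid.
requested = (4530005.0, 2696003.0, 4531005.0, 2697003.0)
snapped = snap_bounds(requested, resolution=10)
print(requested)
print(snapped)
```

With snapping, the returned extent is slightly larger than the requested one; without it, the data bounds match the requested extent exactly, which is the behaviour described above.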