Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0

Manage technical debt of load_stac #612

Open soxofaan opened 11 months ago

soxofaan commented 11 months ago

load_stac is quite a challenging process (because of the variability in how actual STAC resources can be structured). Moreover, it is becoming an important cornerstone for quite a few use cases and features (cross-backend execution, test suites, ...).

Our load_stac implementation has grown organically, but it's becoming a bit challenging to fine-tune it further. It's currently a single GeoPySparkBackendImplementation method nearing 400 LOC. Writing (unit) tests for it is a high-friction endeavor because you can only check the whole pipeline: have a functional API running, submit load_stac with a functional URL and check the resulting GeoTIFF/NetCDF.

I'd propose to break things down a bit, so that smaller aspects/phases of load_stac can be tested and managed more easily (without having to juggle functional dummy APIs and inspect NetCDF data). I think it makes sense to create a new subpackage "namespace", e.g. openeogeotrellis.processes, where we can have process-specific submodules that group helpers for non-trivial processes like load_stac (and load_collection too). Pulling implementation parts out of a GeoPySparkBackendImplementation or GeopysparkDataCube context should simplify the setup work necessary for testing.
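
To make the idea a bit more concrete, here is a rough sketch of what one such helper could look like. Everything in it is hypothetical and not taken from the current code base: the module location (something like openeogeotrellis/processes/load_stac.py), the AssetRef class and the asset_band_names/select_assets functions are made-up names, and it assumes STAC items are available as plain dicts with band names under the EO extension's "eo:bands" field. The point is only that a phase like "pick the assets matching the requested bands" can be a pure function that is testable without a running API or Spark context.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional, Tuple


@dataclass(frozen=True)
class AssetRef:
    """Hypothetical, minimal description of one STAC asset to load."""

    item_id: str
    asset_key: str
    href: str
    bands: Tuple[str, ...]


def asset_band_names(asset_key: str, asset: dict) -> Tuple[str, ...]:
    """Band names of an asset: taken from "eo:bands" if present, else the asset key."""
    eo_bands = asset.get("eo:bands") or []
    names = tuple(b["name"] for b in eo_bands if "name" in b)
    return names or (asset_key,)


def select_assets(
    items: Iterable[dict], bands: Optional[List[str]] = None
) -> Iterator[AssetRef]:
    """Yield the assets that provide at least one requested band (all assets if bands is None)."""
    for item in items:
        for key, asset in item.get("assets", {}).items():
            names = asset_band_names(key, asset)
            if bands is None or any(name in bands for name in names):
                yield AssetRef(
                    item_id=item["id"], asset_key=key, href=asset["href"], bands=names
                )


# A unit test then only needs a hand-written dict, no dummy API and no NetCDF inspection.
# The item below is made-up test data.
example_item = {
    "id": "item-1",
    "assets": {
        "B04": {"href": "https://stac.test/B04.tif", "eo:bands": [{"name": "red"}]},
        "thumbnail": {"href": "https://stac.test/thumb.png"},
    },
}
selected = list(select_assets([example_item], bands=["red"]))
assert [a.asset_key for a in selected] == ["B04"]
```

The actual split of load_stac into phases would probably look different; this just illustrates the kind of setup-free testing that moving logic out of the GeoPySparkBackendImplementation method would allow.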

bossie commented 11 months ago

Related: https://github.com/Open-EO/openeo-geopyspark-driver/issues/528#issuecomment-1744438343