`load_stac` is quite a challenging process (because of the variability in how actual STAC resources can be structured). Moreover, it is becoming an important cornerstone for quite a few use cases and features (cross-backend execution, test suites, ...).
Our `load_stac` implementation has grown organically, but it's becoming a bit challenging to fine-tune it further. It's currently a single `GeoPySparkBackendImplementation` method nearing 400 LOC. Writing (unit) tests for it is a high-friction endeavor because you can only check the whole pipeline: have a functional API running, submit `load_stac` with a functional URL, and check the resulting GeoTIFF/NetCDF.
I'd propose to break things down a bit, so that smaller aspects/phases of `load_stac` can be tested and managed more easily (without having to juggle functional dummy APIs and inspect NetCDF data).
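To sketch the kind of decomposition I have in mind (the helper below is purely hypothetical, just an illustration, not existing code), a small self-contained phase like deriving the temporal extent from STAC items could become a plain function with a trivial unit test:

```python
# Hypothetical sketch of an extracted load_stac phase
# (function name, module location and signature are illustrative).
import datetime as dt
from typing import Iterable, Tuple


def extract_temporal_extent(item_datetimes: Iterable[dt.datetime]) -> Tuple[dt.datetime, dt.datetime]:
    """Derive the overall temporal extent from the datetimes of STAC items."""
    datetimes = sorted(item_datetimes)
    if not datetimes:
        raise ValueError("Expected at least one STAC item with a datetime")
    return (datetimes[0], datetimes[-1])


# Directly unit-testable: no running API, no NetCDF inspection needed.
def test_extract_temporal_extent():
    times = [dt.datetime(2024, 3, 2), dt.datetime(2024, 1, 5)]
    assert extract_temporal_extent(times) == (dt.datetime(2024, 1, 5), dt.datetime(2024, 3, 2))
```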
I think it makes sense to create a new subpackage "namespace", e.g. `openeogeotrellis.processes`, where we can have process-specific submodules that group helpers for non-trivial processes like `load_stac` (and `load_collection` too).
Pulling implementation parts out of the `GeoPySparkBackendImplementation` or `GeopysparkDataCube` context should simplify the setup work necessary for testing.
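For example, the layout could look something like this (the concrete submodule file names are just a suggestion):

```
openeogeotrellis/
    processes/
        __init__.py
        load_stac.py         # helpers for the load_stac process
        load_collection.py   # helpers for the load_collection process
```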