Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0
batch_job.py `write_metadata`: avoid ad-hoc file selection for upload #940

Open soxofaan opened 1 week ago

soxofaan commented 1 week ago

https://github.com/Open-EO/openeo-geopyspark-driver/blob/b63280c5fc5b928a0d231ab1aec6e3b47b4b9c36/openeogeotrellis/deploy/batch_job.py#L511-L519

Here we're building an ugly ad-hoc deny-list for "files" that should not be uploaded to S3.

As mentioned in the TODO, we should upload from an explicit asset list instead of blindly assuming that everything in the job dir should be uploaded (minus some hand-picked exceptions).
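To make the allow-list idea concrete, here is a minimal sketch of what selecting uploads from an explicit asset list could look like. The function name `select_files_to_upload` and its parameters are hypothetical and not taken from `batch_job.py`; it assumes the asset list holds paths relative to the job dir.

```python
from pathlib import Path
from typing import Iterable, List


def select_files_to_upload(job_dir: Path, asset_paths: Iterable[str]) -> List[Path]:
    """Hypothetical helper: keep only explicitly listed assets found under job_dir."""
    job_dir = job_dir.resolve()
    selected = []
    for rel in asset_paths:
        candidate = (job_dir / rel).resolve()
        # Skip entries that escape the job dir or are not regular files.
        if job_dir in candidate.parents and candidate.is_file():
            selected.append(candidate)
    return selected
```

With this approach, anything else that happens to land in the job dir (dependency folders, temp files) is ignored by default, so no exception list has to be maintained.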

soxofaan commented 1 week ago

cc @EmileSonneveld

soxofaan commented 1 week ago

As an illustration that this does not scale:

`if UDF_PYTHON_DEPENDENCIES_FOLDER_NAME in str(file_path):`

doesn't even work, as `UDF_PYTHON_DEPENDENCIES_FOLDER_NAME` is not in play anymore on CDSE since #845.

EmileSonneveld commented 1 week ago

In `export_workspace`, the list of files that exist locally and on S3 is determined by the list of STAC metadata files, for example collection.json + item.tiff.json. Probably the same approach can be used here.
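A rough sketch of deriving the file list from the STAC metadata, not the actual `export_workspace` code. It assumes the standard STAC layout (a collection whose `links` with `rel == "item"` point to item JSON files, each with an `assets` mapping with `href` entries) and that all hrefs resolve to files directly in the job dir. The function name `files_from_stac` is hypothetical.

```python
import json
from pathlib import Path
from typing import List


def files_from_stac(collection_path: Path) -> List[Path]:
    """Hypothetical helper: list collection, item and asset files referenced by STAC metadata."""
    job_dir = collection_path.parent
    collection = json.loads(collection_path.read_text())
    files = [collection_path]
    for link in collection.get("links", []):
        if link.get("rel") != "item":
            continue
        # Assumption: item files live directly in the job dir.
        item_path = job_dir / Path(link["href"]).name
        files.append(item_path)
        item = json.loads(item_path.read_text())
        for asset in item.get("assets", {}).values():
            # Assumption: asset hrefs also point into the job dir.
            files.append(job_dir / Path(asset["href"]).name)
    return files
```

Files that are not referenced by STAC metadata (such as job_metadata.json) would still have to be added to the list explicitly.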

EmileSonneveld commented 1 week ago

This issue is a direct result of changes introduced in https://github.com/Open-EO/openeo-geopyspark-driver/issues/877

EmileSonneveld commented 3 days ago

Logged which files actually get uploaded. All logs on CDSE dev were one of the following two:

Writing results to object storage. paths=[PosixPath('/batch_jobs/j-XXX/job_specification.json'), PosixPath('/batch_jobs/j-XXX/job_metadata.json'), PosixPath('/batch_jobs/j-XXX/openEO_2017-03-07Z.tif'), PosixPath('/batch_jobs/j-XXX/openEO_2017-03-07Z.tif.aux.xml'), PosixPath('/batch_jobs/j-XXX/openEO_2017-03-07Z.tif.json'), PosixPath('/batch_jobs/j-XXX/collection.json')]

Writing results to object storage. paths=[PosixPath('/batch_jobs/j-XXX/job_specification.json'), PosixPath('/batch_jobs/j-XXX/job_metadata.json')]