Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0

improve persisting and retrieving batch job result metadata #735

Open bossie opened 3 months ago

bossie commented 3 months ago

Batch job j-24031991c78040e482e3a02fd464c3af generated 6 MB of result metadata, most of which is taken up by "derived_from" links (there are 17694); that doesn't fit in a ZNode, resulting in the familiar ZK ConnectionLoss in the job tracker.

In this case the problem might be solved by simply not patching the links: https://github.com/Open-EO/openeo-python-driver/blob/39dfaa415d42fb014bedc84a7a935cb817bca09d/openeo_driver/views.py#L1033
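As a rough illustration of why dropping those links would help, here is a minimal sketch that measures the serialized size of a metadata document with and without its "derived_from" links. It is not the driver's actual code: the `links`/`rel`/`href` layout, the helper names `serialized_size` and `without_derived_from`, the synthetic URLs, and the ~1 MiB limit (ZooKeeper's default `jute.maxbuffer`) are all assumptions made for the example.

```python
import json

# Assumption: ZooKeeper's default znode size limit (jute.maxbuffer) is roughly 1 MiB.
ZNODE_LIMIT_BYTES = 1024 * 1024


def serialized_size(doc: dict) -> int:
    """Size in bytes of the document when stored as UTF-8 encoded JSON."""
    return len(json.dumps(doc).encode("utf-8"))


def without_derived_from(metadata: dict) -> dict:
    """Return a shallow copy of the metadata with all "derived_from" links dropped."""
    kept = [link for link in metadata.get("links", []) if link.get("rel") != "derived_from"]
    return {**metadata, "links": kept}


# Synthetic stand-in for the real result metadata (hypothetical URLs):
# 17694 "derived_from" links plus one unrelated link.
metadata = {
    "links": [
        {"rel": "derived_from", "href": f"https://example.test/products/{i:060d}"}
        for i in range(17694)
    ]
    + [{"rel": "self", "href": "https://example.test/jobs/j-example/results"}],
}

slim = without_derived_from(metadata)
print(serialized_size(metadata) > ZNODE_LIMIT_BYTES)  # does the full document exceed the limit?
print(serialized_size(slim) <= ZNODE_LIMIT_BYTES)     # does the slimmed document fit?
```

A size check like this could also act as a guard before writing to ZK, so an oversized document is reported explicitly instead of surfacing as a ConnectionLoss.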

Maybe we should revisit/unify the way batch job result metadata is persisted and retrieved (currently a mix of a ZK/ES document and the job_metadata.json file). This benefits the ZK as well as the EJR case.

Related: https://github.com/Open-EO/openeo-geopyspark-driver/blob/6625156fb59d2de83e3b6d487cf54c6f2a17c526/openeogeotrellis/job_tracker_v2.py#L554

https://github.com/Open-EO/openeo-python-driver/issues/190

soxofaan commented 3 months ago

Indeed. I think this kind of overlaps with this issue:

So the idea would be that the job registry should only store pure batch job metadata, and the batch job result data and metadata should be kept separate from that.
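A minimal sketch of that split, under stated assumptions: a plain dict stands in for the job registry (ZK or EJR), result metadata goes to a `job_metadata.json` file, and the registry entry only keeps a pointer to it. The function names and the `results_metadata_uri` field are hypothetical and only serve the illustration.

```python
import json
import tempfile
from pathlib import Path


def persist_job(registry: dict, results_dir: Path, job_id: str,
                job_info: dict, result_metadata: dict) -> None:
    """Write result metadata to a file; keep only job info plus a pointer in the registry."""
    metadata_path = results_dir / job_id / "job_metadata.json"
    metadata_path.parent.mkdir(parents=True, exist_ok=True)
    metadata_path.write_text(json.dumps(result_metadata), encoding="utf-8")
    # "results_metadata_uri" is a hypothetical field name for the pointer.
    registry[job_id] = {**job_info, "results_metadata_uri": str(metadata_path)}


def load_result_metadata(registry: dict, job_id: str) -> dict:
    """Resolve the pointer from the registry and read the result metadata from the file."""
    metadata_path = Path(registry[job_id]["results_metadata_uri"])
    return json.loads(metadata_path.read_text(encoding="utf-8"))


registry = {}  # in-memory stand-in for the real job registry
with tempfile.TemporaryDirectory() as tmp:
    persist_job(
        registry,
        Path(tmp),
        "j-example",
        job_info={"status": "finished"},
        result_metadata={"links": [{"rel": "derived_from", "href": "https://example.test/p/1"}]},
    )
    loaded = load_result_metadata(registry, "j-example")

print(sorted(registry["j-example"]))  # registry entry holds job info and the pointer only
```

With this layout the registry document stays small regardless of how many links the results carry, and the large document is only read when results are actually requested.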