Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0

delete batch job takes considerable time #763

Closed by bossie 2 months ago

bossie commented 2 months ago

openeo-geopyspark-integrationtests.tests.test_integration.test_batch_job_delete_job started failing consistently:

RuntimeError: Expected 404 Not Found, but got {'costs': 2, 'created': '2024-04-24T18:44:25Z', 'id': 'j-240424177abe40d6b2f4d90d23b4478b', 'process': {'process_graph': {'aggregatespatial1': {'arguments': {'data': {'from_node': 'filtertemporal1'}, 'geometries': {'coordinates': [[[7, 51.75], [7.1, 51.35], [7.5, 51.3], [7.6, 51.7], [7, 51.75]]], 'type': 'Polygon'}, 'reducer': {'process_graph': {'mean1': {'arguments': {'data': {'from_parameter': 'data'}}, 'process_id': 'mean', 'result': True}}}}, 'process_id': 'aggregate_spatial'}, 'filtertemporal1': {'arguments': {'data': {'from_node': 'loadcollection1'}, 'extent': ['2017-11-01', '2017-11-21']}, 'process_id': 'filter_temporal'}, 'loadcollection1': {'arguments': {'bands': ['NDVI'], 'id': 'PROBAV_L3_S10_TOC_333M', 'spatial_extent': None, 'temporal_extent': None}, 'process_id': 'load_collection'}, 'saveresult1': {'arguments': {'data': {'from_node': 'aggregatespatial1'}, 'format': 'GTIFF', 'options': {}}, 'process_id': 'save_result', 'result': True}}}, 'status': 'finished', 'title': 'test_batch_job_delete_job', 'updated': '2024-04-24T18:50:05Z', 'usage': {'cpu': {'unit': 'cpu-seconds', 'value': 1658}, 'duration': {'unit': 'seconds', 'value': 229}, 'input_pixel': {'unit': 'mega-pixel', 'value': 0.75}, 'memory': {'unit': 'mb-seconds', 'value': 3056117}}}

The job is actually soft-deleted (check EJR index) but with what looks like a considerable delay.

Adding an additional check after 5s did not help:

_verify_job_existence job_id='j-240424177abe40d6b2f4d90d23b4478b' user_id='f689e77d-f188-40ca-b12b-3e278f0ad68f' exists=False backoff=5.0
Verification of job_id='j-240424177abe40d6b2f4d90d23b4478b' user_id='f689e77d-f188-40ca-b12b-3e278f0ad68f' exists=False unsure after 4 attempts

bossie commented 2 months ago

The initial theory was that creating a new ES index caused a temporary hiccup, so an additional check after 5 seconds was added to reduce the chance of false positives. That turned out not to be the cause, but I'm leaving the extra check in: a max delay of 1.1s might be a bit optimistic, and deleting a batch job is not a hot path.
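For illustration, here is a minimal sketch of the kind of verification loop described above. It is not the driver's actual code: the function name, its parameters and the backoff schedule (0 s, 0.1 s, 1 s, plus the extra 5 s attempt) are assumptions, chosen only so that the first three waits add up to the 1.1s mentioned and the total is 4 attempts, as in the log.

```python
import time
from typing import Callable, Iterable


def verify_job_existence(
    job_exists: Callable[[str, str], bool],  # hypothetical lookup, e.g. a search against the EJR
    job_id: str,
    user_id: str,
    expected: bool,
    backoffs: Iterable[float] = (0, 0.1, 1.0, 5.0),  # assumed schedule, not taken from the driver
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the lookup agrees with `expected`, False if still unsure."""
    for backoff in backoffs:
        sleep(backoff)  # wait before each attempt; the first wait is 0 seconds
        if job_exists(job_id, user_id) == expected:
            return True
    return False  # corresponds to "unsure after N attempts" in the log above


if __name__ == "__main__":
    # Fake lookup that keeps reporting the job as present: verification stays unsure.
    always_there = lambda job_id, user_id: True
    print(verify_job_existence(always_there, "j-123", "u-456", expected=False,
                               sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. Note that no amount of backoff helps when the lookup itself is wrong, which is what happened here.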

What actually happened is that the new index had no mapping for the deleted field. Instead of interpreting that field and omitting the soft-deleted document from the search results, the EJR just returned the document.

A new index with a mapping for deleted has since been created.
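As a rough sketch of what such a mapping and a matching search could look like, the snippet below builds an Elasticsearch index body that maps deleted as a boolean, plus a query that excludes soft-deleted jobs. The actual EJR mapping and query are not shown in this issue, so the helper names, the `job_id` field name and the shape of the query are assumptions for illustration only.

```python
def index_body_with_deleted_mapping() -> dict:
    """Index creation body with an explicit boolean mapping for `deleted`."""
    return {
        "mappings": {
            "properties": {
                "deleted": {"type": "boolean"},
            }
        }
    }


def active_job_query(job_id: str) -> dict:
    """Search body that matches a job only if it is not marked as deleted.

    The `job_id` field name is hypothetical; the real EJR document layout may differ.
    """
    return {
        "query": {
            "bool": {
                "filter": [{"term": {"job_id": job_id}}],
                "must_not": [{"term": {"deleted": True}}],
            }
        }
    }


if __name__ == "__main__":
    print(index_body_with_deleted_mapping())
    print(active_job_query("j-240424177abe40d6b2f4d90d23b4478b"))
```

With a query of this shape, a document whose deleted field the index cannot interpret would never hit the `must_not` clause and would still be returned, which is consistent with the behaviour described above.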

bossie commented 2 months ago

The integration tests pass again, so this is fixed.