Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0
26 stars 4 forks source link

phase out ZkJobRegistry #632

Closed bossie closed 8 months ago

bossie commented 9 months ago

Context: ZK on CDSE crashes due to OOM.

soxofaan commented 9 months ago

FYI: related to (and maybe even duplicate of) #498

bossie commented 9 months ago

async_task is only aware of ZkJobRegistry.

In the case of SHub batch processes that's just fine because those weren't/aren't available on CDSE SHub endpoint https://sh.dataspace.copernicus.eu and won't be considered:

https://github.com/Open-EO/openeo-geopyspark-driver/blob/4478f96a3053eda10851cf4afd5cc2dd657e1704/openeogeotrellis/backend.py#L2577-L2578

load_stac of partial job results only works on Terrascope anyway so this changes nothing.

soxofaan commented 9 months ago

FYI: I've also been tagging some implementation parts with todo notes referencing #498

$ grep '#498' openeogeotrellis/ -r -n
openeogeotrellis/job_registry.py:797:        # TODO #236/#498 For now: compare job metadata between Zk and EJR
openeogeotrellis/job_registry.py:910:        # TODO #236/#498: error if both sources failed?
openeogeotrellis/job_registry.py:923:        # TODO #236/#498 Need to have EJR implementation for this? This is only necessary for ZK cleaner script anyway.
openeogeotrellis/job_tracker_v2.py:395:            # TODO: #236/#498 also/instead get jobs_to_track from EJR?
openeogeotrellis/backend.py:334:            # TODO #236/#498 avoid this fallback and just make sure it is always set when necessary
openeogeotrellis/backend.py:1640:            zk_job_registry_factory=ZkJobRegistry,  # TODO #236/#498 allow to disable this with config?
openeogeotrellis/backend.py:1984:            # TODO #498 eliminate ZK code path, or at least encapsulate this logic better
bossie commented 9 months ago

Heh. Persisting usage in ES (#488) doesn't work. From the YARN job_tracker logs:

In context "job_tracker job_metadata.status='finished' from YarnStatusGetter": caught EjrHttpError('EJR API error: 400 \'Bad Request\' on PATCH \'https://jobregistry.openeo.vito.be/jobs/j-240105907a9640a79375dbcdabe66505\': {"statusCode":400,"message":["property costs should not exist"],"error":"Bad Request"}')

Traceback (most recent call last):
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/util/logging.py", line 371, in just_log_exceptions
    yield
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/job_tracker_v2.py", line 512, in _sync_job_status
    self._elastic_job_registry.set_usage(job_id, job_costs, dict(usage))
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/jobregistry.py", line 465, in set_usage
    return self._update(
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/jobregistry.py", line 442, in _update
    return self._do_request("PATCH", f"/jobs/{job_id}", json=data)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/jobregistry.py", line 288, in _do_request
    raise EjrHttpError.from_response(response=response)
openeo_driver.jobregistry.EjrHttpError: EJR API error: 400 'Bad Request' on `PATCH 'https://jobregistry.openeo.vito.be/jobs/j-240105907a9640a79375dbcdabe66505'`: {"statusCode":400,"message":["property costs should not exist"],"error":"Bad Request"}

Was a bit hard to find because there's no job_id in the log entry.

soxofaan commented 9 months ago

so the "costs" field must be supported in EJR (https://github.com/Open-EO/openeo-job-registry-elastic-api) this is something @JanssenBrm or hist team has to add

bossie commented 9 months ago

@soxofaan was the property result_metadata (actually: results_metadata) mentioned in https://github.com/Open-EO/openeo-job-registry-elastic-api/issues/2#issuecomment-1351354363 ever used?

Because this seems to assume that all properties are top level:

https://github.com/Open-EO/openeo-geopyspark-driver/blob/5475236f6edbc9efc4ec0c1fc9c7659631dbadea/openeogeotrellis/job_registry.py#L518-L550

soxofaan commented 9 months ago

no indeed, I don't think results_metadata is already being used (writing nor reading)

JanssenBrm commented 9 months ago

Is it possible to create an issue on https://github.com/Open-EO/openeo-job-registry-elastic-api with clear instructions on what should be done?

bossie commented 8 months ago

Ugh Github, stop closing issues upon merging one of several PRs.

bossie commented 8 months ago

Still need to get the jobs to be tracked from EJR instead of ZK (context: https://github.com/Open-EO/openeo-geopyspark-driver/pull/638).

Potential quickfix https://github.com/Open-EO/openeo-geopyspark-driver/pull/639 was introduced in the meanwhile.

bossie commented 8 months ago

Disabled ZkJobRegistry on CDSE dev/staging; seems to work after a force_redeploy.

bossie commented 8 months ago

seems to work

*sometimes.

Not production ready because of https://github.com/Open-EO/openeo-job-registry-elastic-api/issues/32.

bossie commented 8 months ago

ES mapping of dependencies and results_metadata should be changed (to flattened?) to avoid a mapping explosion.

This should also fix integration tests test_advanced_cloud_masking_diy and test_load_collection_references_correct_batch_process_id.

bossie commented 8 months ago

For reference: https://github.com/Open-EO/openeo-job-registry-elastic-api/issues/33

bossie commented 8 months ago

Rolled back on CDSE dev/staging because of https://github.com/Open-EO/openeo-job-registry-elastic-api/issues/33 (#660).