Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0

add timeout to requests towards ETL API #690

Closed: bossie closed this issue 4 months ago

bossie commented 4 months ago

JobTracker was hanging on Terrascope and CDSE and had to be killed. Last line in the logs was:

{
  "message": "logging resource usage {'jobId': 'j-2402222929774711bcc7b5414431dae3', 'jobName': 'CH4', 'executionId': 'a-c9e77e8cd4f74ecf85ab9db984d9f24f', 'userId': '6c19184e-dc90-48bf-8eb4-0a9a74e992e0', 'sourceId': 'cdse', 'orchestrator': 'openeo', 'jobStart': 1708620563000.0, 'jobFinish': 1708622583000.0, 'idempotencyKey': 'a-c9e77e8cd4f74ecf85ab9db984d9f24f', 'state': 'FINISHED', 'status': 'UNDEFINED', 'metrics': {'cpu': {'value': 3600, 'unit': 'cpu-seconds'}, 'memory': {'value': 7372800.0, 'unit': 'mb-seconds'}, 'time': {'value': 2020000.0, 'unit': 'milliseconds'}, 'processing': {'value': 307.1666758209467, 'unit': 'shpu'}}} at https://marketplace-cost-api-prod-warsaw.dataspace.copernicus.eu",
  "levelname": "DEBUG",
  "name": "openeogeotrellis.integrations.etl_api",
  "created": 1708779120.3757954,
  "filename": "etl_api.py",
  "lineno": 127,
  "process": 1,
  "job_id": "j-2402222929774711bcc7b5414431dae3",
  "user_id": "6c19184e-dc90-48bf-8eb4-0a9a74e992e0"
}

Adding a timeout to the requests towards the ETL API should unblock JobTracker.

Note: this does not solve the underlying problem; when the timeout is reached, the batch job succeeds but the user might not be charged.
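
A minimal sketch of the timeout idea, not the driver's actual code. The function name `log_resource_usage`, the 60-second default and the use of the standard library's `urllib.request` are assumptions made for illustration; the thread does not show which HTTP client `etl_api.py` uses. The point is only that a bounded wait turns a hang into a failure the caller can handle.

```python
import json
import socket
import urllib.error
import urllib.request


def log_resource_usage(url, payload, timeout=60.0):
    """POST `payload` as JSON to `url` (hypothetical helper).

    Returns True on a 2xx response and False if the request fails or does
    not complete within `timeout` seconds, so the caller never blocks forever.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        # `timeout` bounds the connection attempt and each blocking read.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (socket.timeout, urllib.error.URLError):
        # Timed out, connection failed, or the server answered with an
        # HTTP error status; the usage was not confirmed as logged.
        return False
```

A `False` result corresponds to the case in the note above: the job can still finish, but the caller has to decide what to do about the usage that was not recorded. With the `requests` library the equivalent is passing `timeout=` to `requests.post`.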

bossie commented 4 months ago

Suggestion by @soxofaan: retry ETL API requests.

https://github.com/eu-cdse/openeo-cdse-infra/issues/41 made it possible to retry ETL API requests without the risk of charging the user multiple times. The underlying problem there was a large process graph that couldn't fit in the job's ZNode. This prevented the job from being marked as completed, so it was picked up again in subsequent JobTracker runs and the user was charged again.

So the suggestion is about retries within a particular JobTracker run rather than across JobTracker runs, and it still makes sense.
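
To illustrate what retrying within a single JobTracker run could look like, here is a small sketch. `call_with_retries` and its parameters are hypothetical names, not part of the driver; the linked code in the repository may do this differently. The number of attempts is bounded, so one run can never loop forever on an unresponsive API.

```python
import time


def call_with_retries(func, attempts=3, delay_seconds=5.0,
                      retry_on=(OSError,), sleep=time.sleep):
    """Call `func()` up to `attempts` times within one run (hypothetical helper).

    Retries only on the exception types in `retry_on` and re-raises the
    last error once the attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # give up; the caller decides how to handle it
            sleep(delay_seconds)  # fixed pause before the next attempt
```

A request wrapped this way would be invoked as `call_with_retries(lambda: send_request())`, where `send_request` stands for whatever function performs the ETL API call and raises on failure. The `sleep` parameter is only there so that tests can replace the delay with a no-op.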

bossie commented 4 months ago

ETL API requests should already be retried in sync requests and in batch jobs, by the following code respectively:

https://github.com/Open-EO/openeo-geopyspark-driver/blob/c390aaf5e6fbbf06c8a990e4cad22443f44808a0/openeogeotrellis/backend.py#L1453-L1468

and

https://github.com/Open-EO/openeo-geopyspark-driver/blob/c390aaf5e6fbbf06c8a990e4cad22443f44808a0/openeogeotrellis/job_costs_calculator.py#L105-L116

bossie commented 4 months ago

Could use a test.