Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0

JobTracker runs on Terrascope prod take much longer #902

Closed bossie closed 1 month ago

bossie commented 1 month ago

JobTracker runs on the production instance of Terrascope take much longer than usual: around 10-15 minutes. The degradation seems to coincide with the recent promotion of both the web app and the JobTracker.

bossie commented 1 month ago

This is how it unfolded:

  1. Users noticed that batch job statuses were no longer being updated. The cause: the OpenEO web app was upgraded from 20240911-2965 to 20241008-3057, and the latter no longer uses the ZkJobRegistry but stores its jobs only in the EJR. Because the JobTracker on NiFi had not been upgraded at that point, it kept looking in ZK for new jobs and found none to update.
  2. The JobTracker was upgraded from 20240628-2779 to 20241008-3057. This made batch job status updates work again, but apparently at a much lower rate than before.

bossie commented 1 month ago

Rolled back both web app and JobTracker to 20240911-2965 and batch job updates are back to their normal rate.

A side effect is that users temporarily lose jobs submitted since the promotion to 20241008-3057 yesterday morning because they are only in EJR and not in ZK.

bossie commented 1 month ago

From the logs of run_id 21d02574-d9f8-4385-8723-765687702410:

EJR Request POST /jobs/search: end 2024-10-09 13:15:11.047009, elapsed 0:00:02.032316
JobTracker.update_statuses: end 2024-10-09 13:28:12.757844, elapsed 0:13:03.743404
JobTracker.update_statuses stats: {"collected jobs": 14547, "job with previous_status='running'": 436, "get metadata attempt": 14547, "skip: app not found": 14547, "job with previous_status='queued'": 14107, "job with previous_status='created'": 4}

from which I derive that:
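A rough reading of those numbers, as a sketch: the counts and the elapsed time are copied from the log lines above, while the variable names and the per-job average are only illustrative arithmetic, not something the JobTracker reports itself.

```python
from datetime import timedelta

# Values copied from the JobTracker.update_statuses log lines above.
stats = {
    "collected jobs": 14547,
    "job with previous_status='running'": 436,
    "job with previous_status='queued'": 14107,
    "job with previous_status='created'": 4,
    "get metadata attempt": 14547,
    "skip: app not found": 14547,
}
elapsed = timedelta(minutes=13, seconds=3, microseconds=743404)

collected = stats["collected jobs"]

# The three previous_status buckets should add up to the collected total.
by_status = sum(
    count for key, count in stats.items()
    if key.startswith("job with previous_status")
)

# Fraction of collected jobs for which no app was found.
skipped_fraction = stats["skip: app not found"] / collected

# Average wall-clock time spent per collected job.
seconds_per_job = elapsed.total_seconds() / collected

print(f"{collected} jobs, {skipped_fraction:.0%} skipped, "
      f"{seconds_per_job * 1000:.0f} ms per job on average")
```

In other words: a metadata lookup was attempted for every collected job, every one of them ended in "skip: app not found", and the run time is dominated by the sheer number of jobs rather than by the EJR search itself (about 2 seconds).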

bossie commented 1 month ago

Related: https://github.com/Open-EO/openeo-geopyspark-driver/pull/638

ZkJobRegistry has user_limit and regular removal of old jobs to speed up JobTracker runs.

soxofaan commented 1 month ago

As a first quick fix, I added "max_age" support to listing "trackable" jobs:

Currently waiting for the integration tests pipeline to get this into the Docker images. It will be build 3063 or higher.
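To make the idea concrete, here is a minimal sketch of a "max_age" cutoff when listing jobs to track. It is not the driver's actual code: `filter_by_max_age`, the `max_age_days` parameter and the dict layout of the job records are assumptions for illustration. Only the notion of a "max_age" on the "created" date comes from this thread.

```python
from datetime import datetime, timedelta, timezone


def filter_by_max_age(jobs, max_age_days, now=None):
    """Keep only jobs created within the last `max_age_days` days.

    Hypothetical helper: the real registry presumably filters server-side.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [
        job for job in jobs
        if datetime.fromisoformat(job["created"]) >= cutoff
    ]


# Illustrative records, not real job data.
now = datetime(2024, 10, 10, tzinfo=timezone.utc)
jobs = [
    {"job_id": "j-old", "status": "queued", "created": "2024-06-01T08:00:00+00:00"},
    {"job_id": "j-new", "status": "running", "created": "2024-10-09T08:00:00+00:00"},
]
recent = filter_by_max_age(jobs, max_age_days=14, now=now)
print([job["job_id"] for job in recent])
```

With a cutoff like this, old jobs that never got an application never reach the per-job metadata lookup, which is presumably where the run time went.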

bossie commented 1 month ago

Upgraded prod web app and JobTracker to 20241010-3064. Seems to work!

I'll add a TODO re: filtering on updated instead of created @soxofaan, should the need arise (probably not).

soxofaan commented 1 month ago

I also think we should merge list_trackable_jobs and list_active_jobs. It only adds to the confusion to have both I think.

I didn't do that yet in the work above, to avoid over-complicating the quick fix.

bossie commented 1 month ago

> I also think we should merge list_trackable_jobs and list_active_jobs. It only adds to the confusion to have both I think.

Yeah, can't find any compelling reasons to keep both.

soxofaan commented 1 month ago

(lowered priority of this ticket, as the "critical" part is handled)

soxofaan commented 1 month ago

I added has_application_id option to list_active_jobs and removed list_trackable_jobs now
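A rough sketch of what a single listing function with such an option could look like. This is an in-memory stand-in, not the registry's real signature: the name `list_active_jobs_sketch`, the `application_id` field and the set of "active" statuses (taken from the statuses that appear in the JobTracker stats earlier in this thread) are assumptions.

```python
# Assumed set of statuses that still need tracking.
ACTIVE_STATUSES = {"created", "queued", "running"}


def list_active_jobs_sketch(jobs, has_application_id=False):
    """List jobs in an active status.

    With has_application_id=True, only jobs that already have an
    application ID are returned, which is what a separate
    "trackable jobs" listing would otherwise provide.
    """
    active = [job for job in jobs if job["status"] in ACTIVE_STATUSES]
    if has_application_id:
        active = [job for job in active if job.get("application_id")]
    return active


# Illustrative records, not real job data.
jobs = [
    {"job_id": "j-1", "status": "created"},
    {"job_id": "j-2", "status": "running", "application_id": "application_0001"},
    {"job_id": "j-3", "status": "finished", "application_id": "application_0002"},
]
all_active = list_active_jobs_sketch(jobs)
trackable = list_active_jobs_sketch(jobs, has_application_id=True)
print([job["job_id"] for job in all_active], [job["job_id"] for job in trackable])
```

One function with a flag keeps the status filter in a single place, so the two listings cannot drift apart.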

soxofaan commented 1 month ago

I now also switched from checking "created" date to "updated" date with 5b21a985d70c2b3e376e0cee3cdb20e350b0ce37 (forgot to properly link back from commit message apparently)
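To illustrate how the two date fields can differ, here is a small sketch with made-up data. The helper `is_recent` and the record layout are hypothetical, and this shows only one scenario where the choice matters, not necessarily the motivation behind the commit.

```python
from datetime import datetime, timedelta, timezone


def is_recent(job, date_field, max_age, now):
    """True if the job's `date_field` timestamp falls within `max_age` of `now`."""
    return datetime.fromisoformat(job[date_field]) >= now - max_age


now = datetime(2024, 10, 10, tzinfo=timezone.utc)
max_age = timedelta(days=14)

# A job created long ago that only recently changed state (illustrative data).
job = {
    "job_id": "j-late-start",
    "created": "2024-08-01T08:00:00+00:00",
    "updated": "2024-10-09T12:00:00+00:00",
}

kept_by_created = is_recent(job, "created", max_age, now)
kept_by_updated = is_recent(job, "updated", max_age, now)
print(kept_by_created, kept_by_updated)
```

A cutoff on "created" drops this job even though it is still changing, while a cutoff on "updated" keeps it.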

soxofaan commented 1 month ago

which should close this ticket