Closed bossie closed 1 month ago
How it unfolded:
The promoted web app no longer uses ZkJobRegistry and instead only stores its jobs in the EJR. Because the JobTracker on Nifi was not upgraded at this point, it was still looking in ZK for new jobs but found none to update.

Rolled back both the web app and the JobTracker to 20240911-2965, and batch job updates are back to their normal rate.
A side effect is that users temporarily lose jobs submitted since the promotion to 20241008-3057 yesterday morning because they are only in EJR and not in ZK.
From the logs of run_id 21d02574-d9f8-4385-8723-765687702410:
EJR Request POST /jobs/search: end 2024-10-09 13:15:11.047009, elapsed 0:00:02.032316
JobTracker.update_statuses: end 2024-10-09 13:28:12.757844, elapsed 0:13:03.743404
JobTracker.update_statuses stats: {"collected jobs": 14547, "job with previous_status='running'": 436, "get metadata attempt": 14547, "skip: app not found": 14547, "job with previous_status='queued'": 14107, "job with previous_status='created'": 4}
from which I derive that:
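A rough way to cross-check these figures, as a sketch only: the snippet below redoes the arithmetic on the values copied from the log lines above. The helper name `parse_elapsed` and the variable names are hypothetical; nothing here comes from the JobTracker code itself.

```python
from datetime import timedelta


def parse_elapsed(text: str) -> timedelta:
    """Parse an 'H:MM:SS.ffffff' duration as printed in the log lines."""
    hours, minutes, seconds = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


# Values copied verbatim from the logged stats.
stats = {
    "collected jobs": 14547,
    "job with previous_status='running'": 436,
    "get metadata attempt": 14547,
    "skip: app not found": 14547,
    "job with previous_status='queued'": 14107,
    "job with previous_status='created'": 4,
}

search_elapsed = parse_elapsed("0:00:02.032316")   # EJR POST /jobs/search
update_elapsed = parse_elapsed("0:13:03.743404")   # JobTracker.update_statuses

collected = stats["collected jobs"]
# Sum of the per-status counters, to compare against the collected total.
by_status = sum(
    count for key, count in stats.items() if key.startswith("job with previous_status=")
)
# Fraction of collected jobs that were skipped as "app not found".
skipped_fraction = stats["skip: app not found"] / collected
# Average wall-clock time spent per collected job.
seconds_per_job = update_elapsed.total_seconds() / collected
# Share of the whole run spent in the EJR search request.
search_share = search_elapsed / update_elapsed

print(f"collected={collected}, sum of per-status counts={by_status}")
print(f"skipped as 'app not found': {skipped_fraction:.0%}")
print(f"average time per job: {seconds_per_job * 1000:.1f} ms")
print(f"EJR search share of run time: {search_share:.2%}")
```

Read this way, the EJR search itself is a small part of the run; most of the time goes into per-job metadata attempts that all end in "skip: app not found".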
Related: https://github.com/Open-EO/openeo-geopyspark-driver/pull/638
ZkJobRegistry has user_limit and regular removal of old jobs to speed up JobTracker runs.
As a first quick fix, I added "max_age" support to listing "trackable" jobs.
Currently waiting for the integration tests pipeline to get this into the Docker images. It will be build 3063 or higher.
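To illustrate the idea behind the "max_age" quick fix, here is a minimal in-memory sketch. It assumes job records are dicts with an ISO 8601 "created" timestamp; the function name `filter_by_max_age` and the record layout are hypothetical and are not the actual EJR query or driver code.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def filter_by_max_age(
    jobs: List[dict], max_age: timedelta, now: Optional[datetime] = None
) -> List[dict]:
    """Keep only jobs whose "created" timestamp lies within max_age of now."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - max_age
    return [job for job in jobs if datetime.fromisoformat(job["created"]) >= cutoff]


# Hypothetical job records: one stale, one recent.
now = datetime(2024, 10, 10, 12, 0, tzinfo=timezone.utc)
jobs = [
    {"job_id": "j-old", "status": "queued", "created": "2024-07-01T08:00:00+00:00"},
    {"job_id": "j-new", "status": "queued", "created": "2024-10-09T08:00:00+00:00"},
]

recent = filter_by_max_age(jobs, max_age=timedelta(days=30), now=now)
print([job["job_id"] for job in recent])
```

Cutting off old jobs at listing time keeps the tracker from attempting a metadata lookup for every stale "queued" or "created" record on each run.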
Upgraded prod web app and JobTracker to 20241010-3064. Seems to work!
I'll add a TODO re: filtering on updated instead of created, @soxofaan, should the need arise (probably not).
I also think we should merge list_trackable_jobs and list_active_jobs. It only adds to the confusion to have both, I think.
I didn't do that yet in the work above, so as not to over-complicate the quick fix.
> I also think we should merge list_trackable_jobs and list_active_jobs. It only adds to the confusion to have both I think.
Yeah, can't find any compelling reasons to keep both.
(lowered priority of this ticket, as the "critical" part is handled)
I added a has_application_id option to list_active_jobs and removed list_trackable_jobs now.
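As a rough picture of what a merged listing call could look like, here is an in-memory stand-in. The record layout, the `ACTIVE_STATUSES` set (taken from the statuses seen in the JobTracker stats) and the exact signature are assumptions; the real `list_active_jobs` in the driver may differ.

```python
from typing import List

# Assumption: "active" means one of the statuses that show up in the tracker stats.
ACTIVE_STATUSES = {"created", "queued", "running"}


def list_active_jobs(jobs: List[dict], has_application_id: bool = False) -> List[dict]:
    """Hypothetical in-memory stand-in for a registry listing call.

    With has_application_id=True, only jobs that already carry an
    application id are returned, i.e. the ones a tracker can look up.
    """
    active = [job for job in jobs if job.get("status") in ACTIVE_STATUSES]
    if has_application_id:
        active = [job for job in active if job.get("application_id")]
    return active


# Hypothetical records for illustration.
jobs = [
    {"job_id": "j-1", "status": "created", "application_id": None},
    {"job_id": "j-2", "status": "running", "application_id": "application_0001"},
    {"job_id": "j-3", "status": "finished", "application_id": "application_0002"},
]

print([j["job_id"] for j in list_active_jobs(jobs)])
print([j["job_id"] for j in list_active_jobs(jobs, has_application_id=True)])
```

A single function with an optional flag covers both former use cases, which is the point of dropping the separate "trackable" listing.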
I now also switched from checking the "created" date to the "updated" date with 5b21a985d70c2b3e376e0cee3cdb20e350b0ce37 (apparently I forgot to properly link back from the commit message), which should close this ticket.
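Why the choice of field matters can be seen with a single long-running job. This is a sketch with a hypothetical record and a hypothetical helper `is_within_max_age`; the field names "created" and "updated" follow the wording above, everything else is assumed.

```python
from datetime import datetime, timedelta, timezone


def is_within_max_age(job: dict, field: str, max_age: timedelta, now: datetime) -> bool:
    """Check whether the given timestamp field of a job is newer than now - max_age."""
    return datetime.fromisoformat(job[field]) >= now - max_age


now = datetime(2024, 10, 10, 12, 0, tzinfo=timezone.utc)
max_age = timedelta(days=14)

# Hypothetical job: created weeks ago, but its status changed an hour ago.
long_running = {
    "job_id": "j-long",
    "status": "running",
    "created": "2024-09-01T08:00:00+00:00",
    "updated": "2024-10-10T11:00:00+00:00",
}

kept_by_created = is_within_max_age(long_running, "created", max_age, now)
kept_by_updated = is_within_max_age(long_running, "updated", max_age, now)
print(f"kept when filtering on created: {kept_by_created}")
print(f"kept when filtering on updated: {kept_by_updated}")
```

Filtering on "updated" keeps jobs that are old but still changing state, while still dropping records that have not moved in a long time.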
JobTracker runs on the production instance of Terrascope take much longer than usual: around 10-15 minutes. The degradation seems to coincide with the recent promotion of both the web app and the JobTracker.