galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

K8S Runner: Race condition when modifying job #11734

Open · innovate-invent opened this issue 3 years ago

innovate-invent commented 3 years ago

Galaxy 21.01

Traceback (most recent call last):
  File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 526, in _handle_job_failure
    self.__cleanup_k8s_job(job)
  File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 533, in __cleanup_k8s_job
    stop_job(job, k8s_cleanup_job)
  File "/srv/galaxy/lib/galaxy/jobs/runners/util/pykube_util.py", line 75, in stop_job
    job.scale(replicas=0)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/pykube/mixins.py", line 32, in scale
    self.update()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/pykube/objects.py", line 119, in update
    self.api.raise_for_status(r)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/pykube/http.py", line 106, in raise_for_status
    raise HTTPError(resp.status_code, payload["message"])
pykube.exceptions.HTTPError: Operation cannot be fulfilled on jobs.batch "gxy-islandcompare-test-tlpmh": the object has been modified; please apply your changes to the latest version and try again

The runner needs to catch this error, refresh the Job object, and retry the operation.
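For illustration, a minimal sketch of what that retry could look like around the `job.scale(replicas=0)` call in `pykube_util.stop_job` — this is a hypothetical example, not the actual Galaxy fix; the function name and retry parameters are made up, and it assumes pykube's `HTTPError` exposes the HTTP status code as `e.code`:

```python
# Hypothetical sketch: retry the scale-down after a 409 Conflict by reloading
# the Job so its resourceVersion is current before the next update.
import time

from pykube.exceptions import HTTPError


def scale_job_to_zero(job, max_retries=3, delay=1):
    """Scale a pykube Job to zero replicas, retrying when the object was modified concurrently."""
    for attempt in range(max_retries):
        try:
            job.scale(replicas=0)
            return
        except HTTPError as e:
            # 409 Conflict: another actor updated the Job between our read and write.
            if e.code != 409 or attempt == max_retries - 1:
                raise
            job.reload()  # fetch the latest resourceVersion before retrying
            time.sleep(delay)
```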

mvdbeek commented 3 years ago

That's fixed with https://github.com/galaxyproject/galaxy/pull/11715, right? Thanks for the report and fix!

innovate-invent commented 3 years ago

Actually, this wouldn't be covered by #11715. I totally forgot about this.

pascalg commented 3 months ago

This issue is still causing regular (but random) job failures for us; roughly every 50th job is affected. We thought this had been addressed in https://github.com/galaxyproject/galaxy/pull/15238 for Kubernetes >= 1.26, but the issue persists with Galaxy v24.1 (now the error message is just "An unknown error occurered with this job" [sic]).