We recently received a similar report and I originally thought it may be related to Kubernetes 1.30 and the pykube-ng version we use. However, you are using 1.28 and I have been unable to recreate the problem. The one common thread is EKS. I will investigate that next.
I can test with various EKS versions; however, I am not sure how to build a minimal example with pykube-ng. If you have a snippet that produces a similar effect to Galaxy job scheduling, I can test it and report back :)
Is there a stack trace? Or can the verbosity level be increased to produce one? If not, I think we have a problem with the error being inadequately logged, and we need to figure out which line of code is generating the exception.
Most likely, this is caused by a race condition between k8s modifying the job status and the runner attempting to read and modify the manifest itself. As mentioned earlier, the resulting resourceVersion conflict would cause this error. So if we re-queue the current task whenever this error is encountered, I would expect the runner thread to eventually fetch the latest version and succeed.
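A minimal sketch of what that re-queue/retry could look like at the pykube level (a hypothetical helper, not the runner's actual code; the retry count and the reload-then-retry loop are my assumptions):

```python
import pykube

def scale_down_with_retry(job: pykube.Job, max_attempts: int = 3) -> None:
    """Hypothetical helper: on a 409 Conflict, re-read the Job so the next
    patch carries the latest resourceVersion, then try again."""
    for attempt in range(max_attempts):
        try:
            job.scale(replicas=0)
            return
        except pykube.exceptions.HTTPError as e:
            # pykube raises HTTPError(status_code, message); anything other
            # than a conflict, or the final attempt, is re-raised.
            if getattr(e, "code", None) != 409 or attempt == max_attempts - 1:
                raise
            job.reload()  # fetch the latest manifest before retrying
```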
This is "the most" detailed log I get:
galaxy.jobs.runners.kubernetes ERROR 2024-07-29 13:46:58,387 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] Could not clean up k8s batch job. Ignoring...
Traceback (most recent call last):
  File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/http.py", line 403, in raise_for_status
    resp.raise_for_status()
  File "/galaxy/server/.venv/lib/python3.12/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 409 Client Error: Conflict for url: https://172.20.0.1:443/apis/batch/v1/namespaces/galaxy/jobs/gxy-galaxy-4db2n

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/galaxy/server/lib/galaxy/jobs/runners/kubernetes.py", line 872, in _handle_job_failure
    self.__cleanup_k8s_job(job)
  File "/galaxy/server/lib/galaxy/jobs/runners/kubernetes.py", line 879, in __cleanup_k8s_job
    delete_job(job, k8s_cleanup_job)
  File "/galaxy/server/lib/galaxy/jobs/runners/util/pykube_util.py", line 108, in delete_job
    job.scale(replicas=0)
  File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/mixins.py", line 31, in scale
    self.update()
  File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/objects.py", line 165, in update
    self.patch(self.obj, subresource=subresource)
  File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/objects.py", line 157, in patch
    self.api.raise_for_status(r)
  File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/http.py", line 410, in raise_for_status
    raise HTTPError(resp.status_code, payload["message"])
pykube.exceptions.HTTPError: Operation cannot be fulfilled on jobs.batch "gxy-galaxy-4db2n": the object has been modified; please apply your changes to the latest version and try again
Thanks. That helps with narrowing things down.
To change/update the pykube-ng version requires building a new galaxy-min docker image. I have limited internet connectivity at the moment, so it is not easy for me to build and push a new image right now, but I'll try to get that done in the next few days.
How do you build the galaxy-min docker image? Is it building this as-is, or is there a "min" configuration somewhere?
@mapk-amazon That's the right image; building it as-is will do the job. If you'd like to test the changes, please try this branch: https://github.com/galaxyproject/galaxy/pull/18514. It has some fixes, including the pykube upgrade, that may solve this issue.
Fwiw @mapk-amazon, you can also use ghcr.io/bioconductor/galaxy:dev, which is the built image from that PR.
Thank you all. I used ghcr.io/bioconductor/galaxy:dev, otherwise the same setup as at the start. I uploaded 100x 1 MB files with random content. It failed for 2 of them with the same error:
galaxy.jobs.runners.kubernetes ERROR 2024-08-06 18:06:32,493 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] Could not clean up k8s batch job. Ignoring...
Traceback (most recent call last):
  File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/http.py", line 437, in raise_for_status
    resp.raise_for_status()
  File "/galaxy/server/.venv/lib/python3.12/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 409 Client Error: Conflict for url: https://172.20.0.1:443/apis/batch/v1/namespaces/galaxy/jobs/gxy-galaxy-vnjqk

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/galaxy/server/lib/galaxy/jobs/runners/kubernetes.py", line 912, in _handle_job_failure
    self.__cleanup_k8s_job(job)
  File "/galaxy/server/lib/galaxy/jobs/runners/kubernetes.py", line 919, in __cleanup_k8s_job
    delete_job(job, k8s_cleanup_job)
  File "/galaxy/server/lib/galaxy/jobs/runners/util/pykube_util.py", line 115, in delete_job
    job.scale(replicas=0)
  File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/mixins.py", line 30, in scale
    self.update()
  File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/objects.py", line 165, in update
    self.patch(self.obj, subresource=subresource)
  File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/objects.py", line 157, in patch
    self.api.raise_for_status(r)
  File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/http.py", line 444, in raise_for_status
    raise HTTPError(resp.status_code, payload["message"])
Thanks @mapk-amazon, it sure looks like a race condition. How did you upload the 100 files? Through the UI, API, or other means (bioblend etc)?
While this is shown as an error in the logs, I think the behaviour of the code is harmless; that is why we added that "Ignoring..." part there. Do you actually see the failure in the UI?
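For context, the pattern being referred to is roughly the following (a paraphrase of the idea, not the exact Galaxy source):

```python
import logging

log = logging.getLogger(__name__)

def cleanup_k8s_job(job):
    """Paraphrase of the guard described above: any error during cleanup is
    logged and swallowed so the handler keeps processing other jobs."""
    try:
        job.scale(replicas=0)
        job.delete()
    except Exception:
        log.exception("Could not clean up k8s batch job. Ignoring...")
```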
When running hundreds of jobs, you're always bound to get some arbitrary errors; we mitigate that in our setup with aggressive resubmission policies.
Thank you for your input!
@ksuderman I use the web interface. I can try the API if you think it makes a difference. @pcm32 Yes, the job fails. It looks like this in the UI then.
But yes, I do see this error every now and then in our logs; maybe I don't see it in the UI as an error due to the resubmissions.
When running hundreds of jobs, you're always bound to get some arbitrary errors; we mitigate that in our setup with aggressive resubmission policies.
True, but we are getting reports of the 409 Client Error from other users even with only a handful of jobs, and I've never been able to recreate the error myself. I do get occasional failures when running lots of jobs, but I don't recall them being a 409. I am hoping to find a common underlying cause.
@mapk-amazon no need to try the API, I just want to make sure I am using the same procedure when I try to recreate the problem.
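In case a scripted version of the same procedure is ever useful for reproduction, a rough BioBlend sketch (the URL and API key below are placeholders, not taken from the thread):

```python
import os
from bioblend.galaxy import GalaxyInstance

# Placeholders: point these at the Galaxy instance under test.
gi = GalaxyInstance(url="https://galaxy.example.org", key="YOUR_API_KEY")
history = gi.histories.create_history(name="409-conflict-repro")

# Upload 100 x 1 MB files of random content, mirroring the manual test above.
for i in range(100):
    path = f"/tmp/random_{i}.bin"
    with open(path, "wb") as fh:
        fh.write(os.urandom(1024 * 1024))
    gi.tools.upload_file(path, history["id"])
```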
Update: I believe I now know what is happening. In my understanding, the aggressive "retries" are the root cause of the issue.
The job pod (the one scheduling the pods) shows that, for failing pods, Galaxy receives the information about the pod twice:
DEBUG 2024-10-14 20:12:36,480 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] Job id: gxy-galaxy-dkpc5 with k8s id: gxy-galaxy-dkpc5 succeeded
DEBUG 2024-10-14 20:12:38,484 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] Job id: gxy-galaxy-dkpc5 with k8s id: gxy-galaxy-dkpc5 succeeded
Then it starts cleaning up (twice) and one attempt fails, as the other one has already deleted (or started deleting) the job. Finally, it shows tool_stdout and tool_stderr twice:
DEBUG 2024-10-14 20:12:54,185 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-1] (3464/gxy-galaxy-dkpc5) tool_stdout:
DEBUG 2024-10-14 20:12:54,186 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-1] (3464/gxy-galaxy-dkpc5) job_stdout:
DEBUG 2024-10-14 20:12:54,186 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-1] (3464/gxy-galaxy-dkpc5) tool_stderr:
DEBUG 2024-10-14 20:12:54,186 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-1] (3464/gxy-galaxy-dkpc5) job_stderr: Job output not returned from cluster
It seems the first pass already moved the data and the second one could no longer find the files.
The result is a technically successful job (the container finished) whose results were processed successfully once, while the second, later iteration responds with an error and Galaxy believes the job failed.
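If that hypothesis holds, a de-duplication guard in the monitor loop would be one way to make the second pass harmless. A minimal, self-contained sketch (all names here are hypothetical illustrations, not Galaxy's actual internals):

```python
# Hypothetical sketch: remember which k8s jobs were already finalized so a
# duplicate "succeeded" event from the monitor thread becomes a no-op.
finished_job_ids: set[str] = set()

def finalize(k8s_job_id: str) -> None:
    print(f"collecting outputs and cleaning up {k8s_job_id}")

def handle_succeeded_job(k8s_job_id: str) -> None:
    if k8s_job_id in finished_job_ids:
        # Second pass: outputs were already moved, so repeating the work
        # would fail exactly as in the logs above.
        return
    finished_job_ids.add(k8s_job_id)
    finalize(k8s_job_id)

handle_succeeded_job("gxy-galaxy-dkpc5")
handle_succeeded_job("gxy-galaxy-dkpc5")  # duplicate event is ignored
```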
Update 2: I believe I was wrong (yet again). Please take a look at the PR https://github.com/galaxyproject/galaxy/pull/19001 :)
Setup
The setup is deployed on AWS EKS:
Issue
Galaxy "usually" deploys jobs just fine. We started importing with Batch files into Galaxy and experience random failures of pods.
Logs
In the k8s log we also see that the pod was launched around that time:
Ideas/Hypothesis
Current ideas are that the hash (e.g. f4b62) has a collision and leads to resource conflicts for the pods and to failures of some jobs. Does the team have any experience with this? Any fixes? Thank you :)
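For scale, a back-of-the-envelope estimate of how likely such a name-suffix collision would be. This assumes a 5-character suffix drawn uniformly from a 36-symbol alphabet, which is only an approximation of how the names are actually generated:

```python
import math

def collision_probability(num_jobs: int, alphabet: int = 36, length: int = 5) -> float:
    """Birthday-bound estimate: P(at least one duplicate suffix) among num_jobs."""
    space = alphabet ** length
    return 1.0 - math.exp(-num_jobs * (num_jobs - 1) / (2.0 * space))

for n in (100, 1_000, 10_000):
    print(f"{n:>6} jobs -> ~{collision_probability(n):.4%} chance of a duplicate suffix")
```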