We recently received a similar report and I originally thought it may be related to Kubernetes 1.30 and the pykube-ng version we use. However, you are using 1.28 and I have been unable to recreate the problem. The one common thread is EKS. I will investigate that next.
I can test with various EKS versions; however, I am not sure how to build a minimal example with pykube-ng. If you have a snippet that produces a similar effect to Galaxy job scheduling, I can test it and report back :)
Is there a stack trace? Or can the verbosity level be increased to produce one? If not, I think we have a problem with the error being inadequately logged, and we need to figure out which line of code is generating the exception.
Most likely, this is caused by a race condition between k8s modifying the job status and the runner attempting to read and modify the manifest itself. As mentioned earlier, the resulting resourceVersion conflict would cause this error. So if we re-queue the current task whenever this error is encountered, I would expect the runner thread to eventually fetch the latest version and succeed.
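A minimal sketch of what that re-queue/retry could look like at the pykube level (a hypothetical helper, not the runner's actual code; the retry count and the reload-then-retry loop are my assumptions):

```python
import pykube

def scale_down_with_retry(job: pykube.Job, max_attempts: int = 3) -> None:
    """Hypothetical helper: on a 409 Conflict, re-read the Job so the next
    patch carries the latest resourceVersion, then try again."""
    for attempt in range(max_attempts):
        try:
            job.scale(replicas=0)
            return
        except pykube.exceptions.HTTPError as e:
            # pykube raises HTTPError(status_code, message); anything other
            # than a conflict, or the final attempt, is re-raised.
            if getattr(e, "code", None) != 409 or attempt == max_attempts - 1:
                raise
            job.reload()  # fetch the latest manifest before retrying
```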
This is "the most" detailed log I get:
galaxy.jobs.runners.kubernetes ERROR 2024-07-29 13:46:58,387 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] Could not clean up k8s batch job. Ignoring...
Traceback (most recent call last):
  File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/http.py", line 403, in raise_for_status
    resp.raise_for_status()
  File "/galaxy/server/.venv/lib/python3.12/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 409 Client Error: Conflict for url: https://172.20.0.1:443/apis/batch/v1/namespaces/galaxy/jobs/gxy-galaxy-4db2n

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/galaxy/server/lib/galaxy/jobs/runners/kubernetes.py", line 872, in _handle_job_failure
    self.__cleanup_k8s_job(job)
  File "/galaxy/server/lib/galaxy/jobs/runners/kubernetes.py", line 879, in __cleanup_k8s_job
    delete_job(job, k8s_cleanup_job)
  File "/galaxy/server/lib/galaxy/jobs/runners/util/pykube_util.py", line 108, in delete_job
    job.scale(replicas=0)
  File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/mixins.py", line 31, in scale
    self.update()
  File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/objects.py", line 165, in update
    self.patch(self.obj, subresource=subresource)
  File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/objects.py", line 157, in patch
    self.api.raise_for_status(r)
  File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/http.py", line 410, in raise_for_status
    raise HTTPError(resp.status_code, payload["message"])
pykube.exceptions.HTTPError: Operation cannot be fulfilled on jobs.batch "gxy-galaxy-4db2n": the object has been modified; please apply your changes to the latest version and try again
Thanks. That helps with narrowing things down.
To change/update the pykube-ng version requires building a new galaxy-min docker image. I have limited internet connectivity at the moment, so it is not easy for me to build and push a new image right now, but I'll try to get that done in the next few days.
How do you build the galaxy-min docker image? Is it building this as-is, or is there a "min" configuration somewhere?
@mapk-amazon That's the right image; building it as-is will do the job. If you'd like to test the changes, please try this branch: https://github.com/galaxyproject/galaxy/pull/18514. It has some fixes, including the pykube upgrade, that may solve this issue.
Fwiw @mapk-amazon, you can also use ghcr.io/bioconductor/galaxy:dev, which is the built image from that PR.
Thank you all. I used ghcr.io/bioconductor/galaxy:dev, otherwise the same setup as at the start. I uploaded 100x 1 MB files with random content. It failed for 2 of them with the same error:
galaxy.jobs.runners.kubernetes ERROR 2024-08-06 18:06:32,493 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] Could not clean up k8s batch job. Ignoring...
Traceback (most recent call last):
  File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/http.py", line 437, in raise_for_status
    resp.raise_for_status()
  File "/galaxy/server/.venv/lib/python3.12/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 409 Client Error: Conflict for url: https://172.20.0.1:443/apis/batch/v1/namespaces/galaxy/jobs/gxy-galaxy-vnjqk

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/galaxy/server/lib/galaxy/jobs/runners/kubernetes.py", line 912, in _handle_job_failure
    self.__cleanup_k8s_job(job)
  File "/galaxy/server/lib/galaxy/jobs/runners/kubernetes.py", line 919, in __cleanup_k8s_job
    delete_job(job, k8s_cleanup_job)
  File "/galaxy/server/lib/galaxy/jobs/runners/util/pykube_util.py", line 115, in delete_job
    job.scale(replicas=0)
  File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/mixins.py", line 30, in scale
    self.update()
  File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/objects.py", line 165, in update
    self.patch(self.obj, subresource=subresource)
  File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/objects.py", line 157, in patch
    self.api.raise_for_status(r)
  File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/http.py", line 444, in raise_for_status
    raise HTTPError(resp.status_code, payload["message"])
Thanks @mapk-amazon, it sure looks like a race condition. How did you upload the 100 files? Through the UI, API, or other means (bioblend etc)?
While this is shown as an error in the logs, I think the behaviour of the code is harmless; that is why we added that "Ignoring..." part there. Do you actually see the failure in the UI?
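For context, the pattern being referred to is roughly the following (a paraphrase of the idea, not the exact Galaxy source):

```python
import logging

log = logging.getLogger(__name__)

def cleanup_k8s_job(job):
    """Paraphrase of the guard described above: any error during cleanup is
    logged and swallowed so the handler keeps processing other jobs."""
    try:
        job.scale(replicas=0)
        job.delete()
    except Exception:
        log.exception("Could not clean up k8s batch job. Ignoring...")
```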
When running hundreds of jobs, you're always bound to get some arbitrary errors; we mitigate that in our setup with aggressive resubmission policies.
Thank you for your input!
@ksuderman I use the web interface. I can try the API if you think it makes a difference. @pcm32 Yes, the job fails. It looks like this in the UI then.
But yes, I do see this error every now and then in our logs; maybe I don't see it in the UI as an error due to the resubmissions.
When running hundreds of jobs, you're always bound to get some arbitrary errors; we mitigate that in our setup with aggressive resubmission policies.
True, but we are getting reports of the 409 Client Error from other users even with only a handful of jobs, and I've never been able to recreate the error myself. I do get occasional failures when running lots of jobs, but I don't recall them being a 409. I am hoping to find a common underlying cause.
@mapk-amazon no need to try the API, I just want to make sure I am using the same procedure when I try to recreate the problem.
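In case a scripted version of the same procedure is ever useful for reproduction, a rough BioBlend sketch (the URL and API key below are placeholders, not taken from the thread):

```python
import os
from bioblend.galaxy import GalaxyInstance

# Placeholders: point these at the Galaxy instance under test.
gi = GalaxyInstance(url="https://galaxy.example.org", key="YOUR_API_KEY")
history = gi.histories.create_history(name="409-conflict-repro")

# Upload 100 x 1 MB files of random content, mirroring the manual test above.
for i in range(100):
    path = f"/tmp/random_{i}.bin"
    with open(path, "wb") as fh:
        fh.write(os.urandom(1024 * 1024))
    gi.tools.upload_file(path, history["id"])
```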
Update: I believe I now know what is happening. In my understanding, the aggressive "retries" are the root cause of the issue.
The job pod (the one scheduling the pods) shows that, for failing pods, Galaxy receives the information about the pod twice:
DEBUG 2024-10-14 20:12:36,480 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] Job id: gxy-galaxy-dkpc5 with k8s id: gxy-galaxy-dkpc5 succeeded
DEBUG 2024-10-14 20:12:38,484 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] Job id: gxy-galaxy-dkpc5 with k8s id: gxy-galaxy-dkpc5 succeeded
Then it starts cleaning up (twice) and one attempt fails, as the other one has already deleted (or started deleting) the job. Finally, it shows tool_stdout and tool_stderr twice:
DEBUG 2024-10-14 20:12:54,185 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-1] (3464/gxy-galaxy-dkpc5) tool_stdout:
DEBUG 2024-10-14 20:12:54,186 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-1] (3464/gxy-galaxy-dkpc5) job_stdout:
DEBUG 2024-10-14 20:12:54,186 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-1] (3464/gxy-galaxy-dkpc5) tool_stderr:
DEBUG 2024-10-14 20:12:54,186 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-1] (3464/gxy-galaxy-dkpc5) job_stderr: Job output not returned from cluster
It seems the first pass already moved the data and the second one could no longer find the files.
The result is a technically successful job (the container finished) whose results were processed successfully once, while the second, later iteration responds with an error and Galaxy believes the job failed.
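If that hypothesis holds, a de-duplication guard in the monitor loop would be one way to make the second pass harmless. A minimal, self-contained sketch (all names here are hypothetical illustrations, not Galaxy's actual internals):

```python
# Hypothetical sketch: remember which k8s jobs were already finalized so a
# duplicate "succeeded" event from the monitor thread becomes a no-op.
finished_job_ids: set[str] = set()

def finalize(k8s_job_id: str) -> None:
    print(f"collecting outputs and cleaning up {k8s_job_id}")

def handle_succeeded_job(k8s_job_id: str) -> None:
    if k8s_job_id in finished_job_ids:
        # Second pass: outputs were already moved, so repeating the work
        # would fail exactly as in the logs above.
        return
    finished_job_ids.add(k8s_job_id)
    finalize(k8s_job_id)

handle_succeeded_job("gxy-galaxy-dkpc5")
handle_succeeded_job("gxy-galaxy-dkpc5")  # duplicate event is ignored
```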
Update 2: I believe I was wrong (yet again). Please take a look at the PR https://github.com/galaxyproject/galaxy/pull/19001 :)
Setup
The setup is deployed on AWS EKS:
Issue
Galaxy "usually" deploys jobs just fine. We started importing with Batch files into Galaxy and experience random failures of pods.
Logs
In the k8s log we also see that the pod was launched around that time:
Ideas/Hypothesis
Current ideas are that the hash (e.g. f4b62) has a collision and leads to resource conflicts for the pods and to failures of some jobs. Does the team have any experience with this? Any fixes? Thank you :)
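For scale, a back-of-the-envelope estimate of how likely such a name-suffix collision would be. This assumes a 5-character suffix drawn uniformly from a 36-symbol alphabet, which is only an approximation of how the names are actually generated:

```python
import math

def collision_probability(num_jobs: int, alphabet: int = 36, length: int = 5) -> float:
    """Birthday-bound estimate: P(at least one duplicate suffix) among num_jobs."""
    space = alphabet ** length
    return 1.0 - math.exp(-num_jobs * (num_jobs - 1) / (2.0 * space))

for n in (100, 1_000, 10_000):
    print(f"{n:>6} jobs -> ~{collision_probability(n):.4%} chance of a duplicate suffix")
```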