pcm32 opened this issue
So my suggestion would be, given that only failed Jobs will make it to that check, to drop the `and pod.obj["status"]["phase"] == "Failed"` check.
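For concreteness, here is a minimal, hypothetical sketch of the check being discussed (the function names are made up for illustration and this is not the actual code in lib/galaxy/jobs/runners/kubernetes.py; the status layout follows the k8s Pod API):

# Simplified illustration only; "pod" is a pykube Pod object.
def pod_looks_oom_killed(pod):
    statuses = pod.obj.get("status", {}).get("containerStatuses", [])
    return any(
        cs.get("state", {}).get("terminated", {}).get("reason") == "OOMKilled"
        for cs in statuses
    )

# Current behaviour (roughly): the OOMKilled reason only counts if the pod phase is "Failed".
def current_check(pod):
    return pod.obj["status"]["phase"] == "Failed" and pod_looks_oom_killed(pod)

# Suggested behaviour: drop the phase condition, since only failed Jobs reach this code path.
def suggested_check(pod):
    return pod_looks_oom_killed(pod)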
Actually, it sees OOMKilled jobs as "completed", even though the underlying pod is marked as OOMKilled. The pod looks like this:
Name: gxy-galaxy-dev-6w4wt-kq9kv
Namespace: default
Priority: -1000
Priority Class Name: galaxy-dev-job-priority
Service Account: default
Node: dev-cluster-worker-001/192.168.42.224
Start Time: Tue, 06 Dec 2022 13:48:25 +0000
Labels: app.galaxyproject.org/destination=k8s
app.galaxyproject.org/handler=job_handler_0
app.galaxyproject.org/job_id=504
app.kubernetes.io/component=tool
app.kubernetes.io/instance=gxy-galaxy-dev
app.kubernetes.io/managed-by=galaxy
app.kubernetes.io/name=scanpy_multiplet_scrublet
app.kubernetes.io/part-of=galaxy
app.kubernetes.io/version=1.8.1_3_galaxy0
controller-uid=0d118eb2-488e-437d-9253-5dfbee8f07fa
job-name=gxy-galaxy-dev-6w4wt
Annotations: app.galaxyproject.org/tool_id: toolshed.g2.bx.psu.edu/repos/ebi-gxa/scanpy_multiplet_scrublet/scanpy_multiplet_scrublet/1.8.1+3+galaxy0
cni.projectcalico.org/containerID: 2aa453b49b3008735d10128324365e68cce19abcb71d60d6ef0cc118b6d3ca76
cni.projectcalico.org/podIP:
cni.projectcalico.org/podIPs:
kubernetes.io/psp: global-unrestricted-psp
Status: Succeeded
IP: 10.42.1.68
IPs:
IP: 10.42.1.68
Controlled By: Job/gxy-galaxy-dev-6w4wt
Containers:
k8s:
Container ID: containerd://c782e96108358b6877377ddc3ac48ffcc9c774ba131d1c0631667e11d9f4ca05
Image: quay.io/biocontainers/scanpy-scripts:1.1.3--pypyhdfd78af_0
Image ID: quay.io/biocontainers/scanpy-scripts@sha256:bb2d4ca0fc8a4c3d1f9f14bf3a6de210c3aade633821d0a6ffad3064f5995003
Port: <none>
Host Port: <none>
Command:
/bin/bash
Args:
-c
/galaxy/server/database/jobs_directory/000/504/galaxy_504.sh
State: Terminated
Reason: OOMKilled
Exit Code: 0
Started: Tue, 06 Dec 2022 13:48:27 +0000
Finished: Tue, 06 Dec 2022 13:48:57 +0000
See the State part at the end: the reason is OOMKilled, but the exit code was 0.
And the controlling job is:
Name: gxy-galaxy-dev-6w4wt
Namespace: default
Selector: controller-uid=0d118eb2-488e-437d-9253-5dfbee8f07fa
Labels: app.galaxyproject.org/destination=k8s
app.galaxyproject.org/handler=job_handler_0
app.galaxyproject.org/job_id=504
app.kubernetes.io/component=tool
app.kubernetes.io/instance=gxy-galaxy-dev
app.kubernetes.io/managed-by=galaxy
app.kubernetes.io/name=scanpy_multiplet_scrublet
app.kubernetes.io/part-of=galaxy
app.kubernetes.io/version=1.8.1_3_galaxy0
controller-uid=0d118eb2-488e-437d-9253-5dfbee8f07fa
job-name=gxy-galaxy-dev-6w4wt
Annotations: batch.kubernetes.io/job-tracking:
Parallelism: 0
Completions: 1
Completion Mode: NonIndexed
Start Time: Tue, 06 Dec 2022 13:48:25 +0000
Completed At: Tue, 06 Dec 2022 13:49:02 +0000
Duration: 37s
Active Deadline Seconds: 172800s
Pods Statuses: 0 Active (0 Ready) / 1 Succeeded / 0 Failed
So it sees the pod as succeeded.
This means that early in the process the job gets considered as correctly done: https://github.com/galaxyproject/galaxy/blob/release_22.05/lib/galaxy/jobs/runners/kubernetes.py#L735
and then some other Galaxy process detects it is in error by looking at the error files, I think.
My bet is that we need to make sure that the Galaxy wrapping script returns a non-zero exit code to the pod, so that k8s decides that the pod didn't succeed.
@nuwang would you have some time to look at this?
Running the same command in a container on the same k8s cluster:
root@scanpy-pod:/galaxy/server/database/jobs_directory/000/504/working# PYTHONIOENCODING=utf-8 scanpy-multiplet scrublet --input-format 'anndata' input.h5 --show-obj stdout --output-format anndata output.h5
Killed
root@scanpy-pod:/galaxy/server/database/jobs_directory/000/504/working# echo $?
137
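For reference (a general POSIX convention rather than anything specific to this setup), an exit status of 137 means the process was terminated by signal 9, SIGKILL, which is what the kernel OOM killer sends:

import signal

# POSIX convention: a process killed by signal N exits with status 128 + N.
print(128 + signal.SIGKILL)  # prints 137, i.e. killed by SIGKILL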
And if run directly through a pod, it produces the expected exit code:
Containers:
scanpy:
Container ID: containerd://1609bc42097f36331d9b16d59bbe67b7ce2fcc2f0800c24d12fe6cebbf7e842a
Image: quay.io/biocontainers/scanpy-scripts:1.1.6--pypyhdfd78af_0
Image ID: quay.io/biocontainers/scanpy-scripts@sha256:22cac5db87e70c0761aaee9e0649f83d57765064dc026dcbcb857c276ea8994e
Port: <none>
Host Port: <none>
Command:
/galaxy/server/database/jobs_directory/000/504/direct_run.sh
State: Running
Started: Tue, 06 Dec 2022 15:45:05 +0000
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Tue, 06 Dec 2022 15:44:44 +0000
Finished: Tue, 06 Dec 2022 15:44:51 +0000
So, is the runner able to influence how the galaxy_<job_id>.sh script is written? If you give me a few pointers on where I should be looking, I can attempt a fix.
So Alex, Enis, Keith and I had a long conversation about this, and it took a while to figure out why the exit code was mismatching the OOMKilled message. We finally found this: https://stackoverflow.com/questions/71208960/container-restart-reason-oomkilled-with-exit-code-1 which explains what's going on. Essentially, the OOM kill is indeed delivered, but to the culprit process, not to PID 1. That means that once scanpy-multiplet quits, the Galaxy script continues. The script's normal behaviour is to write the child's exit code to a file and return exit code 0.
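To make the PID 1 point concrete, here is a toy simulation (pure illustration, not Galaxy code; the .ec file name is made up): the parent survives its OOM-killed child, records the child's exit status to a file, and then exits 0 itself, which is the only exit code k8s ever sees.

import subprocess

# Toy stand-in for what the galaxy_<job_id>.sh wrapper effectively does.
# The child plays the role of the tool and gets SIGKILLed, as the OOM killer would.
child = subprocess.run(["bash", "-c", "kill -KILL $$"])

# The wrapper records the child's exit status (137 for SIGKILL) to a file...
exit_status = child.returncode if child.returncode >= 0 else 128 - child.returncode
with open("galaxy_job.ec", "w") as ec_file:  # hypothetical file name
    ec_file.write(f"{exit_status}\n")

# ...and exits 0 itself, so the pod phase ends up "Succeeded".
raise SystemExit(0)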
Therefore, it looks like we have two ways to fix this:
a. What you suggested, which is to drop the `and pod.obj["status"]["phase"] == "Failed"` bit.
b. Change the Galaxy shell script to return certain known error codes back to the job manager. OOM would probably be something that qualifies @mvdbeek?
I guess in a wider sense, the issue is about errors that job managers should be made aware of, versus internal tool errors that should be handled internally by Galaxy. In its current iteration, the script swallows and logs all errors, which later translates into a job failure on the Galaxy side.
Thanks for taking a look! I tried option A already and it is not enough, because the success status gets picked up much earlier. So I think we would need to go for option B, or else make the runner rely on the files for the status rather than on what it gets from pykube.
What you suggested, which is to drop the `and pod.obj["status"]["phase"] == "Failed"` bit.
That seems like the less invasive option?
b. Change the Galaxy shell script to return certain known error codes back to the job manager. OOM would probably be something that qualifies @mvdbeek?
I think that'd be a big change? Not sure I like the prospect of that, but if you can isolate it to Kubernetes, then why not?
On an unrelated note, y'all might want to migrate to pykube-ng? pykube has not been updated since 2017.
Maybe, if option B is too disruptive, the k8s Job could read the galaxy_<job_id>.ec file and return the error code from there right after executing the galaxy_<job_id>.sh script? It is a bit nasty, but it should do what is needed. Is the .ec file always expected to be there?
Like this?
k8s_container = {
    "name": self.__get_k8s_container_name(ajs.job_wrapper),
    "image": container.container_id,
    # this form of command overrides the entrypoint and allows multi-command
    # command line execution, separated by ;, which is what Galaxy does
    # to assemble the command.
    "command": [ajs.job_wrapper.shell],
    # run the job script, then exit with the code Galaxy recorded in the .ec file
    "args": ["-c", f"{ajs.job_file}; exit $(cat {ajs.exit_code_file})"],
    "workingDir": ajs.job_wrapper.working_directory,
    "volumeMounts": deduplicate_entries(mounts),
}
It works ;-)
Containers:
k8s:
Container ID: containerd://5e6ff9702e6f3e0b40b5e005ca370556d06e39b8e6a18cbd1176b2d3ef142350
Image: quay.io/biocontainers/scanpy-scripts:1.1.3--pypyhdfd78af_0
Image ID: quay.io/biocontainers/scanpy-scripts@sha256:bb2d4ca0fc8a4c3d1f9f14bf3a6de210c3aade633821d0a6ffad3064f5995003
Port: <none>
Host Port: <none>
Command:
/bin/bash
Args:
-c
/galaxy/server/database/jobs_directory/000/507/galaxy_507.sh; exit $(cat /galaxy/server/database/jobs_directory/000/507/galaxy_507.ec)
State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Tue, 06 Dec 2022 20:50:57 +0000
Finished: Tue, 06 Dec 2022 20:51:28 +0000
Let me know if you are happy with the approach and I'll open a PR.
There are still other issues with the runner and the OOM detection, but the above takes us closer.
Ok, I have a whole set of changes that make this work:
My thinking is that it's safer to not modify the job shell command, because the job can fail for reasons other than OOM, and k8s would now attempt to handle those failures as well, which it has no real idea how to do. It'll probably end up blindly retrying the jobs, to no avail.
Instead, can't we add some extra processing here? https://github.com/galaxyproject/galaxy/blob/2a6f9c2a297a2ce8e4f501fc3b3e9e2d103d59d2/lib/galaxy/jobs/runners/kubernetes.py#L735
Although it's marked as succeeded, we can always retrieve the pod and check for the OOMKilled state regardless? Or perhaps, whenever "Reason" is not empty, we can assume that it's a k8s level error, and cause the job to be retried?
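Something along those lines could look like this (a minimal sketch using pykube; the function name is made up and this is not the actual runner code; pods are selected via the job-name label visible in the describe output above):

import pykube

def job_pods_oom_killed(api, job_name, namespace="default"):
    """Return True if any pod of an (apparently succeeded) Job was OOMKilled."""
    pods = pykube.Pod.objects(api).filter(
        namespace=namespace,
        selector={"job-name": job_name},
    )
    for pod in pods:
        for cs in pod.obj.get("status", {}).get("containerStatuses", []):
            terminated = (
                cs.get("state", {}).get("terminated")
                or cs.get("lastState", {}).get("terminated")
                or {}
            )
            if terminated.get("reason") == "OOMKilled":
                return True
    return False

# Usage sketch:
# api = pykube.HTTPClient(pykube.KubeConfig.from_file())
# if job_pods_oom_killed(api, "gxy-galaxy-dev-6w4wt"):
#     ...treat the "succeeded" job as an OOM failure...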
My concerns are the following:
Catching the state at line 732 means that for every single job (because they will all be "successful"), and not only the failed ones, we will be making extra calls to the k8s API, substantially increasing traffic on a busy setup and increasing the possibility of failures due to missed / unattended k8s API calls. Checking at that level for the different types of failures means extra calls to the API to learn about the pods individually; with the solution I proposed, we learn whether the pod failed from the original call for the k8s Job object, since the pod exit code reflects the underlying Galaxy job exit code.
k8s would now attempt to handle those failures as well, which it has no real idea how to do.
Well, the runner is doing very specific things:
It'll probably end up blindly retrying the jobs, to no avail.
You mean due to the pod retrials? Agreed, but this part should go anyway (I would suggest that the new default is retrials = 0). The other option would be to move completely away from k8s Jobs and deal directly with Pods. Or did you have something else in mind when saying that k8s would end up blindly retrying?
No, that's pretty much what I meant.
I can't say I have a very clear view on this:
If the job script is swallowing the 137 OOM error, then perhaps the OOM should have been handled in Galaxy's post-job error processing, and not by the job handler? Somewhere internally within Galaxy, this should happen: https://github.com/galaxyproject/galaxy/blob/f2c221f48402259d3f9455e1d094950c68b66078/lib/galaxy/jobs/runners/kubernetes.py#L803
Of course, then it's no longer a runner_state, because the runner doesn't even know about it.
If on the other hand, the job script is modified to always return the error to k8s, maybe that's ok too. The job would be marked as failed, but that would be ok as long as Galaxy doesn't treat it as a failure at the DRM level, but instead, at the job level.
So, how do you want me to proceed:
1.- Check at the beginning, for every successful job, whether it in fact failed due to OOM, and leave the failed-job detection to Galaxy to avoid the extra pods. This has the downside of more API calls, with no immediate mitigation that I can think of, at least.
or
2.- Make k8s aware that the Galaxy process failed (including OOMKilled) by passing the tool exit code to k8s, and let the runner handle the failure. This has the immediate effect of potentially making k8s keep creating more pods for that job. I could bundle a change to the retrials default, or we could add an option to set the Job's spec.backoffLimit so that there is no repetition after a failure (see the sketch below).
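For option 2, the relevant knobs would be roughly these (an illustrative Python fragment of a batch/v1 Job spec, not what the runner currently emits):

# Illustrative fragment only.
k8s_container = {"name": "k8s", "image": "..."}  # stands in for the container dict shown earlier

# With backoffLimit 0, k8s does not create replacement pods after a failure,
# so the failure propagates straight back to the Galaxy runner instead of
# being blindly retried.
k8s_job_spec = {
    "backoffLimit": 0,
    "activeDeadlineSeconds": 172800,
    "template": {
        "spec": {
            "restartPolicy": "Never",  # no kubelet-level container restarts either
            "containers": [k8s_container],
        }
    },
}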
Galaxy doesn't treat it as a failure at the DRM level
It would be interesting to see if k8s has ways of signalling this (i.e. "I lost the pod because the node where it was running went away") in the pod's status.
I think Option 2 seems more attractive if we can do it safely, but it could fail if the exit $(cat {ajs.exit_code_file}) line fails. I guess we could try that first, and fall back to option 1 if things don't work out?
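One way to guard against that would be something like this (a hedged sketch only; falling back to exit code 1 is my assumption of a reasonable default, not existing behaviour):

# Hypothetical hardening of the "args" entry from the container spec above
# (ajs is the job state object used there). If galaxy_<job_id>.ec is missing
# or empty, exit 1 instead of whatever $(cat ...) expands to.
args = [
    "-c",
    f"{ajs.job_file}; "
    f"ec=$(cat {ajs.exit_code_file} 2>/dev/null); "
    f'exit "${{ec:-1}}"',
]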
One thing I'm still a bit lost with is that, on failure, the database and UI don't seem to capture the stdout and stderr of the tool. On a successful job those make it somehow from ../outputs/tool_stderr and ../outputs/tool_stdout to galaxy_{id_tag}.e and galaxy_{id_tag}.o respectively (I don't seem to find where), and then those are added into the database through _job_io_for_db() calls in the finish part. On failure though, ../outputs/tool_stderr and ../outputs/tool_stdout never make it to galaxy_{id_tag}.e and galaxy_{id_tag}.o, nor are they uploaded to the DB.
Previously this worked because all jobs went through the finish method, but doing that breaks the resubmissions I think.
Any pointers appreciated.
Locally, I added a def fail_job that overrides the parent to call _job_io_for_db(), but since galaxy_{id_tag}.e and galaxy_{id_tag}.o are never populated, it makes no difference :-(.
../outputs/tool_stderr and ../outputs/tool_stdout make it to galaxy_{id_tag}.e and galaxy_{id_tag}.o
This looks wrong to me. Is that a mistake in the k8s runner? To my understanding both are separate files with separate meanings and the condor runner has different content in them.
This is fixed in #15238.
Describe the bug
Possibly due to the handling of the error code through the galaxy_<job_id>.sh script wrapper (in contrast, the tool_script.sh doesn't seem to impose any exit codes), when the k8s runner checks whether the pod failed due to low memory: https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/jobs/runners/kubernetes.py#L858-L862
the pykube call seems to no longer get a "Failed" in pod.obj["status"]["phase"]. Looking through pykube in a venv against a k8s Job and Pod that failed due to OOMKilled, the phase comes back as Succeeded (possibly because Galaxy catches the error code, I presume).
Galaxy Version and/or server at which you observed the bug
Galaxy 22.05 as provisioned by the Galaxy Helm Chart.
To Reproduce
Steps to reproduce the behavior: see the pykube check described above.
Expected behavior
The Jobs should be marked as failed due to OOM.
Screenshots
However, what you see is just the job being marked as failed in Galaxy, not as failed due to OOM.
Additional context
My interest here is to re-enable automatic resubmissions on OOM, but for that the job first needs to be properly tagged.