galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

Kubernetes runner doesn't detect OOMKilled properly #15118

Open pcm32 opened 1 year ago

pcm32 commented 1 year ago

Describe the bug

Possibly due to how the exit code is handled by the galaxy_<job_id>.sh wrapper script (in contrast, tool_script.sh doesn't seem to impose any exit codes), when the k8s runner checks whether the pod failed due to low memory:

https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/jobs/runners/kubernetes.py#L858-L862

        pod = self._get_pod_for_job(job_state)
        if (
            pod
            and pod.obj["status"]["phase"] == "Failed"
            and pod.obj["status"]["containerStatuses"][0]["state"]["terminated"]["reason"] == "OOMKilled"
        ):

the pykube call no longer seems to return "Failed" in pod.obj["status"]["phase"].

Inspecting, through pykube in a venv, a k8s job and pod that failed due to OOMKilled:

kubectl get pods | grep dev-nj64n
gxy-galaxy-dev-nj64n-d9xxx                                        0/1     OOMKilled   0          2m29s
>>> from pykube.config import KubeConfig
>>> from pykube.exceptions import HTTPError
>>> from pykube.http import HTTPClient
>>> from pykube.objects import Job, Pod
>>> pykube_client = HTTPClient(KubeConfig.from_file("myKube.yaml"))
>>> job_name = "gxy-galaxy-dev-nj64n"
>>> pod = Pod.objects(pykube_client).filter(selector=f"job-name={job_name}", namespace="default")
>>> pod
<pykube.query.Query object at 0x104831d20>
>>> pod.response["items"][0]['status']['containerStatuses'][0]['state']['terminated']['reason']
'OOMKilled'
>>> pod.response["items"][0]['status']['phase']
'Succeeded'

The phase comes back as Succeeded (possibly because Galaxy catches the error code, I presume).

Galaxy Version and/or server at which you observed the bug

Galaxy 22.05, as provisioned by the Galaxy Helm Chart.

To Reproduce

Steps to reproduce the behavior:

See the Python session in the section above.

Expected behavior

The Jobs should be marked as failed due to OOM.

Screenshots

However, what you see is just the job being marked as failed in Galaxy, not as failed due to OOM: [screenshot]

Additional context

My interest here is to re-enable automatic resubmissions based on OOM, but for that the job first needs to be properly tagged.

pcm32 commented 1 year ago

So my suggestion would be, given that only failed Jobs will make it to that check, to drop the

and pod.obj["status"]["phase"] == "Failed"

check.

pcm32 commented 1 year ago

Actually, it seems that OOMKilled jobs are "completed" even though the underlying pod is marked as OOMKilled. The pod looks like this:

Name:                 gxy-galaxy-dev-6w4wt-kq9kv
Namespace:            default
Priority:             -1000
Priority Class Name:  galaxy-dev-job-priority
Service Account:      default
Node:                 dev-cluster-worker-001/192.168.42.224
Start Time:           Tue, 06 Dec 2022 13:48:25 +0000
Labels:               app.galaxyproject.org/destination=k8s
                      app.galaxyproject.org/handler=job_handler_0
                      app.galaxyproject.org/job_id=504
                      app.kubernetes.io/component=tool
                      app.kubernetes.io/instance=gxy-galaxy-dev
                      app.kubernetes.io/managed-by=galaxy
                      app.kubernetes.io/name=scanpy_multiplet_scrublet
                      app.kubernetes.io/part-of=galaxy
                      app.kubernetes.io/version=1.8.1_3_galaxy0
                      controller-uid=0d118eb2-488e-437d-9253-5dfbee8f07fa
                      job-name=gxy-galaxy-dev-6w4wt
Annotations:          app.galaxyproject.org/tool_id: toolshed.g2.bx.psu.edu/repos/ebi-gxa/scanpy_multiplet_scrublet/scanpy_multiplet_scrublet/1.8.1+3+galaxy0
                      cni.projectcalico.org/containerID: 2aa453b49b3008735d10128324365e68cce19abcb71d60d6ef0cc118b6d3ca76
                      cni.projectcalico.org/podIP:
                      cni.projectcalico.org/podIPs:
                      kubernetes.io/psp: global-unrestricted-psp
Status:               Succeeded
IP:                   10.42.1.68
IPs:
  IP:           10.42.1.68
Controlled By:  Job/gxy-galaxy-dev-6w4wt
Containers:
  k8s:
    Container ID:  containerd://c782e96108358b6877377ddc3ac48ffcc9c774ba131d1c0631667e11d9f4ca05
    Image:         quay.io/biocontainers/scanpy-scripts:1.1.3--pypyhdfd78af_0
    Image ID:      quay.io/biocontainers/scanpy-scripts@sha256:bb2d4ca0fc8a4c3d1f9f14bf3a6de210c3aade633821d0a6ffad3064f5995003
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
    Args:
      -c
      /galaxy/server/database/jobs_directory/000/504/galaxy_504.sh
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    0
      Started:      Tue, 06 Dec 2022 13:48:27 +0000
      Finished:     Tue, 06 Dec 2022 13:48:57 +0000

See the State part at the end: the reason is OOMKilled but the exit code was 0.

And the controlling job is:

Name:                     gxy-galaxy-dev-6w4wt
Namespace:                default
Selector:                 controller-uid=0d118eb2-488e-437d-9253-5dfbee8f07fa
Labels:                   app.galaxyproject.org/destination=k8s
                          app.galaxyproject.org/handler=job_handler_0
                          app.galaxyproject.org/job_id=504
                          app.kubernetes.io/component=tool
                          app.kubernetes.io/instance=gxy-galaxy-dev
                          app.kubernetes.io/managed-by=galaxy
                          app.kubernetes.io/name=scanpy_multiplet_scrublet
                          app.kubernetes.io/part-of=galaxy
                          app.kubernetes.io/version=1.8.1_3_galaxy0
                          controller-uid=0d118eb2-488e-437d-9253-5dfbee8f07fa
                          job-name=gxy-galaxy-dev-6w4wt
Annotations:              batch.kubernetes.io/job-tracking:
Parallelism:              0
Completions:              1
Completion Mode:          NonIndexed
Start Time:               Tue, 06 Dec 2022 13:48:25 +0000
Completed At:             Tue, 06 Dec 2022 13:49:02 +0000
Duration:                 37s
Active Deadline Seconds:  172800s
Pods Statuses:            0 Active (0 Ready) / 1 Succeeded / 0 Failed

So it sees the pod as succeeded.

This means that early in the process it gets considered as correctly done: https://github.com/galaxyproject/galaxy/blob/release_22.05/lib/galaxy/jobs/runners/kubernetes.py#L735

Then some other Galaxy process detects that it is in error by looking at the error files, I think.

My bet is that we need to make sure that the galaxy wrapping script returns a non-zero exit code to the pod, so that k8s decides that the pod didn't succeed.

mvdbeek commented 1 year ago

@nuwang would you have some time to look at this?

pcm32 commented 1 year ago

Doing the same process on a container in the same k8s cluster:

root@scanpy-pod:/galaxy/server/database/jobs_directory/000/504/working# PYTHONIOENCODING=utf-8 scanpy-multiplet scrublet   --input-format 'anndata' input.h5   --show-obj stdout --output-format anndata output.h5
Killed
root@scanpy-pod:/galaxy/server/database/jobs_directory/000/504/working# echo $?
137
pcm32 commented 1 year ago

And when run directly through a pod, it produces the expected exit code:

Containers:
  scanpy:
    Container ID:  containerd://1609bc42097f36331d9b16d59bbe67b7ce2fcc2f0800c24d12fe6cebbf7e842a
    Image:         quay.io/biocontainers/scanpy-scripts:1.1.6--pypyhdfd78af_0
    Image ID:      quay.io/biocontainers/scanpy-scripts@sha256:22cac5db87e70c0761aaee9e0649f83d57765064dc026dcbcb857c276ea8994e
    Port:          <none>
    Host Port:     <none>
    Command:
      /galaxy/server/database/jobs_directory/000/504/direct_run.sh
    State:          Running
      Started:      Tue, 06 Dec 2022 15:45:05 +0000
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 06 Dec 2022 15:44:44 +0000
      Finished:     Tue, 06 Dec 2022 15:44:51 +0000
pcm32 commented 1 year ago

So, is the runner able to influence how the galaxy_<job_id>.sh script is written? If you give me a few pointers on where I should be looking, I can attempt a fix.

nuwang commented 1 year ago

So Alex, Enis, Keith and I had a long conversation about this, and it took a while to figure out why the exit code didn't match the OOMKilled reason. We finally found this: https://stackoverflow.com/questions/71208960/container-restart-reason-oomkilled-with-exit-code-1 which explains what's going on. Essentially, the OOM kill is indeed delivered, but to the culprit process, not to PID 1. This means that once scanpy-multiplet is killed, the galaxy script continues; the script's normal behaviour is to write the child's exit code to a file and return exit code 0.

Therefore, it looks like we have two ways to fix this:

a. What you suggested, which is to drop the `and pod.obj["status"]["phase"] == "Failed"` bit (sketched just below).

b. Change the Galaxy shell script to return certain known error codes back to the job manager. OOM would probably be something that qualifies @mvdbeek?
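
A minimal sketch of what option (a) could look like, keeping everything else as in the check quoted at the top of the issue (this is only an illustration, not the actual change):

        # sketch of option (a): drop the phase check and rely only on the container's
        # terminated reason, since the phase can read "Succeeded" even after an OOM kill
        pod = self._get_pod_for_job(job_state)
        if (
            pod
            and pod.obj["status"]["containerStatuses"][0]["state"]["terminated"]["reason"] == "OOMKilled"
        ):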

I guess in a wider sense, the issue is about errors that job managers should be made aware of, versus internal tool errors that should be handled internally by Galaxy. In its current iteration, the script is swallowing and logging all errors, which later translates on the Galaxy side into a job failure.

pcm32 commented 1 year ago

Thanks for taking a look! I tried option A already and it is not enough, because the Succeeded status gets picked up much earlier. So I think we would need to go for option B, or else make the runner rely on the files for the status rather than on what it gets from pykube.

mvdbeek commented 1 year ago

What you suggested, which is to drop the `and pod.obj["status"]["phase"] == "Failed"` bit.

That seems like the less invasive option ?

b. Change the Galaxy shell script to return certain known error codes back to the job manager. OOM would probably be something that qualifies @mvdbeek?

I think that'd be a big change? Not sure I like the prospects of that, but if you can isolate that to kubernetes, then why not?

On an unrelated note, y'all might want to migrate to pykube-ng? pykube has not been updated since 2017.

pcm32 commented 1 year ago

Maybe, if option B is too disruptive, the k8s Job could read the galaxy_<job_id>.ec file and return the exit code from there right after executing the galaxy_<job_id>.sh script? It is a bit nasty but should do what is needed. Is the .ec file always expected to be there?

pcm32 commented 1 year ago

Like this?

k8s_container = {
    "name": self.__get_k8s_container_name(ajs.job_wrapper),
    "image": container.container_id,
    # this form of command overrides the entrypoint and allows multi-command
    # command line execution, separated by ;, which is what Galaxy does
    # to assemble the command.
    "command": [ajs.job_wrapper.shell],
    # propagate the recorded tool exit code to the pod after the wrapper finishes
    "args": ["-c", f"{ajs.job_file}; exit $(cat {ajs.exit_code_file})"],
    "workingDir": ajs.job_wrapper.working_directory,
    "volumeMounts": deduplicate_entries(mounts),
}
pcm32 commented 1 year ago

It works ;-)

Containers:
  k8s:
    Container ID:  containerd://5e6ff9702e6f3e0b40b5e005ca370556d06e39b8e6a18cbd1176b2d3ef142350
    Image:         quay.io/biocontainers/scanpy-scripts:1.1.3--pypyhdfd78af_0
    Image ID:      quay.io/biocontainers/scanpy-scripts@sha256:bb2d4ca0fc8a4c3d1f9f14bf3a6de210c3aade633821d0a6ffad3064f5995003
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
    Args:
      -c
      /galaxy/server/database/jobs_directory/000/507/galaxy_507.sh; exit $(cat /galaxy/server/database/jobs_directory/000/507/galaxy_507.ec)
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 06 Dec 2022 20:50:57 +0000
      Finished:     Tue, 06 Dec 2022 20:51:28 +0000

Let me know if you are happy with the approach, so that I can PR it.

pcm32 commented 1 year ago

There are still other issues with the runner and the OOM detection, but the above takes us closer.

pcm32 commented 1 year ago

Ok, I have a whole set of changes that make this work:

[screenshot]
nuwang commented 1 year ago

My thinking is that it's safer to not modify the job shell command, because the job can fail for reasons other than OOM, and k8s would now attempt to handle those failures as well, which it has no real idea how to do. It'll probably end up blindly retrying the jobs, to no avail.

Instead, can't we add some extra processing here?: https://github.com/galaxyproject/galaxy/blob/2a6f9c2a297a2ce8e4f501fc3b3e9e2d103d59d2/lib/galaxy/jobs/runners/kubernetes.py#L735

Although it's marked as succeeded, we can always retrieve the pod and check for the OOMKilled state regardless? Or perhaps, whenever "Reason" is not empty, we can assume that it's a k8s level error, and cause the job to be retried?
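
A rough sketch of that idea, purely as an illustration (apart from _get_pod_for_job, which the runner already has, the names and the exact hook point are assumptions, not the actual code):

    # hypothetical extra check in the branch where the k8s Job reports success
    pod = self._get_pod_for_job(job_state)
    terminated = {}
    if pod:
        statuses = pod.obj["status"].get("containerStatuses") or []
        if statuses:
            terminated = statuses[0].get("state", {}).get("terminated") or {}
    if terminated.get("reason") == "OOMKilled":
        # treat as an OOM failure even though the Job object says Succeeded
        mark_job_as_oom_failed(job_state)  # placeholder, not a real runner method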

pcm32 commented 1 year ago

My concerns are the following:

1. Catching the state at line 732 means that for every single job (they will all look "successful"), and not only the failed ones, we would be making extra calls to the k8s API, substantially increasing traffic on a busy setup and increasing the possibility of failure due to missed / unattended k8s API calls.

2. Checking at that level for the different types of failures means extra calls to the API to learn about the pods individually; with the solution I proposed, we learn whether the pod failed or not from the original call on the k8s Job object, since the pod exit code reflects the underlying Galaxy job exit code.

pcm32 commented 1 year ago

k8s would now attempt to handle those failures as well, which it has no real idea how to do.

Well, the runner is doing very specific things:

It'll probably end up blindly retrying the jobs, to no avail.

You mean due to the pod retrials? Agreed, but this part should go anyway (I would suggest that the new default is retrials = 0). The other option would be to move completely outside of k8s Jobs and deal directly with Pods. Or did you have something else in mind when saying that k8s would end up blindly retrying?

nuwang commented 1 year ago

No, that's pretty much what I meant.

I can't say I have a very clear view on this:

  1. If the job script is swallowing the 137 OOM error, then perhaps the OOM should have been handled in Galaxy's post-job error processing, and not the job handler? Somewhere internally within Galaxy, this should happen: https://github.com/galaxyproject/galaxy/blob/f2c221f48402259d3f9455e1d094950c68b66078/lib/galaxy/jobs/runners/kubernetes.py#L803 Of course, then it's no longer a runner_state, because the runner doesn't even know about it.

  2. If on the other hand, the job script is modified to always return the error to k8s, maybe that's ok too. The job would be marked as failed, but that would be ok as long as Galaxy doesn't treat it as a failure at the DRM level, but instead, at the job level.

pcm32 commented 1 year ago

So, how do you want me to proceed?

1. Check at the beginning, for every successful job, whether it in fact failed due to OOM, and leave the detection of failed jobs to Galaxy, to avoid the extra pods. This has the downside of more API calls, with no immediate mitigation that I can think of, at least.

or

2. Make k8s aware that the Galaxy process failed (including OOMKilled) by passing the tool exit code to k8s, and let the runner handle the failure. This has the immediate effect of potentially making k8s keep sending more pods for that job. I could bundle a change in the retrials default, or we could add an option to set the Job's spec.backoffLimit with a default that avoids repetition after failure (see the sketch below).
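
For reference, a sketch of where such a setting could sit in the Job manifest the runner submits (the surrounding structure is illustrative and reuses names from the earlier snippet; backoffLimit itself is the standard Kubernetes Job field that caps how many times a failed pod is retried before the Job is marked failed):

    k8s_job_obj = {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job_name},
        "spec": {
            # cap Kubernetes-level retries of failed pods; which default Galaxy should
            # pick (and whether it should be configurable) is the open question above
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "containers": [k8s_container],
                    "restartPolicy": "Never",
                },
            },
        },
    }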

pcm32 commented 1 year ago

Galaxy doesn't treat it as a failure at the DRM level

It would be interesting to see if k8s has ways of signalling this (i.e. "I lost the pod because the node where it was running went away") in the pod's status.

nuwang commented 1 year ago

I think Option 2 seems more attractive if we can do it safely, but it could fail if the exit $(cat {ajs.exit_code_file}) line fails. I guess we could try that first, and fall back to option 1 if things don't work out?
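
If it helps, a slightly more defensive variant of the args line from the earlier snippet could fall back to the shell's own exit status whenever the .ec file is missing or empty (same assumed names as before; just a sketch):

    "args": [
        "-c",
        f"{ajs.job_file}; ec=$?; "
        f"if [ -s {ajs.exit_code_file} ]; then exit $(cat {ajs.exit_code_file}); else exit $ec; fi",
    ],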

pcm32 commented 1 year ago

One thing I'm still a bit lost with is that on failure, the database and UI don't seem to capture the stdout and stderr of the tool. On a successful job those make it somehow from ../outputs/tool_stderr and ../outputs/tool_stdout to galaxy_{id_tag}.e and galaxy_{id_tag}.o respectively (I don't seem to find where), and then those are added to the database through _job_io_for_db() calls in the finish part. On failure, though, ../outputs/tool_stderr and ../outputs/tool_stdout never make it to galaxy_{id_tag}.e and galaxy_{id_tag}.o, nor are they uploaded to the DB.

[screenshot]

Previously this worked because all jobs went through the finish method, but doing that breaks the resubmissions I think.

Any pointers appreciated.

pcm32 commented 1 year ago

Locally, I added a def fail_job that overrides the parent's to call _job_io_for_db(), but since galaxy_{id_tag}.e and galaxy_{id_tag}.o are never populated, it makes no difference :-(.

bgruening commented 1 year ago

../outputs/tool_stderr and ../outputs/tool_stdout make it to galaxy_{id_tag}.e and galaxy_{id_tag}.o

This looks wrong to me. Is that a mistake in the k8s runner? To my understanding both are separate files with separate meanings and the condor runner has different content in them.

pcm32 commented 1 year ago

This is fixed in #15238.