Snakemake-Profiles / lsf

Snakemake profile for running jobs on an LSF cluster
MIT License

Handling UNKWN and ZOMBI status #28

Closed leoisl closed 3 years ago

leoisl commented 4 years ago

It seems that when a job is submitted to or running on a host that becomes unreachable (e.g. the host was reachable when the job was submitted to it, but became unreachable while the job was executing), its status becomes UNKWN: https://www.ibm.com/support/pages/how-requeue-lsf-jobs-unknwn-status-when-remote-execution-host-becomes-unreachable

The current profile considers an UNKWN job to still be running, so it will not try to kill it: https://github.com/Snakemake-Profiles/lsf/blob/2e6f23cbea58bb07bde5eff873be6bc87f2a4018/%7B%7Bcookiecutter.profile_name%7D%7D/lsf_status.py#L39

I actually don't know whether it is better to wait for such a job to eventually change its status, or to kill it directly and try to resubmit it. Killing an UNKWN job should be done with bkill -r (see the IBM page above); a plain bkill won't do anything. This takes the job from UNKWN to ZOMBI, and then to EXIT. Currently, the profile does not recognise the ZOMBI status, so I predict the exception

    Unknown job status: 'ZOMBI'

which will cause it to retry fetching the status a few more times, and then eventually give up and check the log.

There are several approaches we could take to handle UNKWN and ZOMBI:

  1. Add ZOMBI to STATUS_TABLE and map it to RUNNING (UNKWN is already handled like this); eventually the ZOMBI job will become EXIT, and then we recognise it as failed (a sketch of this follows the list);
  2. When we see UNKWN, we bkill -r the job; when we see ZOMBI, we report that the job FAILED.
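For concreteness, a minimal sketch of what option 1 could look like. The table entries and the translate_status helper are hypothetical stand-ins, not the exact code in lsf_status.py:

```python
# Sketch of option 1: map ZOMBI to "running" and let LSF eventually
# move the job to EXIT, which we already recognise as failed.
STATUS_TABLE = {
    "PEND": "running",
    "RUN": "running",
    "DONE": "success",
    "EXIT": "failed",
    "UNKWN": "running",  # execution host unreachable; may recover
    "ZOMBI": "running",  # killed with `bkill -r`; will become EXIT
}


def translate_status(lsf_status: str) -> str:
    """Translate an LSF job status into the string snakemake expects."""
    if lsf_status not in STATUS_TABLE:
        raise KeyError(f"Unknown job status: {lsf_status!r}")
    return STATUS_TABLE[lsf_status]
```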

Option 1 seems to need manual intervention though... An UNKWN job might return to a valid state if the execution host becomes reachable again (although I think execution hosts become unreachable when there is an actual issue with the host, which then needs a reboot anyway, so the job is lost anyway?). So the user might want to wait for the UNKWN job to return to a valid state, or bkill -r it, after which it becomes ZOMBI and then EXIT.

Option 2 is more automatic, but requires more development and is more aggressive: as soon as we see UNKWN, we bkill -r the job and resubmit it. I prefer option 2: if the execution host became unreachable, I would usually rather kill the job and submit it to a healthy host than wait an unknown amount of time for the host to maybe become reachable again.

In any case, option 1 is more or less what is already implemented. The user has to kill these jobs manually, and the ZOMBI state is not recognised, but if everything fails we eventually go and look at the LSF log. So this is not an urgent issue, but maybe something nice to fix at some point.

PS: there is a more annoying case where some jobs had RUN status for almost a day without a single line being executed. I think it might be related to this issue, but somehow LSF did not manage to tag these jobs as UNKWN. We can retrieve how much CPU time a job has consumed with bjobs -l:

Sun Aug 30 16:29:56: Resource usage collected.
                     The CPU time used is 71 seconds.

I am sure there is a better way that would allow us to query just the CPU time.
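For instance, newer LSF versions support selecting individual fields with bjobs -o, which would avoid parsing the whole -l output. A hedged sketch, assuming the cpu_used field is available and prints like "71.0 second(s)" (both of which may vary across LSF versions):

```python
import subprocess


def cpu_time_used(job_id: str) -> float:
    """Return the CPU time (in seconds) a job has consumed so far.

    Assumes bjobs supports the -o output-format option with the
    cpu_used field; the parsing matches output like "71.0 second(s)"
    and may need adjusting on other LSF versions.
    """
    result = subprocess.run(
        ["bjobs", "-o", "cpu_used", "-noheader", str(job_id)],
        capture_output=True,
        text=True,
        check=True,
    )
    return float(result.stdout.split()[0])
```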

It would also be nice to deal with this, as my pipelines actually got stuck: I thought they were just taking long, but actually nothing was being run... It seems to me that this issue happens when the execution host somehow can't execute anything. It might also be solvable with an LSF pre-exec command (on the hypothesis that a host which can't execute anything won't be able to execute even a simple echo), or with this kind of periodic resource-usage querying; a sketch of the pre-exec idea is below.
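A hedged sketch of that pre-exec idea: bsub's -E option runs a command on the execution host before the job starts, and if it fails, LSF does not start the job there. The submit function itself is hypothetical, just to show where the flag would go:

```python
import subprocess


def submit_with_probe(jobscript: str) -> str:
    """Submit a jobscript with a trivial pre-exec probe.

    `bsub -E CMD` runs CMD on the execution host before the job; if CMD
    fails, LSF does not start the job there and requeues it. The probe
    is a bare `echo ok`, on the hypothesis that a broken host cannot
    execute even that.
    """
    result = subprocess.run(
        ["bsub", "-E", "echo ok", jobscript],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout  # e.g. "Job <12345> is submitted to queue <normal>."
```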

mbhall88 commented 4 years ago

I have a strong preference for Option 1. In my experience, UNKWN jobs generally progress back to running eventually. I don't think it is our responsibility to deal with people's cluster issues, as we would almost end up implementing a cluster scheduling system ourselves. Each user's cluster will have its own issues and quirks, so I am not keen to start managing "stuck" jobs.

leoisl commented 4 years ago

Agreed!

mbhall88 commented 3 years ago

Sorry @leoisl, I realised we never implemented handling of the ZOMBI status. Also, I've been seeing a lot of UNKWN jobs in the last few days that aren't completing, so I have added an option, when setting up the profile, for the user to say whether they want to wait for UNKWN jobs or kill them (using the method you linked to). I toyed with the idea of requeueing them, but this could create issues with files the job has already created.
I'm currently testing out https://github.com/mbhall88/snakemake-lsf/commit/ca6632106c681e1d3f84d34c2ea6400f659f64b6 locally. If it doesn't crash and burn, I'll put in a PR.
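For reference, a hedged sketch of roughly what that wait-or-kill choice might look like in the status checker; WAIT_FOR_UNKWN and the return strings are hypothetical names, and the real commit may well differ:

```python
import subprocess

# Hypothetical setting filled in by the cookiecutter: wait for UNKWN
# jobs to recover, or force-kill them?
WAIT_FOR_UNKWN = False


def handle_unkwn(job_id: str) -> str:
    """Decide what to report to snakemake for a job in UNKWN state."""
    if WAIT_FOR_UNKWN:
        return "running"  # treat as still running; it may recover
    # `bkill -r` removes the job without waiting for a reply from the
    # (unreachable) execution host; the job becomes ZOMBI, then EXIT.
    subprocess.run(["bkill", "-r", str(job_id)], check=False)
    return "failed"
```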

leoisl commented 3 years ago

Hello, yeah, I really like the PR. I am not sure though whether we are doing proper cleanup, since it is tricky. I am also not sure whether we should even bother. What I mean relates to this (from https://www.ibm.com/support/knowledgecenter/SSWRJV_10.1.0/lsf_admin/job_kill_force.html):

If the job is in UNKNWN state, bkill -r marks the job as ZOMBIE state. bkill -r changes jobs in ZOMBIE state to EXIT, and job resources that LSF monitors are released as soon as LSF receives the first signal.

So if the user wants to kill UNKWN jobs, the profile will issue bkill -r, which will mark the job as ZOMBIE, and only that. It seems that the job resources are not released, so LSF is still keeping the resources (CPUs, RAM, etc.) reserved for the ZOMBIE jobs. I don't actually know exactly how snakemake proceeds in this case. Let me explain the scenario better:

1. Job status is UNKWN;
2. profile issues bkill -r <job id>;
3. Job status goes to ZOMBI;
4. profile returns failed to snakemake: the job failed;
5. snakemake realises the job failed and resubmits it (this is the part where I am in doubt).

The issue here is that it seems we do not do proper cleanup of LSF resources, as we don't issue a second bkill -r after the job becomes ZOMBI. I don't know whether adding this behaviour to the profile (i.e. if the job status == ZOMBI, then bkill -r <job id> to clean up and return "failed") works, because it might be that snakemake just stops tracking the job after step 5. It seems to me that after seeing the job failed, snakemake simply forgets about it and resubmits. This is one way we could maybe do the cleanup (sketched in code after the list):

1. Job status is UNKWN;
2. profile issues bkill -r <job id>;
3. Job status goes to ZOMBI;
4. profile returns "running" to snakemake: the job has not failed yet;
5. snakemake does another status check, and profile sees that the job status is ZOMBI;
6. profile issues another bkill -r <job id>, effectively cleaning up LSF resources allocated to the job;
7. profile returns "failed" to snakemake: the job failed;
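A hedged sketch of this two-pass flow, with hypothetical names and assuming my reading of the LSF docs above is right:

```python
import subprocess

# Hypothetical table for the ordinary statuses (see the earlier sketch).
STATUS_TABLE = {"PEND": "running", "RUN": "running", "DONE": "success", "EXIT": "failed"}


def check_status(job_id: str, lsf_status: str) -> str:
    """Two-pass cleanup: bkill -r on UNKWN, then bkill -r again on ZOMBI."""
    if lsf_status == "UNKWN":
        # First pass: force-kill, but report "running" so snakemake polls
        # again and we get to see the job in ZOMBI state next time.
        subprocess.run(["bkill", "-r", str(job_id)], check=False)
        return "running"
    if lsf_status == "ZOMBI":
        # Second pass: another `bkill -r` to release the resources LSF is
        # still holding, then report the failure to snakemake.
        subprocess.run(["bkill", "-r", str(job_id)], check=False)
        return "failed"
    return STATUS_TABLE.get(lsf_status, "running")
```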

I am describing this in detail because I am unsure exactly how LSF/snakemake behaves (the above is my best guess). And currently our testing framework is based on mocking LSF, so we have to know exactly how LSF works to be able to mock it effectively. I wonder if we could talk to systems to simulate an UNKWN job, confirm whether what I wrote is true, and code accordingly.

PS: all of this might be a misinterpretation on my part?

... and job resources that LSF monitors are released as soon as LSF receives the first signal.

Maybe when they say this, it means that the job resources are released upon receiving the first bkill -r, not the second? Argh, it is tricky if we can't easily reproduce the case... Maybe we don't need to reproduce it, but can just ask systems if they know this better than us (or maybe you already know, and I am misinterpreting things).

mbhall88 commented 3 years ago

I think cleaning up is a good idea. I just noticed I have a bunch of ZOMBI jobs from the last week. bkill -r seems to have worked and moved them to EXIT. I'll put in a cleanup now.