Open ltalirz opened 5 years ago
See the mailing list for the initial discussion.
@sphuber writes:
This would require, however, that this functionality be supported by all the scheduler plugins.
That would of course be ideal but I would say this is not necessary in order to get started. If a scheduler plugin does not support getting the "job exit status", then we can simply default to the current behavior (which assumes everything went fine).
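Roughly, what I have in mind for the fallback is something like this (the hook name is made up, not an existing API):

```python
# Illustrative sketch only: if a plugin cannot report a job exit status,
# fall back to the current behaviour and assume the job finished fine.
def effective_job_exit_status(scheduler):
    try:
        return scheduler.get_job_exit_status()  # hypothetical plugin hook, not an existing method
    except NotImplementedError:
        return 0  # plugin does not support it: assume success, as we do today
```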
If we just want to know whether a job has reached the end of its execution, a simple solution would be to append a command such as touch .JOB_DONE to the job submission script. This way it won't be cluster/scheduler dependent.
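Just to illustrate the idea (not tied to any particular scheduler plugin API), it only requires adding one line when the submission script text is assembled:

```python
# Illustrative only: append a sentinel command to the generated submission script;
# the presence of the file then signals that the script ran to completion.
def append_done_sentinel(script_text: str, sentinel: str = '.JOB_DONE') -> str:
    return script_text.rstrip('\n') + f'\ntouch {sentinel}\n'
```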
@zhubonan thanks for the suggestion. To me, the purpose of this is more to profit from the information that schedulers already provide (instead of re-implementing this functionality ourselves).
I would also like this feature, as sometimes jobs end up with walltime timeouts or an oom_kill when memory is exhausted, and AiiDA states that everything was fine, which is not really logical.
I will start working on this
@sphuber In view of https://github.com/aiidateam/aiida-core/issues/4326, can I ask: what happens now (in the slurm case) if, upon retrieval, the job is still found to be RUNNING?
This signals a bug in AiiDA's information gathering and could in principle be used to stop retrieval and put the job back online.
Although I closed this issue with PR #3906, I now see that the functionality it provides does not actually address it. The branch that I used to implement it was linked to this issue and I created it a long time ago, so I am not sure why I labelled it as such at the time. I will reopen the issue.
Note that the code to get detailed job info is already present and was addressed in 97658d8570db14aed46780733f561be9459e925e. This is implemented for SLURM, so the final state of the job is stored in the detailed_job_info attribute. This is done in the retrieved task and so really is too late to use it for what you are suggesting. If the update step incorrectly deduces that the job is finished, while it is not, it should be fixed there. Are you saying that it is impossible to have the update step check the status correctly and deterministically?
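For completeness, once the job has been retrieved this information can be inspected on the node roughly like this (untested sketch; the attribute key is as described above, but the sub-key holding the raw sacct output is an assumption on my part):

```python
# Assumes the detailed job info is stored under the 'detailed_job_info' attribute key
# and that, for SLURM, the raw sacct output sits under its 'stdout' sub-key.
from aiida.orm import load_node

node = load_node(1234)  # pk of a finished CalcJobNode, just as an example
detailed = node.get_attribute('detailed_job_info')
print(detailed['stdout'])
```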
This is done in the retrieved task and so really is too late to use it for what you are suggesting
Ok.
If the update step incorrectly deduces that the job is finished, while it is not, it should be fixed there. Are you saying that it is impossible to have the update check the status correctly and deterministically?
We'll try our best but who knows ;-)
On the other hand, it is certainly possible (and has evidently already happened) that the retrieved status is not correct, so at the very least one should warn the user if the job is found to still be RUNNING.
To repeat my question: what is the current behavior in this case?
To repeat my question: what is the current behavior in this case?
There is currently no code in the engine that automatically checks the detailed job info for the state RUNNING and reacts to it. I have an open PR #3931, which implements the new parser method that just got merged in #3906 and actually does inspect this detailed job info. The intention there, however, is to inspect the results assuming the job has actually terminated. We could add logic to revert the CalcJob state, but I don't think that is the way to go. We should really first try all we can to fix the update step. If that really turns out to be impossible, we can start to look at what we could do in the SlurmScheduler.parse_output method to catch it.
The intention there however is to inspect the results assuming the job has actually terminated. We could add logic to revert the CalcJob state, but I don't think that is the way to go. We should really first try all we can to fix the update.
We will certainly try to fix the update, but what about adding at least a simple check that raises a warning or an exception when 'RUNNING' is detected as the job status? Given that these things have happened and I'm not sure how easy it is to completely get rid of them (how do we prove the negative?), I think any extra information we can provide to users is potentially helpful.
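To make it concrete, something as simple as the sketch below would already help. It is untested and assumes the detailed job info stores the raw pipe-separated sacct output under a 'stdout' key, which may not match the actual layout:

```python
def job_reported_as_running(detailed_job_info):
    """Return True if the sacct output in the detailed job info reports the job as RUNNING."""
    # Assumption: the raw, pipe-separated sacct output is stored under the 'stdout' key.
    stdout = detailed_job_info.get('stdout', '')
    lines = [line for line in stdout.splitlines() if line.strip()]
    if len(lines) < 2:
        return False  # header only (or nothing): no data row to check

    header = lines[0].split('|')
    values = lines[1].split('|')
    state = dict(zip(header, values)).get('State', '')
    return state.startswith('RUNNING')
```

Whether this then logs a warning, raises, or returns a dedicated exit code from SlurmScheduler.parse_output is a separate question.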
Once a job has been dropped from the queue, it would be great if AiiDA would query the scheduler for the job status rather than assuming that the job completed correctly (e.g. to detect whether the walltime was exceeded, the job was cancelled via a scheduler command, ...).
When AiiDA thinks a job is finished, it could simply run an sacct query for that job (adding --format=State to just show the state); a sketch is given below.
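Something along these lines (the exact sacct invocation here is only an illustration; in practice it would go through the scheduler plugin and the transport rather than a local subprocess):

```python
# Hypothetical example, not code from the plugin: ask SLURM for the final state
# of a job once AiiDA believes it has left the queue.
import subprocess

def get_slurm_job_state(job_id: str) -> str:
    """Return the State column reported by sacct for the given job id."""
    result = subprocess.run(
        ['sacct', '--jobs', job_id, '--format=State', '--noheader', '--parsable2'],
        capture_output=True, text=True, check=True,
    )
    # The first non-empty line corresponds to the main job (further lines are job steps).
    return next((line.strip() for line in result.stdout.splitlines() if line.strip()), '')
```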
The slurm plugin already has commands that use sacct, e.g. here: https://github.com/aiidateam/aiida_core/blob/6e9711046753332933f982971db1d7ac7e7ade58/aiida/scheduler/plugins/slurm.py#L225
@sphuber suggests implementing this in the form of a built-in generic scheduler output parser that is called before the actual calculation parser.
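Conceptually, the precedence would be something like this (purely a sketch; the names are made up and do not correspond to the actual engine code):

```python
# Purely conceptual: a generic scheduler-output check runs first and takes
# precedence over the exit code produced by the calculation-specific parser.
def final_exit_status(scheduler_exit_status, run_calculation_parser):
    if scheduler_exit_status not in (None, 0):
        return scheduler_exit_status  # the scheduler already tells us the job failed
    return run_calculation_parser()  # otherwise defer to the code-specific parser
```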