Open ltalirz opened 5 years ago
See the mailing list for the initial discussion.
@sphuber writes:
This would require, however, that this functionality be supported by all the scheduler plugins.
That would of course be ideal but I would say this is not necessary in order to get started. If a scheduler plugin does not support getting the "job exit status", then we can simply default to the current behavior (which assumes everything went fine).
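Roughly, what I have in mind for the fallback is something like this (the hook name is made up, not an existing API):

```python
# Illustrative sketch only: if a plugin cannot report a job exit status,
# fall back to the current behaviour and assume the job finished fine.
def effective_job_exit_status(scheduler):
    try:
        return scheduler.get_job_exit_status()  # hypothetical plugin hook, not an existing method
    except NotImplementedError:
        return 0  # plugin does not support it: assume success, as we do today
```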
If we just want to know whether a job has reached the end of its execution, a simple solution would be to append a command such as touch .JOB_DONE to the job submission script. This way it won't be cluster/scheduler dependent.
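Just to illustrate the idea (not tied to any particular scheduler plugin API), it only requires adding one line when the submission script text is assembled:

```python
# Illustrative only: append a sentinel command to the generated submission script;
# the presence of the file then signals that the script ran to completion.
def append_done_sentinel(script_text: str, sentinel: str = '.JOB_DONE') -> str:
    return script_text.rstrip('\n') + f'\ntouch {sentinel}\n'
```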
@zhubonan thanks for the suggestion. To me, the purpose of this is more to profit from the information that schedulers already provide (instead of re-implementing this functionality ourselves).
I would also like this feature, as sometimes jobs end up with walltime timeouts or an oom_kill when memory is exhausted, and AiiDA states that everything was fine, which is not really logical.
I will start working on this
@sphuber In view of https://github.com/aiidateam/aiida-core/issues/4326, can I ask: what happens now (in the slurm case) if, upon retrieval, the job is still found to be RUNNING?
This signals a bug in AiiDA's information gathering and could in principle be used to stop retrieval and put the job back online.
Although I closed this issue with PR #3906, I now see that the functionality it provides does not actually address it. The branch that I used to implement it was linked to this issue and I created it a long time ago, so I am not sure why I labelled it as such at the time. I will reopen the issue.
Note that the code to get detailed job info is already present and was addressed in 97658d8570db14aed46780733f561be9459e925e. This is implemented for SLURM, so the final state of the job is stored in the detailed_job_info attribute. This is done in the retrieved task and so really is too late to use it for what you are suggesting. If the update step incorrectly deduces that the job is finished, while it is not, it should be fixed there. Are you saying that it is impossible to have the update step check the status correctly and deterministically?
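For completeness, once the job has been retrieved this information can be inspected on the node roughly like this (untested sketch; the attribute key is as described above, but the sub-key holding the raw sacct output is an assumption on my part):

```python
# Assumes the detailed job info is stored under the 'detailed_job_info' attribute key
# and that, for SLURM, the raw sacct output sits under its 'stdout' sub-key.
from aiida.orm import load_node

node = load_node(1234)  # pk of a finished CalcJobNode, just as an example
detailed = node.get_attribute('detailed_job_info')
print(detailed['stdout'])
```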
This is done in the retrieved task and so really is too late to use it for what you are suggesting
Ok.
If the update step incorrectly deduces that the job is finished, while it is not, it should be fixed there. Are you saying that it is impossible to have the update check the status correctly and deterministically?
We'll try our best but who knows ;-)
On the other hand, it is certainly possible (and has evidently already happened) that the retrieved status is not correct, so at the very least one should warn the user if the job is found to still be RUNNING.
To repeat my question: what is the current behavior in this case?
To repeat my question: what is the current behavior in this case?
There is currently no code in the engine that automatically checks the detailed job info for the state RUNNING and reacts to it. I have an open PR #3931, which implements the new parser method that just got merged in #3906 and actually does inspect this detailed job info. The intention there, however, is to inspect the results assuming the job has actually terminated. We could add logic to revert the CalcJob state, but I don't think that is the way to go. We should really first try all we can to fix the update step. If that really turns out to be impossible, we can start to look at what we could do in the SlurmScheduler.parse_output method to catch it.
The intention there however is to inspect the results assuming the job has actually terminated. We could add logic to revert the CalcJob state, but I don't think that is the way to go. We should really first try all we can to fix the update.
We will certainly try to fix the update, but what about adding at least a simple check that raises a warning or an exception when 'RUNNING' is detected as the job status? Given that these things have happened and I'm not sure how easy it is to completely get rid of them (how do we prove the negative?), I think any extra information we can provide to users is potentially helpful.
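To make it concrete, something as simple as the sketch below would already help. It is untested and assumes the detailed job info stores the raw pipe-separated sacct output under a 'stdout' key, which may not match the actual layout:

```python
def job_reported_as_running(detailed_job_info):
    """Return True if the sacct output in the detailed job info reports the job as RUNNING."""
    # Assumption: the raw, pipe-separated sacct output is stored under the 'stdout' key.
    stdout = detailed_job_info.get('stdout', '')
    lines = [line for line in stdout.splitlines() if line.strip()]
    if len(lines) < 2:
        return False  # header only (or nothing): no data row to check

    header = lines[0].split('|')
    values = lines[1].split('|')
    state = dict(zip(header, values)).get('State', '')
    return state.startswith('RUNNING')
```

Whether this then logs a warning, raises, or returns a dedicated exit code from SlurmScheduler.parse_output is a separate question.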
Once a job has been dropped from the queue, it would be great if AiiDA would query the scheduler for the job status rather than assuming that the job completed correctly (e.g. to detect whether the walltime was exceeded, the job was cancelled via a scheduler command, ...).
When AiiDA thinks a job is finished, it could simply run an sacct query for that job (adding --format=State to just show the state); a sketch is given below.
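Something along these lines (the exact sacct invocation here is only an illustration; in practice it would go through the scheduler plugin and the transport rather than a local subprocess):

```python
# Hypothetical example, not code from the plugin: ask SLURM for the final state
# of a job once AiiDA believes it has left the queue.
import subprocess

def get_slurm_job_state(job_id: str) -> str:
    """Return the State column reported by sacct for the given job id."""
    result = subprocess.run(
        ['sacct', '--jobs', job_id, '--format=State', '--noheader', '--parsable2'],
        capture_output=True, text=True, check=True,
    )
    # The first non-empty line corresponds to the main job (further lines are job steps).
    return next((line.strip() for line in result.stdout.splitlines() if line.strip()), '')
```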
The slurm plugin already has commands that use sacct, e.g. here: https://github.com/aiidateam/aiida_core/blob/6e9711046753332933f982971db1d7ac7e7ade58/aiida/scheduler/plugins/slurm.py#L225
@sphuber suggests implementing this in the form of a built-in generic scheduler output parser that is called before the actual calculation parser.
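Conceptually, the precedence would be something like this (purely a sketch; the names are made up and do not correspond to the actual engine code):

```python
# Purely conceptual: a generic scheduler-output check runs first and takes
# precedence over the exit code produced by the calculation-specific parser.
def final_exit_status(scheduler_exit_status, run_calculation_parser):
    if scheduler_exit_status not in (None, 0):
        return scheduler_exit_status  # the scheduler already tells us the job failed
    return run_calculation_parser()  # otherwise defer to the code-specific parser
```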