When retrieving CalcJobs from a SLURM cluster, AiiDA uses the `squeue` output to determine when to retrieve a calculation, and then makes a final call to `sacct` to get more detailed job information from the scheduler.
https://github.com/aiidateam/aiida-core/blob/aa9a2cb519f96fef24746a7ffb8e5701107f2503/aiida/schedulers/plugins/slurm.py#L233
Recent tests on EPFL's fidis cluster showed that for 10/10 failed calculations, the detailed job information still reported the job state `RUNNING` (with `\n` replaced by line breaks for better readability).
Running the command manually at a later point in time correctly reports the job as failed.
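
A manual check of this kind could be scripted roughly as below. This is only a sketch, not what AiiDA does internally: the helper names `parse_sacct_states` and `query_sacct_states` are made up for illustration, and it assumes `sacct` is called with `--format=JobID,State --noheader --parsable2`, so that each output line has the form `<jobid>|<state>`.

```python
import subprocess


def parse_sacct_states(output: str) -> dict[str, str]:
    """Map each job (or job step) ID to its state, given `--parsable2` output."""
    states = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        job_id, _, state = line.partition("|")
        # sacct may append details such as "CANCELLED by <uid>"; keep the first word.
        words = state.split()
        states[job_id.strip()] = words[0] if words else ""
    return states


def query_sacct_states(job_id: str) -> dict[str, str]:
    """Run sacct for one job; only works where SLURM accounting is available."""
    result = subprocess.run(
        ["sacct", f"--jobs={job_id}", "--format=JobID,State",
         "--noheader", "--parsable2"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_sacct_states(result.stdout)
```

Calling `query_sacct_states("<jobid>")` right after retrieval and again some time later would show whether the reported state changes between the two calls.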
This seems to indicate that the information available to `sacct` was outdated with respect to the information returned by `squeue` (which AiiDA uses to determine when to retrieve a calculation).

The detailed job info is not processed by AiiDA automatically; it is purely informational. Nevertheless, it would be important to know when this information can be trusted.
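
One possible way to judge when the detailed job info can be trusted would be to re-query the accounting data until it reports a final state. The sketch below is an assumption about how such a check might look, not existing AiiDA behaviour: `wait_for_terminal_state`, its parameters, and the choice of which states count as final are all hypothetical, and `fetch_state` stands for any callable that returns the current `sacct` state of the job.

```python
import time
from typing import Callable, Optional

# SLURM job states that this sketch treats as final (an assumption of the sketch).
TERMINAL_STATES = {
    "BOOT_FAIL", "CANCELLED", "COMPLETED", "DEADLINE", "FAILED",
    "NODE_FAIL", "OUT_OF_MEMORY", "PREEMPTED", "TIMEOUT",
}


def wait_for_terminal_state(
    fetch_state: Callable[[], str],
    attempts: int = 5,
    delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch_state until it returns a final state; return None if it never does."""
    for attempt in range(attempts):
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        # Wait before the next query, but not after the last attempt.
        if attempt < attempts - 1:
            sleep(delay)
    return None
```

A return value of `None` would mean the accounting data still showed a non-final state (for example `RUNNING`) after all attempts, so the stored job info should be treated as possibly stale.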
We are looking further into the issue on the cluster.
Mentioning @danieleongari for info