aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io

detailed job info can be unreliable #4646

Open ltalirz opened 3 years ago

ltalirz commented 3 years ago

When retrieving CalcJobs from a SLURM cluster, AiiDA uses the squeue output to determine when to retrieve a calculation, and then does a final call to sacct to get more detailed job information from the scheduler.

https://github.com/aiidateam/aiida-core/blob/aa9a2cb519f96fef24746a7ffb8e5701107f2503/aiida/schedulers/plugins/slurm.py#L233

Recent tests on EPFL's fidis cluster showed that for 10/10 failed calculations, the detailed job information still reported the job state RUNNING.

{'retval': 0,
 'stderr': '',
 'stdout': 'AllocCPUS|Account|AssocID|AveCPU|AvePages|AveRSS|AveVMSize|Cluster|Comment|CPUTime|CPUTimeRAW|DerivedExitCode|Elapsed|Eligible|End|ExitCode|GID|Group|JobID|JobName|MaxRSS|MaxRSSNode|MaxRSSTask|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|MinCPU|MinCPUNode|MinCPUTask|NCPUS|NNodes|NodeList|NTasks|Priority|Partition|QOSRAW|ReqCPUS|Reserved|ResvCPU|ResvCPURAW|Start|State|Submit|Suspended|SystemCPU|Timelimit|TotalCPU|UID|User|UserCPU|
28|lsmo|470|||||fidis||00:13:04|784|0:0|00:00:28|2021-01-05T18:30:48|Unknown|0:0|10697|lsmo|6674932|aiida-491988||||||||||28|1|f296||318|parallel|1|28|00:00:04|00:01:52|112|2021-01-05T18:30:52|RUNNING|2021-01-05T18:30:48|00:00:00|00:00:00|2-02:00:00|00:00:00|162182|ongari|00:00:00|
'}

(replaced \n by line breaks for better readability).
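For inspecting a record like the one above, the pipe-separated stdout can be turned into a dictionary keyed by column name. The following is only a minimal sketch, assuming the text consists of a header line followed by one line per record, each ending in a trailing `|` as in the output shown. The function name `parse_sacct_stdout` is hypothetical and not part of AiiDA, and the sample string is shortened to a few of the columns.

```python
def parse_sacct_stdout(stdout):
    """Turn pipe-delimited text (header line, then one line per record) into a list of dicts."""

    def split_fields(line):
        # Drop exactly one trailing '|' so that an empty last field is preserved.
        if line.endswith('|'):
            line = line[:-1]
        return line.split('|')

    lines = [line for line in stdout.splitlines() if line.strip()]
    if not lines:
        return []

    header = split_fields(lines[0])
    records = []
    for line in lines[1:]:
        values = split_fields(line)
        if len(values) != len(header):
            raise ValueError(f'expected {len(header)} fields, got {len(values)}')
        records.append(dict(zip(header, values)))
    return records


# Shortened sample: only a few of the columns from the record above.
sample = (
    'End|ExitCode|JobID|JobName|State|\n'
    'Unknown|0:0|6674932|aiida-491988|RUNNING|\n'
)
records = parse_sacct_stdout(sample)
print(records[0]['JobID'], records[0]['State'], records[0]['End'])
```

Looking fields up by name rather than by position makes it easier to spot the two suspicious values in this record, `State` and `End`.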

Running the command manually at a later point in time correctly reports the job as failed:

[ongari@fidis ~]$ sacct -j 6674932
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
6674932      aiida-491+   parallel       lsmo         28     FAILED      9:0
6674932.bat+      batch                  lsmo         28     FAILED      9:0
6674932.ext+     extern                  lsmo         28  COMPLETED      0:0
6674932.0     cp2k.popt                  lsmo         28     FAILED      1:0

This seems to indicate that the information available to sacct was outdated with respect to the information returned by squeue (which AiiDA uses to determine when to retrieve a calculation).

AiiDA does not process the detailed job info automatically; it is purely informational. Nevertheless, it would be important to know when this information can be trusted.
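One possible heuristic for flagging untrustworthy records is sketched below. It is not something AiiDA does. It assumes the record was fetched after the job had already disappeared from squeue, so a state that still looks unfinished, or an `End` of `Unknown`, suggests the accounting data had not caught up yet. The function name `looks_stale` and the set of non-terminal states are assumptions made for illustration.

```python
# Assumption: states in which a job has not finished yet (not an exhaustive list).
NON_TERMINAL_STATES = {'PENDING', 'RUNNING', 'SUSPENDED', 'REQUEUED', 'RESIZING'}


def looks_stale(record):
    """Return True if a record fetched after the job left the queue still looks unfinished.

    `record` is a dict mapping column names to string values.
    """
    # Use only the first word, so that a state with trailing details still matches.
    state_words = record.get('State', '').split()
    state = state_words[0] if state_words else ''
    end = record.get('End', '')
    # A missing state is treated as stale too, since nothing can be concluded from it.
    return not state or state in NON_TERMINAL_STATES or end == 'Unknown'


# Values taken from the two sacct outputs in this issue.
at_retrieval = {'JobID': '6674932', 'State': 'RUNNING', 'End': 'Unknown'}
queried_later = {'JobID': '6674932', 'State': 'FAILED', 'ExitCode': '9:0'}

print(looks_stale(at_retrieval))
print(looks_stale(queried_later))
```

A check like this could only mark a record as suspicious. It cannot say how long to wait before the accounting data is up to date, which is the question for the cluster maintainers.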

We are looking further into the issue on the cluster.

Mentioning @danieleongari for info

giovannipizzi commented 3 years ago

Indeed, it would be good to have an answer on this from the cluster maintainers.