Closed: mikeatm closed this issue 3 months ago
I don't think this is a problem with the SLURM scheduler being unresponsive. Yes, that did happen, but the engine did what it had to do and retried, pausing the process after 5 failed attempts. From the logs, it seems that at some point you played the process again, after which it could once again connect to SLURM. By that time the job had actually finished: it was killed by SLURM due to an out-of-walltime error. This was correctly parsed by the `SlurmScheduler` and assigned the exit code 120.

The real problem is that the `PwBaseWorkChain` didn't handle this exit code by resubmitting the calculation. The solution should be to add a process handler to the `PwBaseWorkChain` for the `ERROR_SCHEDULER_OUT_OF_WALLTIME` exit code, in which case the job should be resubmitted. The only open question is that in this case SLURM killed the job, so QE won't necessarily have had time to write its current state to the restart files. The restart may therefore crash hard due to corrupt output files.
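The decision logic such a handler would implement can be sketched as below. This is a hedged, self-contained sketch only: the constant name `ERROR_SCHEDULER_OUT_OF_WALLTIME` and the value 120 come from the discussion above, but the function `handle_out_of_walltime`, its arguments, and the `restart_files_ok` check are hypothetical stand-ins. A real implementation would be a `@process_handler` method on `PwBaseWorkChain` returning a `ProcessHandlerReport`, not a free function.

```python
# Hypothetical sketch of the out-of-walltime handling logic discussed above.
# In aiida-quantumespresso this would live in a @process_handler method on
# PwBaseWorkChain; here it is reduced to a plain, testable function.

ERROR_SCHEDULER_OUT_OF_WALLTIME = 120  # exit code assigned by the scheduler parser


def handle_out_of_walltime(exit_status: int, restart_files_ok: bool) -> str:
    """Decide what the workchain should do after SLURM kills the job.

    `restart_files_ok` is a hypothetical flag standing in for a check that
    the QE restart files were written completely before the job was killed.
    """
    if exit_status != ERROR_SCHEDULER_OUT_OF_WALLTIME:
        return 'no_action'
    if restart_files_ok:
        # Restart files look usable: resubmit and continue from them.
        return 'restart'
    # SLURM may have killed QE mid-write, leaving corrupt restart files,
    # so the safer fallback is to resubmit from scratch.
    return 'restart_from_scratch'


print(handle_out_of_walltime(120, True))   # restart
print(handle_out_of_walltime(120, False))  # restart_from_scratch
```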
What were the inputs for your calculation? How big was the system and how much time did you request from the scheduler?
Hmm, you are right, looks like I explicitly set the time:
```python
builder = PwBaseWorkChain.get_builder_from_protocol(
    code, struc, protocol='fast', options=options, overrides=inputs_scf
)
```
where options is:
```python
options = {
    'custom_scheduler_commands': (
        '#SBATCH --partition=normal\n'
        '#SBATCH -A physics\n'
        'export OMP_NUM_THREADS={}\n'.format(OMPNUMTHREAD)
    ),
    'resources': {
        'num_machines': NUMMACHINE,
        'num_cores_per_machine': NUMCOREPER,
        'num_mpiprocs_per_machine': NUMCOREPER // OMPNUMTHREAD,
    },
    'max_wallclock_seconds': HOURS * 60 * 60 - 100,
}
```
I picked 100 seconds less than the 24 hour limit. How do I use the full 24 hours with the workflow, given the time set in the QE input via `max_seconds`? Do I set it manually in the overrides? How was this designed to work and restart through the workflow, so that QE is not killed (graceful stop)?
The workflow will automatically set the `CONTROL.max_seconds` parameter to 95% of the requested walltime.
https://github.com/aiidateam/aiida-quantumespresso/blob/9cb1cfa8a70d19af7aaa1b624cf17c8babe93f41/src/aiida_quantumespresso/workflows/pw/base.py#L33
For certain cases, the remaining 5% may still not be sufficient for QE to shut down gracefully. Unfortunately the 95% value is hardcoded and cannot be configured through the inputs. As a temporary solution you can update that value in the source code.
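The relationship between the requested walltime and `CONTROL.max_seconds` can be sketched as follows; `compute_max_seconds` is a hypothetical helper illustrating the hardcoded 0.95 safety factor, not the workchain's actual code:

```python
# Hypothetical illustration of how the workchain derives CONTROL.max_seconds
# from the scheduler walltime: it applies a hardcoded 0.95 safety factor so
# QE stops before SLURM kills the job.

def compute_max_seconds(max_wallclock_seconds: int, safety_factor: float = 0.95) -> int:
    """Return the max_seconds QE is given to stop gracefully."""
    return int(max_wallclock_seconds * safety_factor)


# With the options above: 24 h minus 100 s requested from SLURM.
print(compute_max_seconds(24 * 3600 - 100))  # 81985
```

So with this setup QE is told to stop roughly 72 minutes before the scheduler limit, which is usually, but not always, enough to write the restart files.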
You may also want to double-check the input `parameters` of the `PwCalculation` to make sure that `CONTROL.max_seconds` was properly set.
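One way to perform that check is sketched below, with a plain dict standing in for what you would get from the calculation node's input parameters (the helper name `get_max_seconds` is hypothetical):

```python
# Hedged sketch: verify that CONTROL.max_seconds is present in the
# PwCalculation input parameters. The dict below stands in for the
# dictionary you would retrieve from the calculation node in AiiDA.

def get_max_seconds(parameters: dict):
    """Return CONTROL.max_seconds if set, else None."""
    return parameters.get('CONTROL', {}).get('max_seconds')


params = {'CONTROL': {'calculation': 'scf', 'max_seconds': 81985}}
print(get_max_seconds(params))  # 81985
print(get_max_seconds({'CONTROL': {'calculation': 'scf'}}))  # None
```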
So my scheduler (SLURM) seems to have been unavailable at some point during the workchain execution, and on walltime exhaustion there was no restart of the `PwCalculation`:

This is the unrecoverable problem (part of the log).

Is it possible to handle this with a pause rather than `Finished [300]`, or to alter `do_update` so that the behaviour after 5 failed attempts is configurable: either pause, or keep the current default behaviour?