Open yakutovicha opened 4 years ago
I agree that this would be the expected behavior, but I am not sure it is that easy to implement. The problem is that the direct scheduler is not an actual scheduler, so there is no external process checking on the calculation job and killing it when necessary. If we want this behavior, it would have to be implemented in our own engine. We could perhaps schedule a callback in the event loop of the runner that is taking care of a CalcJob on a Computer with the direct scheduler, to kill it X seconds after it started, where X = max_wallclock_seconds if set. This is in principle doable, but the tricky part is making sure this event is properly rescheduled if the daemon is stopped and resumed. But maybe all of this complexity is not worthwhile, given that the direct scheduler is really intended mostly for testing and debugging.
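As a rough illustration of the callback idea, here is a minimal sketch using plain asyncio rather than AiiDA's actual runner. The helper name run_with_wallclock and its return value are hypothetical, and the sketch does not handle the hard part mentioned above: if the process owning the event loop is stopped and resumed, the timer is lost unless the start time is persisted and the remaining time is recomputed on restart.

```python
import asyncio
import sys


async def run_with_wallclock(argv, max_wallclock_seconds=None):
    """Run argv and kill it once max_wallclock_seconds have elapsed.

    Hypothetical helper for illustration only; not part of AiiDA.
    Returns (returncode, killed).
    """
    proc = await asyncio.create_subprocess_exec(*argv)
    loop = asyncio.get_running_loop()
    killed = False

    def _kill():
        # Only kill if the process is still running when the timer fires.
        nonlocal killed
        if proc.returncode is None:
            killed = True
            proc.kill()

    handle = None
    if max_wallclock_seconds is not None:
        # Schedule the kill callback X seconds from now in the event loop.
        handle = loop.call_later(max_wallclock_seconds, _kill)
    try:
        returncode = await proc.wait()
    finally:
        # The job finished (or we were interrupted): drop the pending timer.
        if handle is not None:
            handle.cancel()
    return returncode, killed


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(asyncio.run(run_with_wallclock(sleeper, max_wallclock_seconds=1)))
```

The timer is cancelled in the finally clause so that a job finishing in time never leaves a stale kill callback behind.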
This could be solved by adding the timeout command in the launch script: http://man7.org/linux/man-pages/man1/timeout.1.html
I'm not sure about cross-platform availability though -- and one should probably check if it interacts nicely with mpirun.
This could be solved by adding the timeout command in the launch script: http://man7.org/linux/man-pages/man1/timeout.1.html
Thanks, @greschd, for the nice suggestion. It helped me fix the problem (at least temporarily).
I'm not sure about cross-platform availability though -- and one should probably check if it interacts nicely with mpirun.
I checked this with mpirun on macOS, and it works regardless of whether I put timeout before mpirun:
'timeout' '10' 'mpirun' '-np' '2' '/usr/local/bin/cp2k.popt' '-i' 'aiida.inp' > 'aiida.out' 2>&1
or after it:
'mpirun' '-np' '2' 'timeout' '10' '/usr/local/bin/cp2k.popt' '-i' 'aiida.inp' > 'aiida.out' 2>&1
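The two placements above can be sketched as a small Python helper that builds the command line. The function name wrap_with_timeout and its arguments are made up for illustration and are not part of AiiDA; note also that shlex.quote only quotes tokens that need it, so the result is not quoted token by token like the lines above.

```python
import shlex


def wrap_with_timeout(launcher, program, seconds, before_launcher=True):
    """Return a shell command line with `timeout <seconds>` inserted.

    launcher: tokens of the MPI launcher, e.g. ['mpirun', '-np', '2']
    program:  tokens of the executable and its arguments
    Hypothetical helper for illustration only.
    """
    guard = ["timeout", str(seconds)]
    if before_launcher:
        # timeout supervises mpirun itself.
        argv = guard + launcher + program
    else:
        # mpirun starts one timeout per rank, each supervising the executable.
        argv = launcher + guard + program
    return " ".join(shlex.quote(token) for token in argv)


launcher = ["mpirun", "-np", "2"]
program = ["/usr/local/bin/cp2k.popt", "-i", "aiida.inp"]
print(wrap_with_timeout(launcher, program, 10))
print(wrap_with_timeout(launcher, program, 10, before_launcher=False))
```

In the first form a single timeout process watches the launcher; in the second, every MPI rank gets its own timeout wrapper.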
Btw, @sphuber, that raises another question that comes up for me from time to time: is there any official way to add something right before the code path:
<HERE> '/usr/local/bin/cp2k.popt' '-i' 'aiida.inp' > 'aiida.out' 2>&1
timeout is a good candidate for this position. I didn't find a proper way to do it, neither in CodeInfo nor in CalcInfo.
If you are using mpirun, you should be able to use metadata.options.mpirun_extra_params, e.g.:
inputs = {
    'metadata': {
        'options': {
            'withmpi': True,
            'mpirun_extra_params': ['timeout', '10']
        }
    }
}
However, this only works when running with MPI. If not, this is not possible. You might want to open up a feature request in that case.
However, this only works when running with MPI. If not, this is not possible. You might want to open up a feature request in that case.
I think it is a nice feature to have. I will open the request (#3731)
I noticed that the "max_wallclock_seconds" parameter is not taken into account by the direct job scheduler. I haven't looked at the implementation yet, but the net result is that the job keeps running regardless of whether the time limit is exceeded. Shouldn't it just be killed by AiiDA?