aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io

direct scheduler: kill the job after time exceeded. #3717

Open yakutovicha opened 4 years ago

yakutovicha commented 4 years ago

I noticed that the "max_wallclock_seconds" parameter is not taken into account by the direct job scheduler.

I haven't looked at the implementation yet, but the net result is that the job keeps running regardless of whether the time limit is exceeded. Shouldn't it just be killed by AiiDA?

sphuber commented 4 years ago

I agree that this would be the expected behavior, but I am not sure it is that easy to implement. The problem is that the direct scheduler is not an actual scheduler, so there is no external process checking on the calculation job and killing it when necessary. If we want this behavior, it would have to be implemented in our own engine. We could perhaps schedule a callback in the event loop of the runner that is taking care of a CalcJob on a Computer with the direct scheduler, to kill it X seconds after it started, where X = max_wallclock_seconds if set. This is in principle doable, but the tricky part is making sure this event is properly rescheduled if the daemon is stopped and resumed. But maybe all of that complexity is not worthwhile, given that the direct scheduler is really intended mostly for testing and debugging.
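
For illustration, the callback idea could be sketched roughly as below with plain asyncio. This is not AiiDA engine code: `run_with_wallclock_limit` is a hypothetical helper, the process is a local subprocess rather than a job on a remote Computer, and the hard part (rescheduling the kill after the daemon is stopped and resumed) is not handled at all.

```python
import asyncio


async def run_with_wallclock_limit(cmd, max_wallclock_seconds):
    """Run `cmd` and kill it if it is still alive after the limit.

    Hypothetical helper for illustration only.
    Returns a tuple (return_code, timed_out).
    """
    process = await asyncio.create_subprocess_exec(*cmd)
    loop = asyncio.get_running_loop()
    timed_out = False

    def kill_if_running():
        # Called by the event loop once the wallclock limit has passed.
        nonlocal timed_out
        if process.returncode is None:
            timed_out = True
            process.kill()

    # Schedule the kill X seconds from now, X = max_wallclock_seconds.
    handle = loop.call_later(max_wallclock_seconds, kill_if_running)
    try:
        return_code = await process.wait()
    finally:
        # If the job finished in time, the pending kill must be dropped.
        handle.cancel()
    return return_code, timed_out
```

Something like `asyncio.run(run_with_wallclock_limit(['sleep', '30'], 1))` should then come back after about a second with `timed_out` set. The scheduled callback lives only in the memory of the running event loop, which is exactly why a daemon restart would lose it unless the start time is stored somewhere.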

greschd commented 4 years ago

This could be solved by adding the timeout command in the launch script: http://man7.org/linux/man-pages/man1/timeout.1.html

I'm not sure about cross-platform availability though -- and one should probably check if it interacts nicely with mpirun.
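
A quick way to check availability and behavior on a given machine might look like the sketch below. It assumes a GNU-style `timeout` binary on the PATH; `run_with_timeout` and the `TIMED_OUT` constant are names made up for this example, and other `timeout` implementations may report a different exit status.

```python
import shutil
import subprocess

# Exit status GNU timeout uses when the time limit was hit
# (assumption: GNU coreutils, without --preserve-status).
TIMED_OUT = 124


def run_with_timeout(cmd, seconds, timeout_bin='timeout'):
    """Run `cmd` under `timeout <seconds>`.

    Hypothetical helper. Returns (exit_status, exceeded), where
    `exceeded` is True if the status matches GNU timeout's 124.
    """
    if shutil.which(timeout_bin) is None:
        # Covers the cross-platform concern: fail loudly if it is missing.
        raise FileNotFoundError(f'{timeout_bin!r} not found on PATH')
    result = subprocess.run([timeout_bin, str(seconds)] + list(cmd))
    return result.returncode, result.returncode == TIMED_OUT
```

With GNU coreutils, `run_with_timeout(['sleep', '30'], 1)` should report the limit as exceeded, while a command that finishes in time passes its own exit status through.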

yakutovicha commented 4 years ago

> This could be solved by adding the timeout command in the launch script: http://man7.org/linux/man-pages/man1/timeout.1.html

Thanks, @greschd, for the nice suggestion. It helped me fix the problem (at least temporarily).

> I'm not sure about cross-platform availability though -- and one should probably check if it interacts nicely with mpirun.

I checked that with mpirun on macOS and it works regardless of whether I put it before mpirun:

'timeout' '10' 'mpirun' '-np' '2' '/usr/local/bin/cp2k.popt' '-i' 'aiida.inp'  > 'aiida.out' 2>&1

or after it

'mpirun' '-np' '2' 'timeout' '10' '/usr/local/bin/cp2k.popt' '-i' 'aiida.inp'  > 'aiida.out' 2>&1

yakutovicha commented 4 years ago

Btw, @sphuber that raises another question that comes to me from time to time: is there any official way to add something right before the code path:

<HERE> '/usr/local/bin/cp2k.popt' '-i' 'aiida.inp'  > 'aiida.out' 2>&1

timeout is a good candidate for this position. I didn't find a proper way to do it in either CodeInfo or CalcInfo.

sphuber commented 4 years ago

If you are using mpirun, you should be able to use metadata.options.mpirun_extra_params, e.g.:

inputs = {
    'metadata': {
        'options': {
            'withmpi': True,
            'mpirun_extra_params': ['timeout', '10']
        }
    }
}

However, this only works when running with MPI. If not, this is not possible. You might want to open up a feature request in that case.
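
To make the placement concrete, here is a rough sketch of how the pieces seem to end up in the command line, judging only by the lines quoted earlier in this thread. `build_command_line` and `quote` are hypothetical helpers, not AiiDA's own code, and the ordering (mpirun command, then extra params, then the code) is an assumption taken from those lines.

```python
def quote(arg):
    """Single-quote one argument for a POSIX shell."""
    return "'" + arg.replace("'", "'\"'\"'") + "'"


def build_command_line(code_argv, mpirun_argv=(), mpirun_extra_params=(),
                       withmpi=False, stdout_name='aiida.out'):
    """Assemble a submit-script line (illustrative sketch only)."""
    argv = []
    if withmpi:
        # Extra params are placed between the mpirun command and the code.
        argv += list(mpirun_argv) + list(mpirun_extra_params)
    # Without MPI the extra params are dropped, so nothing precedes the code.
    argv += list(code_argv)
    quoted = ' '.join(quote(arg) for arg in argv)
    return quoted + '  > ' + quote(stdout_name) + ' 2>&1'


code = ['/usr/local/bin/cp2k.popt', '-i', 'aiida.inp']
with_mpi = build_command_line(
    code,
    mpirun_argv=['mpirun', '-np', '2'],
    mpirun_extra_params=['timeout', '10'],
    withmpi=True,
)
without_mpi = build_command_line(code, mpirun_extra_params=['timeout', '10'])
```

In this sketch `with_mpi` reproduces the second variant shown above (timeout after mpirun), while `without_mpi` contains no timeout at all, which is the gap in the non-MPI case.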

yakutovicha commented 4 years ago

> However, this only works when running with MPI. If not, this is not possible. You might want to open up a feature request in that case.

I think it is a nice feature to have. I will open the request (#3731)