aiidateam / aiida-quantumespresso

The official AiiDA plugin for Quantum ESPRESSO
https://aiida-quantumespresso.readthedocs.io

PwBaseWorkChain scheduler outage handling #1014

Closed mikeatm closed 3 months ago

mikeatm commented 3 months ago

So my scheduler (SLURM) seems to have been unavailable at some point during the work chain execution, and on walltime exhaustion there was no restart of the PwCalculation:

$ verdi process report 89674
2024-03-11 21:47:23 [2266 | REPORT]: [89674|PwBaseWorkChain|run_process]: launching PwCalculation<89683> iteration #1
2024-03-14 10:56:25 [2314 | REPORT]: [89674|PwBaseWorkChain|report_error_handled]: PwCalculation<89683> failed with exit status 120: The job ran out of walltime.
2024-03-14 10:56:25 [2315 | REPORT]: [89674|PwBaseWorkChain|report_error_handled]: Action taken: unrecoverable error, aborting...
2024-03-14 10:56:25 [2316 | REPORT]: [89674|PwBaseWorkChain|inspect_process]: PwCalculation<89683> failed but a handler detected an unrecoverable problem, aborting
2024-03-14 10:56:25 [2317 | REPORT]: [89674|PwBaseWorkChain|on_terminated]: remote folders will not be cleaned

This is the unrecoverable problem (part of the log):

+-> WARNING at 2024-03-12 11:02:23.983674+03:00
 | maximum attempts 5 of calling do_update, exceeded
+-> ERROR at 2024-03-13 10:00:09.828838+03:00
 | Traceback (most recent call last):
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/aiida/engine/utils.py", line 187, in exponential_backoff_retry
 |     result = await coro()
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 193, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/aiida/engine/utils.py", line 94, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/asyncio/tasks.py", line 258, in __step
 |     result = coro.throw(exc)
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 180, in updating
 |     await self._update_job_info()
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 132, in _update_job_info
 |     self._jobs_cache = await self._get_jobs_from_scheduler()
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 98, in _get_jobs_from_scheduler
 |     transport = await request
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/asyncio/futures.py", line 284, in __await__
 |     yield self  # This tells Task to wait for completion.
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/asyncio/tasks.py", line 328, in __wakeup
 |     future.result()
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/aiida/engine/transports.py", line 89, in do_open
 |     transport.open()
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/aiida/transports/plugins/ssh.py", line 516, in open
 |     self._client.connect(self._machine, **connection_arguments)
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/paramiko/client.py", line 358, in connect
 |     retry_on_signal(lambda: sock.connect(addr))
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/paramiko/util.py", line 279, in retry_on_signal
 |     return function()
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/paramiko/client.py", line 358, in <lambda>
 |     retry_on_signal(lambda: sock.connect(addr))
 | socket.timeout: timed out
+-> ERROR at 2024-03-13 17:00:47.936726+03:00
 | Traceback (most recent call last):
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/aiida/engine/utils.py", line 187, in exponential_backoff_retry
 |     result = await coro()
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 193, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/aiida/engine/utils.py", line 94, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/asyncio/tasks.py", line 256, in __step
 |     result = coro.send(None)
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 180, in updating
 |     await self._update_job_info()
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 132, in _update_job_info
 |     self._jobs_cache = await self._get_jobs_from_scheduler()
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 109, in _get_jobs_from_scheduler
 |     scheduler_response = scheduler.get_jobs(**kwargs)
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/aiida/schedulers/scheduler.py", line 361, in get_jobs
 |     joblist = self._parse_joblist_output(retval, stdout, stderr)
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/aiida/schedulers/plugins/slurm.py", line 476, in _parse_joblist_output
 |     raise SchedulerError(
 | aiida.schedulers.scheduler.SchedulerError: squeue returned exit code 1 (_parse_joblist_output function)
 | stdout=''
 | stderr='slurm_load_jobs error: Unable to contact slurm controller (connect failure)'
+-> ERROR at 2024-03-13 17:01:47.559816+03:00
 | Traceback (most recent call last):
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/aiida/engine/utils.py", line 187, in exponential_backoff_retry
 |     result = await coro()
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 193, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/aiida/engine/utils.py", line 94, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/asyncio/tasks.py", line 256, in __step
 |     result = coro.send(None)
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 180, in updating
 |     await self._update_job_info()
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 132, in _update_job_info
 |     self._jobs_cache = await self._get_jobs_from_scheduler()
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 109, in _get_jobs_from_scheduler
 |     scheduler_response = scheduler.get_jobs(**kwargs)
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/aiida/schedulers/scheduler.py", line 361, in get_jobs
 |     joblist = self._parse_joblist_output(retval, stdout, stderr)
 |   File "/home/mike/.conda/envs/aiida/lib/python3.9/site-packages/aiida/schedulers/plugins/slurm.py", line 476, in _parse_joblist_output
 |     raise SchedulerError(
 | aiida.schedulers.scheduler.SchedulerError: squeue returned exit code 1 (_parse_joblist_output function)
 | stdout=''
 | stderr='slurm_load_jobs error: Unable to contact slurm controller (connect failure)'
+-> WARNING at 2024-03-14 10:56:12.938641+03:00
 | scheduler parser returned exit code<120>: The job ran out of walltime.
+-> WARNING at 2024-03-14 10:56:23.066278+03:00
 | key 'symmetries' is not present in raw output dictionary
+-> ERROR at 2024-03-14 10:56:23.720168+03:00
 | ERROR_OUTPUT_STDOUT_INCOMPLETE
+-> WARNING at 2024-03-14 10:56:24.002272+03:00
 | output parser returned exit code<120>: The job ran out of walltime.

Is it possible to handle this with a pause rather than Finished [300]? Or could do_update be altered so that the behaviour after 5 failed attempts is configurable: either pause, or keep the current default?

89674  2D ago     PwBaseWorkChain               ⏹ Finished [300]
89679  2D ago     create_kpoints_from_distance  ⏹ Finished [0]
89683  2D ago     PwCalculation                 ⏹ Finished [120]
sphuber commented 3 months ago

I don't think the problem here is the SLURM scheduler being unresponsive. Yes, that did happen, but the engine did what it had to do: it retried, and paused the process after 5 failed attempts. From the logs, it seems that at some point you played the process again and it could once again connect to SLURM.

By that time the job had actually finished, since it was killed by SLURM due to an out-of-walltime error. This was correctly parsed by the SlurmScheduler and assigned the exit code 120. The real problem is that the PwBaseWorkChain didn't handle this exit code by resubmitting the calculation.

The solution should be to add a process handler to the PwBaseWorkChain for the ERROR_SCHEDULER_OUT_OF_WALLTIME exit code that resubmits the job. The only open question is that in this case SLURM killed the job, so QE won't necessarily have had the time to write its current state to the restart files, and the restart may then crash hard due to corrupt output files.
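For illustration, a minimal sketch of what such a handler could look like, written as a subclass so it doesn't require patching the plugin. It relies on the process_handler decorator and ProcessHandlerReport from aiida-core; the handler name, the priority value and the from-scratch restart are assumptions for this example, not the handler that was eventually implemented:

from aiida.engine import ProcessHandlerReport, process_handler

from aiida_quantumespresso.calculations.pw import PwCalculation
from aiida_quantumespresso.workflows.pw.base import PwBaseWorkChain


class PwBaseWorkChainWithWalltimeHandler(PwBaseWorkChain):
    """PwBaseWorkChain with a handler for jobs killed by the scheduler (exit code 120)."""

    @process_handler(priority=580, exit_codes=[PwCalculation.exit_codes.ERROR_SCHEDULER_OUT_OF_WALLTIME])
    def handle_scheduler_out_of_walltime(self, calculation):
        """Resubmit the calculation when the scheduler killed it for exceeding the walltime.

        Since SLURM killed the job, QE may not have written valid restart files,
        so restart from scratch rather than from the (possibly corrupt) outputs.
        """
        self.ctx.inputs.parameters.setdefault('CONTROL', {})['restart_mode'] = 'from_scratch'
        self.report_error_handled(calculation, 'job was killed by the scheduler, restarting from scratch')
        return ProcessHandlerReport(True)

Restarting from scratch avoids relying on possibly corrupt restart files, at the cost of losing the previous progress. As for the scheduler outage itself, the pausing behaviour you saw already covers it: once the scheduler is reachable again, the paused process can be resumed with verdi process play <pk>.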

What were the inputs for your calculation? How big was the system and how much time did you request from the scheduler?

mikeatm commented 3 months ago

Hmm, you are right, it looks like I explicitly set the time:

builder = PwBaseWorkChain.get_builder_from_protocol(code, struc, protocol='fast', options=options, overrides=inputs_scf)

where options is:

options = {
    'custom_scheduler_commands': (
        '#SBATCH --partition=normal\n'
        '#SBATCH -A physics\n'
        'export OMP_NUM_THREADS={}\n'
    ).format(OMPNUMTHREAD),
    'resources': {
        'num_machines': NUMMACHINE,
        'num_cores_per_machine': NUMCOREPER,
        'num_mpiprocs_per_machine': NUMCOREPER // OMPNUMTHREAD,
    },
    'max_wallclock_seconds': HOURS * 60 * 60 - 100,
}

I picked 100 seconds less than the 24-hour limit. How do I use the full 24 hours with the workflow, with the time set on the QE input max_seconds? Do I set it manually in the overrides? How was this designed to work and restart through the workflow so that QE is not killed (graceful stop)?

sphuber commented 3 months ago

The workflow will automatically set the CONTROL.max_seconds parameter to 95% of the requested walltime. https://github.com/aiidateam/aiida-quantumespresso/blob/9cb1cfa8a70d19af7aaa1b624cf17c8babe93f41/src/aiida_quantumespresso/workflows/pw/base.py#L33

For certain cases, the remaining 5% may still not be sufficient for QE to shut down gracefully. Unfortunately, the 95% value is hardcoded and cannot be configured through the inputs. What you can try as a temporary solution is to update that value in the source code. You may also want to double-check the input parameters of the PwCalculation to make sure CONTROL.max_seconds was properly set.
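As a quick sanity check, the stored inputs of the failed calculation can be inspected directly, for example in a verdi shell (PK 89683 is the PwCalculation from the report above):

from aiida.orm import load_node

calc = load_node(89683)  # the PwCalculation from the report above
print(calc.inputs.parameters.get_dict().get('CONTROL', {}).get('max_seconds'))

If this prints None, the work chain did not set max_seconds and QE had no chance to stop gracefully before the scheduler killed the job.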