Gracefully kill running tasks before walltime (Slurm)

pdobbelaere commented 1 week ago

Is your feature request related to a problem? Please describe.

Does Parsl (more specifically HighTroughputExecutor+SlurmProvider) provide some built-in method to notify running tasks when their job allocation is about to hit walltime? Task runtimes are not always predictable, and some option to gracefully kill a task (close/checkpoint files, prep for a restart) could prevent losing workflow progress. I know about the drain option for HTEx, but it does not affect running jobs, if I understood correctly.

Describe alternatives you've considered

In Slurm, you can use the --signal flag to send a signal before walltime, however I have not found an easy way to propagate this signal to tasks running through workers.
You could wrap bash apps with timeout (e.g, timeout 60m python myscript.py), but that does not really work for tasks started halfway through the job allocation (you don't know how much walltime will be left when any task starts).
Job allocation details can be found in the shell environment (with Slurm, anyway), so every app could manage and decide when to shutdown by itself. I would say this is not the responsibility of apps.
You could play it safe and always checkpoint periodically. Brute-forcing it should work in most scenarios, but feels somewhat inelegant.

Currently, my hacky workaround is to launch a simple background Python script - before starting the process_worker_pool - which sleeps until right before the job allocation ends and then signals any (sub)processes created by workers (see below). This approach seems to work fine, but is bound to fail under some circumstances. There must be a better/cleaner way.

graceful_shutdown.py

``` """ Set a shutdown timer and kill everything TODO: make exit_window variable """ import os import signal import psutil import datetime EXIT_SIGNAL = signal.SIGUSR1 EXIT_WINDOW = 30 # seconds EXIT_CODE = 69 def signal_handler(signum, frame): """""" pass def signal_handler_noop(signum, frame): """Dummy handler that does not do anything.""" print(f'Received signal {signum} at frame {frame}') print('What do I do with this omg..') exit(EXIT_CODE) def find_processes() -> list[psutil.Process]: """""" pkill = psutil.Process() pmain = pkill.parent() job_id, node_id = [pmain.environ().get(_) for _ in ('SLURM_JOBID', 'SLURM_NODELIST')] procs_node = [p for p in psutil.process_iter()] # only consider procs originating from this job procs_job, procs_denied = [], [] for p in procs_node: try: if p.environ().get('SLURM_JOBID') == job_id: procs_job.append(p) except psutil.AccessDenied: procs_denied.append(p) pwork = [p for p in procs_job if p.name() == 'process_worker_'] pworker = sum([p.children() for p in pwork], []) ptasks = sum([p.children() for p in pworker], []) print( f'Job processes (job_id={job_id}, node_id={node_id}):', *procs_job, 'Main process :', pmain, 'Kill process:', pkill, 'Workers:', *pworker, 'Running tasks:', *ptasks, sep='\n' ) return ptasks def main(): """""" time_start = datetime.datetime.fromtimestamp(float(os.environ.get('SLURM_JOB_START_TIME', 0))) time_stop = datetime.datetime.fromtimestamp(float(os.environ.get('SLURM_JOB_END_TIME', 0))) duration = time_stop - time_start print(f'Job allocation (start|stop|duration): {time_start} | {time_stop} | {duration}') print('Awaiting the app-ocalypse..') signal.alarm((time_stop - datetime.datetime.now()).seconds - EXIT_WINDOW) signal.signal(signal.SIGALRM, signal_handler) signal.sigwait([signal.SIGALRM]) # TODO: this kills process? print(f'Received signal {signal.SIGALRM.name} at {datetime.datetime.now()}') for p in find_processes(): print(f'Sending {EXIT_SIGNAL.name} to process {p.pid}..') os.kill(p.pid, EXIT_SIGNAL) exit(EXIT_CODE) if __name__ == "__main__": main() ```

job script/logs

_The originating Python script controlling Parsl uses some custom code, but all of that is irrelevant. Essentially, we launch bash apps that sleep indefinitely until they catch a signal._ **parsl.hpc_htex.block-0.1730729332.0925918** ``` #!/bin/bash #SBATCH --job-name=parsl.hpc_htex.block-0.1730729332.0925918 #SBATCH --output=/kyukon/scratch/gent/vo/000/gvo00003/vsc43633/docteur/2024_10_30_testing_parsl/graceful_exit/runinfo/000/submit_scripts/parsl.hpc_htex.block-0.1730729332.0925918.stdout #SBATCH --error=/kyukon/scratch/gent/vo/000/gvo00003/vsc43633/docteur/2024_10_30_testing_parsl/graceful_exit/runinfo/000/submit_scripts/parsl.hpc_htex.block-0.1730729332.0925918.stderr #SBATCH --nodes=1 #SBATCH --time=1 #SBATCH --ntasks-per-node=1 #SBATCH --mem=4g #SBATCH --cpus-per-task=2 eval "$("$MAMBA_EXE" shell hook --shell bash --prefix "$MAMBA_ROOT_PREFIX" 2> /dev/null)" micromamba activate main; micromamba info export PYTHONPATH=$PYTHONPATH:$VSC_DATA_VO_USER echo "PYTHONPATH = $PYTHONPATH" python /user/gent/436/vsc43633/DATA/mypackage/parsl/graceful_shutdown.py & export PARSL_MEMORY_GB=4 export PARSL_CORES=2 export JOBNAME="parsl.hpc_htex.block-0.1730729332.0925918" set -e export CORES=$SLURM_CPUS_ON_NODE export NODES=$SLURM_JOB_NUM_NODES [[ "1" == "1" ]] && echo "Found cores : $CORES" [[ "1" == "1" ]] && echo "Found nodes : $NODES" WORKERCOUNT=1 cat << SLURM_EOF > cmd_$SLURM_JOB_NAME.sh process_worker_pool.py -a 157.193.252.90,10.141.10.67,10.143.10.67,172.24.10.67,127.0.0.1 -p 0 -c 1 -m 2 --poll 10 --task_port=54670 --result_port=54326 --cert_dir None --logdir=/kyukon/scratch/gent/vo/000/gvo00003/vsc43633/docteur/2024_10_30_testing_parsl/graceful_exit/runinfo/000/hpc_htex --block_id=0 --hb_period=30 --hb_threshold=120 --drain_period=None --cpu-affinity none --mpi-launcher=mpiexec --available-accelerators SLURM_EOF chmod a+x cmd_$SLURM_JOB_NAME.sh srun --ntasks 1 -l bash cmd_$SLURM_JOB_NAME.sh [[ "1" == "1" ]] && echo "Done" ``` **parsl.hpc_htex.block-0.1730729332.0925918.stderr** ``` srun: Job step aborted: Waiting up to 32 seconds for job step to finish. slurmstepd: error: *** JOB 50069807 ON node3506.doduo.os CANCELLED AT 2024-11-04T15:09:57 *** 0: slurmstepd: error: *** STEP 50069807.0 ON node3506.doduo.os CANCELLED AT 2024-11-04T15:09:57 *** ``` **parsl.hpc_htex.block-0.1730729332.0925918.stdout** ``` __ __ ______ ___ ____ _____ ___ / /_ ____ _ / / / / __ `__ \/ __ `/ __ `__ \/ __ \/ __ `/ / /_/ / / / / / / /_/ / / / / / / /_/ / /_/ / / .___/_/ /_/ /_/\__,_/_/ /_/ /_/_.___/\__,_/ /_/ environment : /kyukon/data/gent/vo/000/gvo00003/vsc43633/micromamba/envs/main (active) env location : /kyukon/data/gent/vo/000/gvo00003/vsc43633/micromamba/envs/main user config files : /user/gent/436/vsc43633/.mambarc populated config files : /user/gent/436/vsc43633/.condarc libmamba version : 1.4.3 micromamba version : 1.4.3 curl version : libcurl/7.88.1 OpenSSL/3.1.0 zlib/1.2.13 zstd/1.5.2 libssh2/1.10.0 nghttp2/1.52.0 libarchive version : libarchive 3.6.2 zlib/1.2.13 bz2lib/1.0.8 libzstd/1.5.2 virtual packages : __unix=0=0 __linux=4.18.0=0 __glibc=2.28=0 __archspec=1=x86_64 channels : base environment : /kyukon/data/gent/vo/000/gvo00003/vsc43633/micromamba platform : linux-64 PYTHONPATH = :/data/gent/vo/000/gvo00003/vsc43633 Found cores : 2 Found nodes : 1 Job allocation (start|stop|duration): 2024-11-04 15:09:14 | 2024-11-04 15:10:14 | 0:01:00 Awaiting the app-ocalypse.. Received signal SIGALRM at 2024-11-04 15:09:43.442739 Job processes (job_id=50069807, node_id=node3506.doduo.os): psutil.Process(pid=2289391, name='slurm_script', status='sleeping', started='15:09:14') psutil.Process(pid=2289408, name='python', status='running', started='15:09:15') psutil.Process(pid=2289411, name='srun', status='sleeping', started='15:09:15') psutil.Process(pid=2289412, name='srun', status='sleeping', started='15:09:15') psutil.Process(pid=2289426, name='bash', status='sleeping', started='15:09:15') psutil.Process(pid=2289427, name='process_worker_', status='sleeping', started='15:09:15') psutil.Process(pid=2289439, name='python', status='sleeping', started='15:09:19') psutil.Process(pid=2289440, name='python', status='sleeping', started='15:09:19') psutil.Process(pid=2289448, name='python', status='sleeping', started='15:09:19') psutil.Process(pid=2289449, name='python', status='sleeping', started='15:09:19') psutil.Process(pid=2289508, name='python', status='sleeping', started='15:09:22') Main process : psutil.Process(pid=2289391, name='slurm_script', status='sleeping', started='15:09:14') Kill process: psutil.Process(pid=2289408, name='python', status='running', started='15:09:15') Workers: psutil.Process(pid=2289439, name='python', status='sleeping', started='15:09:19') psutil.Process(pid=2289440, name='python', status='sleeping', started='15:09:19') psutil.Process(pid=2289448, name='python', status='sleeping', started='15:09:19') psutil.Process(pid=2289449, name='python', status='sleeping', started='15:09:19') Running tasks: psutil.Process(pid=2289508, name='python', status='sleeping', started='15:09:22') Sending SIGUSR1 to process 2289508.. ```

svandenhaute commented 1 week ago

You can use https://github.com/Parsl/parsl/blob/6dd5845baff745ae012b6c681926e3ca20278033/parsl/executors/high_throughput/executor.py#L135

benclifford commented 1 week ago

You could play it safe and always checkpoint periodically. Brute-forcing it should work in most scenarios, but feels somewhat inelegant.

This is pretty much the traditional approach that parsl's worker model has had, but in recent times we've been pushing more towards managing the end of things a bit better, mostly with things like the drain time and trying to avoid placing tasks on soon-to-end workers (see also #3323).

Having the worker pool send unix signals to launched bash apps is probably an interesting thing to implement - triggered by either the external batch system or by knowledge of the environment (drain style)

Parsl / parsl

Gracefully kill running tasks before walltime (Slurm) #3674