Parsl / parsl

Parsl - a Python parallel scripting library
http://parsl-project.org
Apache License 2.0

Job submitted via SGE scheduler hangs until walltime #3542

Open giordano opened 1 month ago

giordano commented 1 month ago

Describe the bug

I have a pipeline for an SGE-based cluster which looks roughly like

import parsl
from parsl import bash_app

from parsl.channels import LocalChannel
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import GridEngineProvider
from parsl.usage_tracking.levels import LEVEL_1

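# NOTE: `system` and `logdir_root` are placeholders for values defined
# elsewhere in the full pipeline (executor label and worker log directory root).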
config = Config(
    executors=[
        HighThroughputExecutor(
            label=system,
            max_workers_per_node=1,
            worker_logdir_root=logdir_root,
            provider=GridEngineProvider(
                channel=LocalChannel(),
                nodes_per_block=1,
                init_blocks=1,
                max_blocks=1,
                walltime="00:30:00",
                scheduler_options='#$ -pe mpi 18',
                # The Parsl Python environment needs to be loaded and
                # activated on the compute node as well.
                worker_init="...", # here I manually source a shell script (#3538) and activate the Python environment (#3541)
            ),
        )
    ],
    # Ad hoc clusters should not be set up with a scaling strategy.
    strategy='none',
    usage_tracking=LEVEL_1,
)

@bash_app
def test():
    return 'echo hello world'

parsl.load(config)
test().result()

The bash app itself works fine as far as I can tell (so does the more complicated one I'm actually using; I'm showing echo hello world here just for simplicity), but the problem is that the job never finishes and is only killed by the scheduler when the requested walltime is reached.

The submitted job script looks like this:

#!/bin/bash
#$ -S /bin/bash
#$ -o ...
#$ -e ...
#$ -cwd
#$ -l h_rt=00:30:00
#$ -pe mpi 18

# here I activate the Python environment...

export JOBNAME="..."

set -e
export CORES=$(getconf _NPROCESSORS_ONLN)
[[ "1" == "1" ]] && echo "Found cores : $CORES"
WORKERCOUNT=1
FAILONANY=0
PIDS=""

CMD() {
process_worker_pool.py  --max_workers_per_node=1 -a 10.34.0.15,127.0.0.1,10.28.101.3,193.60.238.110,10.128.20.15,192.168.122.1,10.128.24.15 -p 0 -c 1.0 -m None --poll 10 --task_port=54459 --result_port=54664 --cert_dir None --logdir=... --block_id=0 --hb_period=30  --hb_threshold=120 --drain_period=None --cpu-affinity none  --mpi-launcher=mpiexec --available-accelerators 
}
for COUNT in $(seq 1 1 $WORKERCOUNT); do
    [[ "1" == "1" ]] && echo "Launching worker: $COUNT"
    CMD $COUNT &
    PIDS="$PIDS $!"
done

ALLFAILED=1
ANYFAILED=0
for PID in $PIDS ; do
    wait $PID
    if [ "$?" != "0" ]; then
        ANYFAILED=1
    else
        ALLFAILED=0
    fi
done

[[ "1" == "1" ]] && echo "All workers done"
if [ "$FAILONANY" == "1" ]; then
    exit $ANYFAILED
else
    exit $ALLFAILED
fi

I can't spot anything wrong with the job script options; my understanding is that process_worker_pool.py never finishes, so wait $PID waits forever. I also don't know whether this is really specific to SGE - it's just where I'm experiencing the issue.

To Reproduce

Steps to reproduce the behavior:

  1. Set up Parsl 2024.07.15 with Python 3.11.3 on the cluster
  2. Run the pipeline above
  3. Wait for the job to finish
  4. Realise that the app ran successfully within a few seconds/minutes, but the job had to wait until the walltime to be released

Expected behavior

Ideally the job would finish when the app's work is done, rather than running until the walltime, which may be set conservatively large; keeping a node busy doing exactly nothing is a waste of resources.

Environment

Distributed Environment: SGE-based HPC cluster, Parsl 2024.07.15, Python 3.11.3

benclifford commented 1 month ago

How this is meant to work: the Parsl scaling code - the same code that submits the batch job - is also meant to cancel that batch job at exit. It is that cancellation which kills the process worker pools, rather than the pools exiting on their own.

You need to shut down Parsl for that to happen, though -- this used to happen automatically at exit of the workflow script, but modern Python is increasingly hostile to doing complicated things at Python shutdown, and so this was removed in PR #3165
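For reference, the same shutdown can be triggered explicitly - a minimal sketch, assuming the DataFlowKernel object returned by parsl.load() and its cleanup() method; treat the exact call as illustrative rather than the definitive pattern:

import parsl

# config and test() as defined in the report above
dfk = parsl.load(config)
try:
    test().result()
finally:
    dfk.cleanup()  # explicit shutdown; this is the step that cancels the batch job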

You can use Parsl as a context manager like this:

with parsl.load(config):
    test().result()

and when the with block exits, Parsl will shut down.

That's the point at which batch jobs should be cancelled. You should see that happen in parsl.log along with a load of other shutdown stuff happening.
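Concretely, the tail of the reproducer above would become something like the following (a sketch - the Config construction from the report is unchanged and omitted here):

import parsl
from parsl import bash_app

# config: the same Config object constructed in the report above

@bash_app
def test():
    return 'echo hello world'

# parsl.load() returns the DataFlowKernel; used as a context manager,
# Parsl shuts down when the block exits and the scaling code cancels
# the SGE batch job instead of leaving it to run until the walltime.
with parsl.load(config):
    test().result()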

If you are still getting leftover batch jobs even with with, attach a full parsl.log from your example above and I'll have a look for anything obviously weird.