Parsl / parsl

Parsl - a Python parallel scripting library
http://parsl-project.org
Apache License 2.0

Job submitted via SGE scheduler hangs until walltime #3542

Open giordano opened 3 months ago

giordano commented 3 months ago

Describe the bug

I have a pipeline for an SGE-based cluster which looks roughly like

import parsl
from parsl import bash_app

from parsl.channels import LocalChannel
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import GridEngineProvider
from parsl.usage_tracking.levels import LEVEL_1

config = Config(
    executors=[
        HighThroughputExecutor(
            label=system,
            max_workers_per_node=1,
            worker_logdir_root=logdir_root,
            provider=GridEngineProvider(
                channel=LocalChannel(),
                nodes_per_block=1,
                init_blocks=1,
                max_blocks=1,
                walltime="00:30:00",
                scheduler_options='#$ -pe mpi 18',
                # The Parsl Python environment also needs to be loaded and
                # activated on the compute node.
                worker_init="...", # here I'm manually loading shell script file (#3538) and activating Python environment (#3541)
            ),
        )
    ],
    # AdHoc clusters should not be set up with a scaling strategy.
    strategy='none',
    usage_tracking=LEVEL_1,
)

@bash_app
def test():
    return 'echo hello world'

parsl.load(config)
test().result()

The bash app works fine as far as I can tell (including the more complicated one I'm actually using; I'm showing echo hello world here just for simplicity), but the job never finishes and is only killed by the scheduler when the requested walltime is reached.

The submit job script looks like

#!/bin/bash
#$ -S /bin/bash
#$ -o ...
#$ -e ...
#$ -cwd
#$ -l h_rt=00:30:00
#$ -pe mpi 18

# here I activate the Python environment...

export JOBNAME="..."

set -e
export CORES=$(getconf _NPROCESSORS_ONLN)
[[ "1" == "1" ]] && echo "Found cores : $CORES"
WORKERCOUNT=1
FAILONANY=0
PIDS=""

CMD() {
process_worker_pool.py  --max_workers_per_node=1 -a 10.34.0.15,127.0.0.1,10.28.101.3,193.60.238.110,10.128.20.15,192.168.122.1,10.128.24.15 -p 0 -c 1.0 -m None --poll 10 --task_port=54459 --result_port=54664 --cert_dir None --logdir=... --block_id=0 --hb_period=30  --hb_threshold=120 --drain_period=None --cpu-affinity none  --mpi-launcher=mpiexec --available-accelerators
}
for COUNT in $(seq 1 1 $WORKERCOUNT); do
    [[ "1" == "1" ]] && echo "Launching worker: $COUNT"
    CMD $COUNT &
    PIDS="$PIDS $!"
done

ALLFAILED=1
ANYFAILED=0
for PID in $PIDS ; do
    wait $PID
    if [ "$?" != "0" ]; then
        ANYFAILED=1
    else
        ALLFAILED=0
    fi
done

[[ "1" == "1" ]] && echo "All workers done"
if [ "$FAILONANY" == "1" ]; then
    exit $ANYFAILED
else
    exit $ALLFAILED
fi

I can't spot anything wrong with the job script options. My understanding is that process_worker_pool.py never finishes, so wait $PID waits forever. I also don't know whether this is really specific to SGE; this is just where I'm experiencing the issue.
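
The behaviour described above can be reproduced without Parsl or SGE. The following is a minimal sketch using only the Python standard library, under the assumption that the worker pool behaves like any long-lived process: a sleeping child stands in for process_worker_pool.py, and the name `wait_briefly` is hypothetical. It shows that a bounded wait gives up while the child is still running, and that the child only goes away once something outside it stops it.

```python
import subprocess
import sys

# Stand-in for process_worker_pool.py: a child that keeps running until
# something outside it stops it (here it simply sleeps for a minute).
LONG_RUNNING_WORKER = [sys.executable, "-c", "import time; time.sleep(60)"]


def wait_briefly(proc, timeout):
    """Return True if the child exited within `timeout` seconds."""
    try:
        proc.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        return False


worker = subprocess.Popen(LONG_RUNNING_WORKER)

# The child has no reason to exit by itself, so a bounded wait times out.
exited_on_its_own = wait_briefly(worker, timeout=1.0)

# An unbounded wait would block here, much like `wait $PID` in the job
# script, so the child is stopped explicitly instead.
worker.terminate()
worker.wait()
exited_after_terminate = worker.poll() is not None

print(exited_on_its_own, exited_after_terminate)
```

In the job script there is no timeout and nothing inside the job that terminates the worker, which would be consistent with the job lasting until the scheduler enforces the walltime.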

To Reproduce

Steps to reproduce the behavior:

  1. Set up Parsl 2024.07.15 with Python 3.11.3 on the cluster
  2. Run the pipeline above
  3. Wait for the job to finish
  4. Notice that the app ran successfully within a few seconds or minutes, but the job was only released once the walltime was reached

Expected behavior

Ideally the job would finish as soon as the app's work is done rather than at the walltime, which may be set conservatively large. Keeping a node busy doing nothing is a waste of resources.

Environment

Distributed Environment

benclifford commented 3 months ago

How this is meant to work is that the Parsl scaling code - the same code that submits the batch job - is also meant to cancel the batch job at exit. That's what is meant to kill process worker pools, rather than the pools exiting themselves.

You need to shut down Parsl to do that, though. This used to happen automatically at exit of the workflow script, but modern Python is increasingly hostile to doing complicated things at Python shutdown, so this was removed in PR #3165.

You can use Parsl as a context manager like this:

with parsl.load(config):
    test().result()

and when the with block exits, Parsl will shut down.

That's the point at which batch jobs should be cancelled. You should see that happen in parsl.log along with a load of other shutdown stuff happening.
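
To illustrate the general pattern (this is not Parsl's implementation), here is a minimal standard-library sketch of a context manager that owns a long-lived worker and stops it when the with block exits. The helper `worker_block` and the sleeping child process are hypothetical stand-ins for the batch job and its worker pool.

```python
import contextlib
import subprocess
import sys


@contextlib.contextmanager
def worker_block(command):
    """Start a long-lived worker and guarantee it is stopped on exit.

    Hypothetical helper for illustration only; it is not part of Parsl.
    """
    proc = subprocess.Popen(command)
    try:
        yield proc
    finally:
        # Runs both on normal exit and when the body raises, which is what
        # makes the with block a reliable place to release resources.
        proc.terminate()
        proc.wait()


# Stand-in for a worker pool that never exits by itself.
sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]

with worker_block(sleeper) as proc:
    running_inside = proc.poll() is None    # still alive inside the block

stopped_after = proc.poll() is not None     # stopped once the block exits

print(running_inside, stopped_after)
```

The point of the design is that cleanup is tied to a well-defined scope in the script rather than to interpreter shutdown, so it does not depend on what Python still allows at exit.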

If you are still getting leftover batch jobs even with the with block, attach a full parsl.log from your example above and I'll have a look for anything obviously weird.

giordano commented 1 month ago

Sorry, I was only able to try this now. I can confirm that using the context manager does indeed do the trick for me, thanks! The only thing I noticed is that, even if the bash/python app itself is successful, the job ends with exit code 137 (= 128 + 9, where 9 is SIGKILL), but perhaps that's expected because the job is killed by parsl? The parsl script terminates with 0 as expected.

One other comment: I'm not sure the documentation makes it clear that using the context manager is necessary in this case. I can't pinpoint exactly which sections I was looking at, though; it was a couple of months ago now.

benclifford commented 1 month ago

The job should be terminated by qdel - see https://github.com/Parsl/parsl/blob/dd9150d7ac26b04eb8ff15247b1c18ce9893f79c/parsl/providers/grid_engine/grid_engine.py#L216 - so I'd expect whatever behaviour you would expect from qdel. I'd usually expect something more like a SIGTERM there for batch systems in general, but I don't know exactly what's happening in your situation.
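
For reference, the 128 + N convention behind the 137 status reported above can be decoded with a few lines of Python. This is a minimal sketch assuming a POSIX system where the status follows the common shell convention; the function name `describe_exit_status` is hypothetical, and how a given Grid Engine installation reports statuses may differ.

```python
import signal


def describe_exit_status(status):
    """Explain a shell-style exit status.

    Assumes the common convention that a process killed by signal N is
    reported with status 128 + N.
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            return f"killed by unknown signal {signum}"
        return f"killed by {name} (signal {signum})"
    return f"exited with status {status}"


for status in (137, 143, 0):
    print(status, "->", describe_exit_status(status))
```

Under this convention a SIGTERM would show up as 143 rather than 137, which is one way to tell the two cases apart in job accounting output.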

The context manager is pretty much always necessary now (due to ongoing changes in how exit/shutdown is handled in Python itself), but because this is new, a lot of documentation doesn't talk about it. If you see any documentation that does a parsl.load() without a with, it might be out of date.