giordano opened this issue 1 month ago
How this is meant to work is that the Parsl scaling code - the same code that submits the batch job - is also meant to cancel the batch job at exit. That's what is meant to kill process worker pools, rather than the pools exiting themselves.
You need to shut down parsl to do that, though -- this used to happen automatically at exit of the workflow script, but modern Python is increasingly hostile to doing complicated things at Python shutdown, and so this was removed in PR #3165.
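For completeness, that shutdown can also be triggered explicitly; a minimal sketch, assuming a recent Parsl where `parsl.dfk()` returns the loaded DataFlowKernel and its `cleanup()` method performs the shutdown (the config here is a placeholder, not the SGE one from the report):

```python
import parsl
from parsl.config import Config
from parsl.executors import ThreadPoolExecutor

# Placeholder config, not the SGE-backed one from the report below.
parsl.load(Config(executors=[ThreadPoolExecutor(label="threads")]))

# ... define apps and run the workflow here ...

# Explicit shutdown: this is the point at which Parsl's scaling code
# cancels any batch jobs it submitted.
parsl.dfk().cleanup()
```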
You can use Parsl as a context manager like this:

```python
with parsl.load(config):
    test().result()
```

and when the `with` block exits, Parsl will shut down.
That's the point at which batch jobs should be cancelled. You should see that happen in `parsl.log`, along with a load of other shutdown stuff happening.
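For reference, a minimal end-to-end sketch of that pattern; the config and app here are placeholders (a local `HighThroughputExecutor` and a trivial `bash_app`), not the SGE setup from the report:

```python
import parsl
from parsl import bash_app
from parsl.config import Config
from parsl.executors import HighThroughputExecutor

# Placeholder config -- substitute the SGE-backed config from the report.
config = Config(executors=[HighThroughputExecutor(label="htex")])

@bash_app
def test():
    return "echo hello world"

# When the `with` block exits, Parsl shuts down, and that shutdown is the
# point at which the scaling code cancels any batch jobs it submitted.
with parsl.load(config):
    test().result()
```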
If you are still getting leftover batch jobs even with `with`, attach a full `parsl.log` from your example above and I'll have a look for anything obviously weird.
Describe the bug
I have a pipeline for an SGE-based cluster which looks roughly like
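The actual snippet is elided from the report; below is an illustrative stand-in for a pipeline of that shape, assuming a `HighThroughputExecutor` backed by Parsl's `GridEngineProvider`, with placeholder queue, walltime, and `worker_init` values:

```python
import parsl
from parsl import bash_app
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import GridEngineProvider

# Illustrative config only: queue, walltime and worker_init are placeholders.
config = Config(
    executors=[
        HighThroughputExecutor(
            label="sge_htex",
            provider=GridEngineProvider(
                nodes_per_block=1,
                init_blocks=1,
                max_blocks=1,
                walltime="01:00:00",
                scheduler_options="#$ -q some.q",   # hypothetical queue
                worker_init="module load python",   # hypothetical environment setup
            ),
        )
    ]
)

@bash_app
def test():
    return "echo hello world"

parsl.load(config)
test().result()
```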
The bash app works fine as far as I can tell (so does the more complicated one I'm actually using; I'm showing `echo hello world` here just for simplicity), but the problem is that the job never finishes and is only killed by the scheduler when the requested walltime is reached.

The submit job script looks like
I can't spot anything wrong with the job script options; my understanding is that `process_worker_pool.py` never finishes and `wait $PID` waits forever. I also don't know whether this is really specific to SGE, it's just where I'm experiencing the issue.

To Reproduce
Steps to reproduce the behavior, e.g.:
Expected behavior
Ideally the job would finish when the app's work is done, rather than running until the walltime is reached; the walltime may be set conservatively large, and it's a waste of resources to keep a node busy doing exactly nothing.
Environment
Distributed Environment