Open giordano opened 4 months ago
How this is meant to work is that the Parsl scaling code - the same code that submits the batch job - is also meant to cancel the batch job at exit. That's what is meant to kill process worker pools, rather than the pools exiting themselves.
You need to shut down Parsl to do that, though. This used to happen automatically at exit of the workflow script, but modern Python is increasingly hostile to doing complicated things at interpreter shutdown, and so this was removed in PR #3165
You can use Parsl as a context manager like this:

```python
with parsl.load(config):
    test().result()
```

and when the `with` block exits, Parsl will shut down.
That's the point at which batch jobs should be cancelled. You should see that happen in `parsl.log`, along with a load of other shutdown stuff. If you are still getting leftover batch jobs even with `with`, attach a full `parsl.log` from your example above and I'll have a look for anything obviously weird.
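The reason the context manager is reliable is ordinary Python `with` semantics: the shutdown path runs when the block exits, even if an app raised. A generic stand-in sketch (not Parsl's actual implementation; all names here are invented for illustration):

```python
import contextlib

events = []

@contextlib.contextmanager
def load_stub():
    # Stand-in for parsl.load(config): pretend to submit a batch job on entry.
    events.append("submit batch job")
    try:
        yield
    finally:
        # Stand-in for Parsl shutdown: always runs when the with block exits,
        # so the batch job gets cancelled even if the body raised an exception.
        events.append("cancel batch job")

with load_stub():
    events.append("run app")

print(events)
```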
Sorry, I was only able to try this now, and I can confirm that using the context manager here does the trick for me, thanks! The only thing I noticed is that, even if the bash/python app itself is successful, the job ends with exit code 137 (= 128 + 9, where 9 is SIGKILL), but perhaps that's expected because the job is killed by Parsl? The Parsl script terminates with 0 as expected.
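That 137 is the standard POSIX-shell encoding of death-by-signal rather than anything Parsl-specific; the arithmetic can be checked directly (assuming Linux signal numbering, where SIGKILL is 9):

```python
import signal

# Shells report a child killed by signal N with exit status 128 + N.
# SIGKILL is signal 9 on Linux, so a SIGKILL'd job surfaces as 137.
status = 128 + int(signal.SIGKILL)
print(status)
```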
One other comment: I'm not sure the documentation makes it clear that using the context manager is necessary in this case. I can't pinpoint exactly which sections I was looking at, though; it was a couple of months ago now.
The job should be terminated by `qdel` - see https://github.com/Parsl/parsl/blob/dd9150d7ac26b04eb8ff15247b1c18ce9893f79c/parsl/providers/grid_engine/grid_engine.py#L216 - so I'd expect whatever behaviour you would expect from `qdel`. I'd usually expect something more like a SIGTERM there for batch systems in general, but I don't know exactly what's happening in your situation.
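The practical difference matters for anything that wants to clean up on exit: a process can trap SIGTERM and exit gracefully, but SIGKILL (which is what an exit status of 137 implies) gives it no chance. A stdlib-only sketch, unrelated to Parsl's own code:

```python
import signal
import subprocess
import sys
import textwrap
import time

# A child that installs a SIGTERM handler for a graceful exit.
# SIGKILL, by contrast, can never be caught, so no cleanup code can run.
child_src = textwrap.dedent("""
    import signal, sys, time
    signal.signal(signal.SIGTERM, lambda *_: sys.exit(0))
    while True:
        time.sleep(0.1)
""")

def run_and_signal(sig):
    p = subprocess.Popen([sys.executable, "-c", child_src])
    time.sleep(0.5)          # give the child time to install its handler
    p.send_signal(sig)
    return p.wait()          # negative values mean "killed by that signal"

print("SIGTERM exit status:", run_and_signal(signal.SIGTERM))  # 0: handler ran
print("SIGKILL exit status:", run_and_signal(signal.SIGKILL))  # -9: no cleanup
```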
The context manager is pretty much always necessary now (due to ongoing changes in how exit/shutdown is handled in Python itself), but because this is new, a lot of documentation doesn't talk about it - if you see any documentation that does a `parsl.load()` without a `with`, it might be out of date.
Describe the bug
I have a pipeline for an SGE-based cluster which looks roughly like
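A minimal sketch of a pipeline of that shape (every configuration value below is an illustrative assumption, not the original script):

```python
import parsl
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import GridEngineProvider

# Illustrative config only: walltime and any scheduler options are assumptions.
config = Config(
    executors=[
        HighThroughputExecutor(
            provider=GridEngineProvider(
                walltime="01:00:00",
            ),
        ),
    ],
)

@parsl.bash_app
def hello():
    return "echo hello world"

parsl.load(config)
hello().result()  # blocks until the app has run inside the batch job
```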
The bash app works fine as far as I can tell (so does the more complicated one I'm actually using; I'm showing `echo hello world` here just for simplicity), but the problem is that the job never finishes and is only killed by the scheduler when the requested walltime is reached. The submit job script looks like
I can't spot anything wrong with the job script options; my understanding is that `process_worker_pool.py` never finishes and `wait $PID` waits forever. I also don't know whether this is really specific to SGE; it's just where I'm experiencing the issue.

To Reproduce
Steps to reproduce the behavior, e.g.:
Expected behavior
Ideally the job would finish when the app's work is done, rather than running until the walltime, which may be set conservatively large; it's a waste of resources to keep a node busy doing exactly nothing.
Environment
Distributed Environment