flatironinstitute / disBatch

Tool to distribute a list of computational tasks over a pool of compute resources. The pool can grow or shrink.
Apache License 2.0

Task failure checks #16

Closed lgarrison closed 4 years ago

lgarrison commented 4 years ago

First of all, thanks for developing this great package! There were a few small features I needed for my use-case, so I hacked them into the code, probably in a sub-optimal way. But maybe they could make their way into master with some expert advice.

First, I had a non-disBatch SLURM job that I wanted to trigger only after several disBatch jobs completed successfully, so I planned to use SLURM dependencies with the "afterok" condition. That requires disBatch to return a non-zero exit code if any task failed. disBatch was already counting task failures, so I just added a sys.exit(1) at the end if any failure was counted.
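To illustrate the idea, here is a minimal sketch of a driver that counts task failures and turns them into a non-zero exit status. It is not the disBatch code: the function names `count_failures` and `main` are hypothetical, and tasks run serially through the shell, whereas disBatch distributes them over a pool.

```python
import subprocess
import sys


def count_failures(commands):
    """Run each shell command in turn and count the non-zero return codes."""
    failures = 0
    for cmd in commands:
        # shell=True because each task is a line of shell, as in a task file.
        if subprocess.run(cmd, shell=True).returncode != 0:
            failures += 1
    return failures


def main(commands):
    """Exit with status 1 if any task failed (hypothetical driver entry point)."""
    failures = count_failures(commands)
    if failures:
        print(f"{failures} task(s) failed", file=sys.stderr)
        sys.exit(1)
```

With an exit status like this, a follow-up job submitted with `sbatch --dependency=afterok:<jobid>` would only start if every task in the earlier job succeeded.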

Separately, I also had a #DISBATCH BARRIER that I only wanted to pass if all previous tasks succeeded. There was already a "check barrier" mechanism in the Python code, so I added a #DISBATCH BARRIER CHECK option to trigger it. I think this means that CHECK is no longer a valid key for BARRIER, so my guess is that a different syntax would be preferred. Happy to take suggestions.
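The intended barrier behavior could be sketched roughly as follows. This is an illustration under simplifying assumptions, not the disBatch implementation: `run_until_failed_check` is a hypothetical name, tasks run serially, and any text after `#DISBATCH BARRIER` other than `CHECK` is simply ignored.

```python
import subprocess

BARRIER_PREFIX = "#DISBATCH BARRIER"


def run_until_failed_check(lines):
    """Run task lines in order; stop at a 'BARRIER CHECK' if an earlier task failed.

    Returns the list of commands that were run and the number of failures.
    """
    failures = 0
    completed = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(BARRIER_PREFIX):
            option = line[len(BARRIER_PREFIX):].strip()
            # A plain barrier always passes in this serial sketch;
            # a checked barrier refuses to pass after any failure.
            if option == "CHECK" and failures > 0:
                break
            continue
        result = subprocess.run(line, shell=True)
        completed.append(line)
        if result.returncode != 0:
            failures += 1
    return completed, failures
```

For example, with the task list `["exit 0", "exit 1", "#DISBATCH BARRIER CHECK", "exit 0"]` the last task is never run, while replacing the barrier line with a plain `#DISBATCH BARRIER` lets it run.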

dylex commented 4 years ago

We decided to only do the exit-on-failure behavior when -e is specified, but otherwise looks great. Thanks!
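A rough sketch of gating the exit status on the -e flag might look like this. Only the `-e` flag comes from the comment above; the long option name `--exit-code`, the program name, and the helper `final_status` are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser with an opt-in flag for exit-on-failure (names are hypothetical)."""
    parser = argparse.ArgumentParser(prog="driver-sketch")
    parser.add_argument(
        "-e",
        "--exit-code",
        action="store_true",
        help="exit with non-zero status if any task failed",
    )
    return parser


def final_status(failures, exit_on_failure):
    """Return 1 only when the flag was given and at least one task failed."""
    return 1 if (exit_on_failure and failures > 0) else 0
```

Making the behavior opt-in keeps the default exit status unchanged for existing users, while jobs chained with "afterok" can request the stricter check.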

lgarrison commented 4 years ago

Sounds great, thanks!

On Mon, Mar 9, 2020 at 7:50 PM Dylan Simon notifications@github.com wrote:

We decided to only do the exit-on-failure behavior when -e is specified, but otherwise looks great. Thanks!


-- Lehman Garrison lgarrison@flatironinstitute.org Flatiron Research Fellow, Cosmology X Data Science Group Center for Computational Astrophysics, Flatiron Institute lgarrison.github.io