JuliaParallel / ClusterManagers.jl

Other
232 stars 74 forks source link

[Slurm] Add retry_delays when waiting for workers #176

Closed David96 closed 2 years ago

David96 commented 2 years ago

Fixes #175

Inspired by the LSF code. It's not very well tested though, especially about the error path I'm not sure. From reading the code I think just throwing an exception is fine?

David96 commented 2 years ago

Just something maybe worth noting: for people using SlurmManager directly (not addprocs_slurm) this is actually a breaking change as the retry_delays parameter is required.

kescobo commented 2 years ago

Oh, thank you for noting that, it's definitely important to know about that for SemVer purposes. Can you set a default value? I think you should be able to just define a constructor like SlurmManager(n::Int) = SlurmManager(n, 3) or something.

David96 commented 2 years ago

Added the default constructor, I used the exponential backoff as used in addprocs_slurm (which I got from LSF), too, though. If you want to minimize changes in behavior I can change both of them to a repeated iterator.

kescobo commented 2 years ago

I don't think minimizing changes is the goal, I think we want the behavior that makes most sense as a default. I'd consider this not technically breaking if the same code works to launch jobs, but I'm happy to defer to others. Incrementing a version isn't the worst thing in the world.

kescobo commented 2 years ago

Not sure if you want to use #154 to try to add tests, but AFAIC this is ready to merge. Let me know

David96 commented 2 years ago

I'm not exactly sure how a test on this would look like so from my side, it's ready to merge, too.