JuliaParallel / ClusterManagers.jl


[SlurmManager] 100 % CPU usage while waiting for the job to get created #173

Closed: David96 closed this issue 3 years ago

David96 commented 3 years ago

Dear all,

I just noticed that Julia consumes 100 % CPU while waiting for a Slurm job to be created. This is likely caused by this code: https://github.com/JuliaParallel/ClusterManagers.jl/blob/861b301deac77b84bbd7ed7bb44e3c964e515573/src/slurm.jl#L62-L79 Note the while true loop. This doesn't matter much if jobs are created instantly, but it is quite common (at least on the cluster I have access to) to wait quite a while for all jobs to be assigned resources. During that time the loop causes very high CPU usage on the login node, which only makes life worse for other people by hogging shared resources.
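For illustration, the pattern boils down to something like the following simplified sketch (not the exact code from slurm.jl; the file name is made up):

```julia
# Simplified sketch of the busy-wait: poll for the per-worker output file
# as fast as possible, pinning one core on the login node the whole time.
fn = "julia-12345.out"          # hypothetical per-worker output file
while true
    if isfile(fn) && filesize(fn) > 0
        break                   # worker has started and written its connection info
    end
    # no sleep/yield here, so the loop spins at 100 % CPU until Slurm starts the job
end
```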

I'm not sure what the "right" way of fixing this is; a (temporary) solution could be to just add a sleep to the while loop so it consumes fewer resources. If you think that's appropriate, I can open a pull request for that; if not, I'm open to suggestions.

Best regards David

kescobo commented 3 years ago

Perhaps we should put a sleep(0.1) or something at the end of the loop so that it doesn't spin out of control?
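Something like this, assuming the same polling loop sketched above (a sketch, not an actual patch):

```julia
while true
    if isfile(fn) && filesize(fn) > 0
        break
    end
    sleep(0.1)   # yield for 100 ms between filesystem checks instead of spinning
end
```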

EDIT: Reading comprehension fail, you already said that :facepalm:

kescobo commented 3 years ago

One additional option might be to add some logic so that if it's stalled for more than X seconds, it errors out. I have no sense of what a good default for X is, or whether it should be user-configurable, but I could open a separate issue for that if it's outside the scope of the PR you want to make. I'm fine with a simple addition of sleep if you want to spin that up.
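A rough sketch of what that could look like, with made-up names (wait_for_worker, output_ready, worker_wait_timeout, and poll_interval are illustrative, not existing options of SlurmManager):

```julia
# Poll with a sleep between checks, and give up once a configurable deadline has passed.
function wait_for_worker(output_ready::Function; worker_wait_timeout = 60.0, poll_interval = 0.1)
    deadline = time() + worker_wait_timeout
    while !output_ready()
        time() > deadline && error("Slurm job did not start within $(worker_wait_timeout) seconds")
        sleep(poll_interval)
    end
end

# Example use with the same file check as in the loop above:
wait_for_worker(() -> isfile("julia-12345.out") && filesize("julia-12345.out") > 0;
                worker_wait_timeout = 300)
```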

David96 commented 3 years ago

I opened a pull request for the sleep addition. I think adding a timeout would also be helpful, but I don't believe there's a good "default" for it: depending on how busy a cluster is, the usual waiting times can differ by a lot. A related problem is that this loop also doesn't stop if the job gets cancelled before it has actually started, which a timeout would also more or less handle.
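One way to detect the cancelled-job case while polling would be to ask Slurm for the job state, for example with squeue (a hypothetical sketch; the job id is made up, and the flags are -j to select the job, -h to suppress the header, and -o %T to print only the state):

```julia
# Query the job state from Slurm; returns e.g. "PENDING", "RUNNING", "CANCELLED",
# or an empty string once the job has left the queue entirely.
job_state(jobid) = strip(read(`squeue -j $jobid -h -o %T`, String))

state = job_state(12345)
if isempty(state) || state == "CANCELLED"
    error("Slurm job 12345 is no longer pending or running; stop waiting for its workers")
end
```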

DrChainsaw commented 3 years ago

Fwiw, the LSF manager lets the user supply an iterator of retry delays. The default is an exponential backoff with a maximum number of attempts, but I tend to just use Iterators.cycle(5) so I don't have to deal with timeouts, as the startup time at my place is extremely unpredictable.

https://github.com/JuliaParallel/ClusterManagers.jl/blob/861b301deac77b84bbd7ed7bb44e3c964e515573/src/lsf.jl#L43
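A sketch of how such an iterator of delays can drive the waiting loop (wait_for_file is a made-up helper, not part of the package; ExponentialBackOff and Iterators.cycle are standard Julia):

```julia
# Drive the polling loop from an iterator of delays, in the spirit of the
# LSF manager's retry_delays keyword: the iterator controls both the sleep
# between checks and, through its length, the effective timeout.
function wait_for_file(fn; retry_delays = ExponentialBackOff(n = 10, first_delay = 1, max_delay = 60))
    for delay in retry_delays
        isfile(fn) && filesize(fn) > 0 && return true
        sleep(delay)
    end
    return false   # ran out of retries
end

# A finite iterator acts as a timeout; an infinite one waits forever,
# checking every five seconds.
wait_for_file("julia-12345.out")
wait_for_file("julia-12345.out"; retry_delays = Iterators.cycle(5))
```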

David96 commented 3 years ago

That actually sounds like quite a smart solution to me, as it combines limited resource usage with a timeout in an easy-to-use way. I can try to spin up a PR for that too, but it won't be this week.

kescobo commented 3 years ago

Alright, I'm gonna merge the PR for now, but I'm looking forward to the next one that supersedes it.