Nodes that can't boot due to limits get lost forever

arnaudfroidmont commented 5 years ago

First of all, great job on this.

I just noticed a problem when you try to launch more nodes than your limits allow. It puts an error in the log, but it has to wait until the end of the timeout to mark the machine as down. After that, it never tries to restart the machine again, even though your limit may since have been raised, or someone else in your tenancy may have turned off their machine.

What would be nice is that if the machine can't boot, it instantly releases the job that is assigned to it and gets put in a state where it can be restarted after a certain timeout.

milliams commented 5 years ago

You're right, this can be an issue. The root cause is that there is not much direct communication between OCI and Slurm so the two can get out of sync. I think what we would want to do here is make sure that any nodes which are marked as down due to service limits etc. get resuscitated as soon as possible. We might have to create some sort of watchdog process to manage this kind of thing.
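To make the watchdog idea a bit more concrete, here is a rough sketch of what a single pass could look like. It is not an existing part of the project. It assumes the standard Slurm `sinfo` and `scontrol` commands are on the path, and the function names and the `RETRYABLE_REASONS` substrings are hypothetical, so they would need adjusting to whatever reason text actually shows up on the down nodes.

```python
import subprocess

# Assumption: substrings of the Slurm "Reason" field that suggest the node
# failed to boot (for example because of a service limit). Hypothetical values.
RETRYABLE_REASONS = ("ResumeTimeout", "LimitExceeded")


def parse_down_nodes(sinfo_output):
    """Turn 'node|reason' lines into a {node: reason} dict."""
    nodes = {}
    for line in sinfo_output.splitlines():
        if "|" not in line:
            continue  # skip blank or unexpected lines
        name, reason = line.split("|", 1)
        nodes[name.strip()] = reason.strip()
    return nodes


def nodes_to_resume(down_nodes, retryable=RETRYABLE_REASONS):
    """Pick the down nodes whose reason matches one of the retryable substrings."""
    return sorted(
        name
        for name, reason in down_nodes.items()
        if any(key.lower() in reason.lower() for key in retryable)
    )


def resume_command(node):
    """Build the scontrol call that puts a node back into service."""
    return ["scontrol", "update", f"NodeName={node}", "State=RESUME"]


def run_once():
    """One watchdog pass: list down nodes, then resume the retryable ones."""
    result = subprocess.run(
        ["sinfo", "--noheader", "-N", "-t", "down", "-o", "%N|%E"],
        capture_output=True, text=True, check=True,
    )
    for node in nodes_to_resume(parse_down_nodes(result.stdout)):
        subprocess.run(resume_command(node), check=True)
```

Something like `run_once()` could then be called periodically, for example from cron or a systemd timer. Filtering on the reason text is meant to avoid resuming nodes that are down for unrelated reasons.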

arnaudfroidmont commented 5 years ago

That'd be awesome. Let me know if I can be of any help on this.

clusterinthecloud / support

Nodes that can't boot due to limits get lost forever #3