Azure / cyclecloud-slurm

Azure CycleCloud project to enable users to create, configure, and use Slurm HPC clusters.
MIT License

Feature: resync nodes with CycleCloud after failures #275

Open ryanhamel opened 4 months ago

ryanhamel commented 4 months ago

When a suspend fails (or, to a lesser extent, a resume), we need a tool to quickly resync the node state in CycleCloud with that of Slurm.

This most likely needs to be interactive by default, with automated actions taking place only if something like --force is passed.
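
To make the proposal concrete, here is a minimal sketch of the interactive-by-default flow. It is not existing cyclecloud-slurm code. The function names (`find_mismatches`, `resync`, `parse_args`), the simplified state labels, and the `apply_fix` callback are all hypothetical. How node states are actually fetched from Slurm and CycleCloud, and how a mismatch is repaired, are left to the caller.

```python
import argparse
from typing import Callable, Dict, List, Optional, Tuple

Mismatch = Tuple[str, str, str]  # (node name, Slurm state, CycleCloud state)


def find_mismatches(slurm_states: Dict[str, str],
                    cc_states: Dict[str, str]) -> List[Mismatch]:
    """Return the nodes whose state differs between the two systems.

    Assumption: both inputs were already normalized to the same simplified
    labels (e.g. "up"/"down"). A node known to only one side is reported
    with the placeholder state "missing" on the other side.
    """
    mismatches = []
    for node in sorted(set(slurm_states) | set(cc_states)):
        slurm_state = slurm_states.get(node, "missing")
        cc_state = cc_states.get(node, "missing")
        if slurm_state != cc_state:
            mismatches.append((node, slurm_state, cc_state))
    return mismatches


def resync(mismatches: List[Mismatch],
           apply_fix: Callable[[str, str, str], None],
           force: bool = False,
           ask: Callable[[str], str] = input) -> List[str]:
    """Apply apply_fix to each mismatch and return the nodes that were fixed.

    Without force, every node is confirmed through `ask` first, and anything
    other than "y" skips the node. With force, no prompt is shown.
    """
    fixed = []
    for node, slurm_state, cc_state in mismatches:
        if not force:
            prompt = f"{node}: Slurm={slurm_state}, CycleCloud={cc_state}. Resync? [y/N] "
            if ask(prompt).strip().lower() != "y":
                continue
        apply_fix(node, slurm_state, cc_state)
        fixed.append(node)
    return fixed


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the hypothetical command line; --force disables the prompts."""
    parser = argparse.ArgumentParser(description="Resync node state (sketch).")
    parser.add_argument("--force", action="store_true",
                        help="apply all fixes without asking")
    return parser.parse_args(argv)
```

Keeping the prompt behind an injectable `ask` callable means the same code path can be driven by a person at a terminal or by a script, and `--force` only changes whether the question is asked.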

aditigaur4 commented 4 months ago

This will be very useful for CC keep alive, which currently does not sync with Slurm.