Azure / cyclecloud-slurm

Azure CycleCloud project to enable users to create, configure, and use Slurm HPC clusters.
MIT License

Feature: resync nodes with CycleCloud after failures #275

Open ryanhamel opened 1 month ago

ryanhamel commented 1 month ago

When a suspend fails (or, to a lesser extent, a resume), we need a tool to quickly resync the node state in CycleCloud with that of Slurm.

This most likely needs to be interactive by default, with automated actions taking place only if something like --force is passed.
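
To make the interactive-by-default idea concrete, here is a rough sketch of how the prompt and a `--force` switch could fit together. Everything in it is hypothetical: the function names, the state labels, and the choice to treat Slurm as the source of truth are assumptions for illustration, not the project's actual API or an agreed design. The Slurm and CycleCloud lookups are stubbed out as plain dictionaries.

```python
import argparse
from typing import Callable, Dict, List, Optional, Tuple

Mismatch = Tuple[str, str, str]  # (node name, Slurm state, CycleCloud state)


def find_mismatches(slurm: Dict[str, str], cyclecloud: Dict[str, str]) -> List[Mismatch]:
    """List every node whose state differs between the two systems."""
    mismatches = []
    for node in sorted(set(slurm) | set(cyclecloud)):
        s = slurm.get(node, "missing")
        c = cyclecloud.get(node, "missing")
        if s != c:
            mismatches.append((node, s, c))
    return mismatches


def resync(
    mismatches: List[Mismatch],
    force: bool,
    confirm: Callable[[str], bool],
    apply: Callable[[str, str], None],
) -> List[str]:
    """Push Slurm's state to each mismatched node, asking first unless force is set."""
    fixed = []
    for node, s, c in mismatches:
        prompt = f"{node}: Slurm={s}, CycleCloud={c}. Resync? [y/N] "
        if force or confirm(prompt):
            apply(node, s)  # assumption: Slurm's view wins
            fixed.append(node)
    return fixed


def ask(prompt: str) -> bool:
    """Interactive confirmation; anything other than y/yes counts as no."""
    return input(prompt).strip().lower() in ("y", "yes")


def main(argv: Optional[List[str]] = None) -> List[str]:
    parser = argparse.ArgumentParser(description="Sketch of a resync tool")
    parser.add_argument("--force", action="store_true",
                        help="apply every fix without prompting")
    args = parser.parse_args(argv)

    # Placeholder data; a real tool would query Slurm and CycleCloud here.
    slurm = {"htc-1": "powered_down", "htc-2": "idle"}
    cyclecloud = {"htc-1": "running", "htc-2": "idle"}

    def apply(node: str, state: str) -> None:
        print(f"would set {node} to {state} in CycleCloud")

    return resync(find_mismatches(slurm, cyclecloud), args.force, ask, apply)
```

Calling `main([])` would prompt once per mismatched node, while `main(["--force"])` would apply every fix without asking. Passing `confirm` and `apply` in as callables keeps the decision logic testable without a live cluster.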

aditigaur4 commented 1 month ago

This will be very useful for CC keep-alive, which currently does not sync with Slurm.