A few times now we've seen the state become mismatched: the real state of the VMs from the cloud provider's perspective differs from what Slurm thinks is true.
I think the simplest solution is a daemon that runs on the management node, continually checks for consistency, and corrects anything it can.
This is now in place. It does not currently fix any state; it only reports mismatched states. The code is in clusterinthecloud/python-citc.
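
For anyone wanting a feel for what the report-only check involves, here is a minimal sketch. It is not the code from python-citc: the function name, the state names, and the `EXPECTED` table are all hypothetical, and it assumes the Slurm and cloud states have already been collected into plain dictionaries keyed by node name.

```python
from typing import Dict, List, NamedTuple, Optional


class Mismatch(NamedTuple):
    node: str
    slurm_state: Optional[str]  # None if Slurm does not know the node
    cloud_state: str            # "absent" if the provider has no such VM


# Hypothetical table: which cloud states are consistent with each Slurm state.
# The state names are illustrative, not real Slurm or provider values.
EXPECTED = {
    "idle": {"running"},
    "allocated": {"running"},
    "powered_down": {"absent"},
}


def find_mismatches(slurm: Dict[str, str], cloud: Dict[str, str]) -> List[Mismatch]:
    """Report nodes whose Slurm state and cloud state disagree (no fixing)."""
    mismatches = []
    # Look at every node known to either side, in a stable order.
    for node in sorted(set(slurm) | set(cloud)):
        slurm_state = slurm.get(node)
        cloud_state = cloud.get(node, "absent")
        # Nodes unknown to Slurm, or in an unlisted Slurm state, are always reported.
        allowed = EXPECTED.get(slurm_state, set())
        if cloud_state not in allowed:
            mismatches.append(Mismatch(node, slurm_state, cloud_state))
    return mismatches


if __name__ == "__main__":
    slurm = {"n1": "idle", "n2": "idle", "n3": "powered_down"}
    cloud = {"n1": "running", "n3": "running", "n4": "running"}
    for m in find_mismatches(slurm, cloud):
        print(f"{m.node}: slurm={m.slurm_state} cloud={m.cloud_state}")
```

Keeping the comparison as a pure function over two dictionaries means a correction step could later be added on top of the reported mismatches without changing how they are detected.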