Open gwolski opened 1 week ago
Usually what you need to do is resume the rollback. When you do that, you get an option to skip the failed resources. I'm assuming that the failure came from a custom resource like the ParallelCluster or UpdateHeadNode resource. Expand the section for the failed resources, select the resource that failed, and then continue the rollback. That should get you to a rollback complete state.
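If you'd rather script it than click through the console, here is a rough sketch of the same steps using boto3's CloudFormation client (`describe_stack_resources` and `continue_update_rollback`). The helper names and the stack name in the usage comment are placeholders I made up, not anything from aws-eda-slurm-cluster, and I'm assuming the failed resources sit directly in the stack you pass in (resources in nested stacks need a different ID format). Skipping a resource tells CloudFormation to treat it as rolled back without touching it, so check the real resource afterwards.

```python
def failed_resource_ids(client, stack_name):
    """Return logical IDs of resources in UPDATE_FAILED, the only state that can be skipped."""
    resp = client.describe_stack_resources(StackName=stack_name)
    return [
        r["LogicalResourceId"]
        for r in resp["StackResources"]
        if r["ResourceStatus"] == "UPDATE_FAILED"
    ]


def resume_rollback(client, stack_name):
    """Continue a stack stuck in UPDATE_ROLLBACK_FAILED, skipping the failed resources."""
    skip = failed_resource_ids(client, stack_name)
    # ResourcesToSkip marks these resources as rolled back without changing them.
    client.continue_update_rollback(StackName=stack_name, ResourcesToSkip=skip)
    return skip


# Usage (hypothetical stack name, needs AWS credentials and boto3 installed):
#   import boto3
#   cfn = boto3.client("cloudformation")
#   print(resume_rollback(cfn, "my-cluster-stack"))
```

Once the call returns, the stack should move through UPDATE_ROLLBACK_IN_PROGRESS on its way to the rollback complete state; you can watch it in the console or poll `describe_stacks`.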
I was trying to update the cluster. I added ClusterConfig/SlurmSettings/ScaledownIdletime to my cluster config file:
I had noticed earlier that when I reduced a node count, you STOPPED the cluster. I assumed you would do the same here. Maybe not. The cluster was not stopped, the update failed, and the stack ended up in UPDATE_ROLLBACK_FAILED. I stopped my cluster and tried the update again, but it can't do the update because the cluster is in a bad state. How does one get out of a bad state to continue using the aws-eda-slurm-cluster install.sh?
That said, I've discovered my above syntax is not valid. The ${cluster}-config gets created, but the actual cluster stack doesn't. Please document better how to add things like SlurmSettings. I just can't figure it out. I will file another issue for that.