aws-samples / aws-eda-slurm-cluster

AWS Slurm Cluster for EDA Workloads
MIT No Attribution

Performing an update w/o the cluster stopped messed up the CloudFormation stack and now I can't do any mods. #271

Open gwolski opened 1 week ago

gwolski commented 1 week ago

I was trying to do an update of the cluster. I added the ClusterConfig/SlurmSettings/ScaledownIdletime to my cluster config file:

```yaml
  ParallelClusterConfig:
    Version: 3.11.1
    Architecture: x86_64
    Image:
      Os: rocky8
      CustomAmi: ami-0d68c6538XXXXXXX
    DisableSimultaneousMultithreading: true
    ClusterConfig:
      SlurmSettings:
        ScaledownIdletime: 20
```

I had noticed earlier that when I reduced a node count, you STOPPED the cluster. I assumed you would do the same here. Maybe not. The cluster was not stopped, the update failed, and the stack went into UPDATE_ROLLBACK_FAILED. I stopped my cluster and tried the update again, but it can't do the update because the cluster is in a bad state. How does one get out of a bad state and continue using the aws-eda-slurm-cluster install.sh?

That said, I've discovered my syntax above is not valid: the ${cluster}-config stack gets created, but the actual cluster stack doesn't. Please document better how to add things like SlurmSettings. I just can't figure it out. I will file another issue for that.
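For what it's worth, in ParallelCluster's own config schema `SlurmSettings` is nested under `Scheduling` rather than at the top level. If the `ClusterConfig` section here mirrors the native ParallelCluster layout (an assumption on my part — I haven't confirmed how this repo maps it), the setting might need to look something like:

```yaml
# Hypothetical layout, assuming ClusterConfig mirrors the native
# ParallelCluster schema, where SlurmSettings lives under Scheduling:
  ParallelClusterConfig:
    Version: 3.11.1
    Architecture: x86_64
    ClusterConfig:
      Scheduling:
        SlurmSettings:
          ScaledownIdletime: 20
```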

cartalla commented 1 week ago

Usually what you'll need to do is resume the rollback. When you do that, you'll get an option to skip the failed resource. I'm assuming that the failure came from a custom resource like the ParallelCluster or UpdateHeadNode resource. Expand the section for the failed resources, select them, and then continue the rollback. That should get you to a rollback-complete state.
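If the console flow is hard to find, the same thing can be done with the AWS CLI's `continue-update-rollback`, skipping the stuck resources. A sketch, with placeholder values — the stack name and logical resource IDs below are examples, so substitute the ones shown as failed in your stack's events:

```shell
# Resume the rollback, skipping the failed custom resource(s).
# Replace the stack name and logical IDs with your own
# (find them under the stack's Events/Resources tabs in the console).
aws cloudformation continue-update-rollback \
  --stack-name my-eda-cluster \
  --resources-to-skip ParallelClusterCluster UpdateHeadNode

# Confirm the stack reaches UPDATE_ROLLBACK_COMPLETE before retrying
# the update with install.sh.
aws cloudformation describe-stacks \
  --stack-name my-eda-cluster \
  --query 'Stacks[0].StackStatus'
```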