GoogleCloudPlatform / cluster-toolkit

Cluster Toolkit is open-source software offered by Google Cloud that makes it easy for customers to deploy AI/ML and HPC environments on Google Cloud.
Apache License 2.0

Using a newer version of Terraform can lead to controller replacement on reconfigure for Slurm GCP v6 #2774

Open nick-stroud opened 1 month ago

nick-stroud commented 1 month ago

Describe the bug

For most changes, when reconfiguring a Slurm GCP v6 cluster, the controller should not be destroyed. This allows for operations such as adding a partition to be done on a running cluster.

With newer versions of Terraform, these reconfigure operations destroy and recreate the controller instead of updating it in place.

This can be very disruptive to a running cluster, because state such as the current queue and accounting information may be lost.

Terraform version 1.5 is known to have the intended behavior. Terraform version 1.7 is known to exhibit the bad behavior. This bug is caused by a change in behavior of how Terraform treats drift between state and Terraform code.

The maintainers of this repository are aware of this issue and are working to implement short-term and long-term fixes. If your workflow includes reconfiguring running Slurm GCP v6 clusters, do not upgrade beyond Terraform 1.5 until this bug has been addressed.
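As a stopgap, a workflow could check the installed Terraform version before reconfiguring. The sketch below is illustrative only and is not the toolkit's own validator: the function names are hypothetical, and the threshold assumes, per the description above, that 1.5 is the newest minor version known to behave as intended (versions between 1.5 and 1.7 are treated as not known good). It relies on `terraform version -json`, which reports a `terraform_version` field.

```python
import json
import subprocess

# Assumption taken from this issue: 1.5 is the newest minor version
# known to update the controller in place.
LAST_KNOWN_GOOD = (1, 5)


def parse_major_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) from a string such as '1.7.4' or 'v1.5.0'."""
    parts = version.strip().lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def is_known_good(version: str) -> bool:
    """True if the version is at or below the last known-good minor release."""
    return parse_major_minor(version) <= LAST_KNOWN_GOOD


def installed_terraform_version() -> str:
    """Ask the local terraform binary for its version (requires terraform on PATH)."""
    result = subprocess.run(
        ["terraform", "version", "-json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)["terraform_version"]
```

Calling `is_known_good(installed_terraform_version())` before a reconfigure gives a simple go/no-go signal; comparing tuples rather than strings keeps a version such as 1.10 from sorting below 1.5.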

Steps to reproduce

Steps to reproduce the behavior:

  1. Install Terraform 1.7
  2. Deploy examples/hpc-slurm.yaml
  3. Add a partition to the blueprint
  4. Re-deploy the blueprint (ghpc deploy -w)

Expected behavior

The partition is added to the cluster without impacting the queue, the accounting database, or running jobs. The controller VM is modified in place and is not deleted.

Actual behavior

The controller is deleted and a new controller VM is created.
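One way to catch this before applying is to inspect a saved plan in Terraform's JSON form (for example, `terraform plan -out=tfplan` followed by `terraform show -json tfplan`), where a replacement appears as a resource change whose `actions` contain both `delete` and `create`. The sketch below is a hedged illustration, not part of the toolkit: the function name, the `"controller"` substring filter, and the resource addresses in the usage example are assumptions, and where to run the plan inside a deployment folder is left to the reader.

```python
def replaced_resources(plan: dict, name_hint: str = "controller") -> list[str]:
    """List addresses of resources the plan would destroy and recreate.

    `plan` is the parsed output of `terraform show -json <planfile>`.
    `name_hint` is a hypothetical substring used to pick out the controller.
    """
    replaced = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        address = change.get("address", "")
        # A replacement is reported as ["delete", "create"] or ["create", "delete"].
        if "delete" in actions and "create" in actions and name_hint in address:
            replaced.append(address)
    return replaced


# Usage with a hand-written plan fragment (addresses are made up for illustration).
example_plan = {
    "resource_changes": [
        {
            "address": "module.slurm_controller.google_compute_instance.example",
            "change": {"actions": ["delete", "create"]},
        },
        {
            "address": "module.partition.google_compute_instance_template.example",
            "change": {"actions": ["update"]},
        },
    ]
}
print(replaced_resources(example_plan))
```

An empty list suggests the plan would not replace anything matching the hint; a non-empty list is a signal to stop before applying.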

Blueprint

Any Slurm GCP v6 blueprint.

TimZaman commented 3 weeks ago

Running into this while doing the (official!) slurm-on-gcp tutorial (hpc toolkit)

./ghpc create examples/hpc-slurm.yaml -l ERROR --vars project_id=personal-235500
validator "test_tf_version_for_slurm" failed:
Error: using a newer version of Terraform can lead to controller replacement on reconfigure for Slurm GCP v6

Please be advised of this known issue: https://github.com/GoogleCloudPlatform/hpc-toolkit/issues/2774
Until resolved it is advised to use Terraform 1.4.0 with Slurm deployments.

To silence this warning, add flag: --skip-validators=test_tf_version_for_slurm

One or more blueprint validators has failed. See messages above for suggested
actions. General troubleshooting guidance and instructions for configuring
validators are shown below.

- https://goo.gle/hpc-toolkit-troubleshooting
- https://goo.gle/hpc-toolkit-validation

Validators can be silenced or treated as warnings or errors:

- https://goo.gle/hpc-toolkit-validation-levels

nick-stroud commented 2 weeks ago

@mr0re1 has fixed this on develop. It will be included in the next release.