clusterinthecloud / terraform

Terraform config for Cluster in the Cloud
https://cluster-in-the-cloud.readthedocs.io
MIT License

race condition when running terraform destroy if filesystem destroyed before compute nodes #22

Closed christopheredsall closed 4 years ago

christopheredsall commented 5 years ago

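Occasionally, running terraform destroy on a cluster takes a long time and eventually fails. This seems to be because the filesystem that contains /mnt/shared/etc/slurm/slurm.conf is destroyed before the management node's remote-exec step has read it to get the compute node info. In the log below, sinfo and scontrol then fail to parse the config, and the subnets later cannot be deleted because VNICs (presumably those of the compute nodes that were never terminated) still reference them.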

oci_file_storage_export.ClusterFSExport: Destroying... (ID: ocid1.export.oc1.iad.aaaaaa4np2soonqpnfqwillqojxwiotjmfsc2ylefuzaaaaa)
oci_core_instance.ClusterManagement: Destroying... (ID: ocid1.instance.oc1.iad.abuwcljtvbv6codx...barl7h54uuypnd7e6golvv7gqy3xgx6vsd2ntq)
oci_core_instance.ClusterManagement: Provisioning with 'remote-exec'...
oci_core_instance.ClusterManagement (remote-exec): Connecting to remote host via SSH...
oci_core_instance.ClusterManagement (remote-exec):   Host: 132.145.208.237
oci_core_instance.ClusterManagement (remote-exec):   User: opc
oci_core_instance.ClusterManagement (remote-exec):   Password: false
oci_core_instance.ClusterManagement (remote-exec):   Private key: true
oci_core_instance.ClusterManagement (remote-exec):   SSH Agent: false
oci_core_instance.ClusterManagement (remote-exec):   Checking Host Key: false
oci_core_instance.ClusterManagement (remote-exec): Connected!
oci_file_storage_export.ClusterFSExport: Destruction complete after 0s
oci_file_storage_file_system.ClusterFS: Destroying... (ID: ocid1.filesystem.oc1.iad.aaaaaaaaaaaalsqhnfqwillqojxwiotjmfsc2ylefuzaaaaa)
oci_file_storage_mount_target.ClusterFSMountTarget: Destroying... (ID: ocid1.mounttarget.oc1.iad.aaaaaby27ve23nwonfqwillqojxwiotjmfsc2ylefuzaaaaa)
oci_core_instance.ClusterManagement (remote-exec): Terminating any remaining compute nodes
oci_core_instance.ClusterManagement (remote-exec): sinfo: error: s_p_parse_file: unable to read "/mnt/shared/etc/slurm/slurm.conf": Unknown error 521
oci_core_instance.ClusterManagement (remote-exec): sinfo: error: "Include" failed in file /etc/slurm/slurm.conf line 34
oci_core_instance.ClusterManagement (remote-exec): sinfo: fatal: Unable to process configuration file
oci_core_instance.ClusterManagement (remote-exec): scontrol: error: s_p_parse_file: unable to read "/mnt/shared/etc/slurm/slurm.conf": Unknown error 521
oci_core_instance.ClusterManagement (remote-exec): scontrol: error: "Include" failed in file /etc/slurm/slurm.conf line 34
oci_core_instance.ClusterManagement (remote-exec): scontrol: fatal: Unable to process configuration file
oci_file_storage_file_system.ClusterFS: Destruction complete after 2s
oci_file_storage_mount_target.ClusterFSMountTarget: Destruction complete after 3s
oci_core_instance.ClusterManagement (remote-exec): Node termination request completed
oci_core_instance.ClusterManagement: Still destroying... (ID: ocid1.instance.oc1.iad.abuwcljtvbv6codx...barl7h54uuypnd7e6golvv7gqy3xgx6vsd2ntq, 10s elapsed)
oci_core_instance.ClusterManagement: Still destroying... (ID: ocid1.instance.oc1.iad.abuwcljtvbv6codx...barl7h54uuypnd7e6golvv7gqy3xgx6vsd2ntq, 20s elapsed)
[ ... ]
Error: Error applying plan:

2 error(s) occurred:

* oci_core_subnet.ClusterSubnet[0] (destroy): 1 error(s) occurred:

* oci_core_subnet.ClusterSubnet.0: Service error:Conflict. The Subnet ocid1.subnet.oc1.iad.aaaaaaaa3ow3n2tn3gx34nvlmrbnwgprq5fznksy5dsjhaquyxkxts3e5ala references the VNIC ocid1.vnic.oc1.iad.abuwcljtei434zygcrm6fa7pcwq44fgxnjq2wufkyt56noickqx6cs4obora. You must remove the reference to proceed with this operation.. http status code: 409. Opc request id: d5f0804c3e00a886e9d89398c48b3825/64ED088EEA7C23F64C7F35855CACDB98/12AB2FC3E72046E7B981E999DB8AF089
* oci_core_subnet.ClusterSubnet[1] (destroy): 1 error(s) occurred:

* oci_core_subnet.ClusterSubnet.1: Service error:Conflict. The Subnet ocid1.subnet.oc1.iad.aaaaaaaad3264gcho6pjcjqb4p32b5nac4i7kq7qo2vtjdynehgxbquavmeq references the VNIC ocid1.vnic.oc1.iad.abuwcljs2a7e2rn5jioe2adaq3eflwzq6t32bvw77lnp5gvxweyoivfzsdvq. You must remove the reference to proceed with this operation.. http status code: 409. Opc request id: 316eb18e0b83fd440c6d5e23c3255e22/F22821F37412A0643FD9589ACF03DD17/2E8E0525A37E40FB97B885FA417B016C
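
One possible way to remove the race, sketched here as an assumption rather than as the fix that was actually applied, is to make the management instance depend explicitly on the filesystem export. Terraform destroys a resource before the things it depends on, so the instance and its destroy-time remote-exec would finish before the export is removed. The resource names come from the log above; the provisioner command is a hypothetical placeholder, all other instance arguments are omitted, and it is assumed that the export in turn references the filesystem and mount target so they are ordered after it as well.

```hcl
# Fragment only: merge into the existing resource definition.
# Written in Terraform 0.12+ syntax; 0.11 quotes the depends_on entry
# and uses when = "destroy".
resource "oci_core_instance" "ClusterManagement" {
  # (existing instance arguments unchanged and omitted here)

  # Explicit dependency: on destroy, this instance is torn down first,
  # while /mnt/shared is still backed by the filesystem export.
  depends_on = [
    oci_file_storage_export.ClusterFSExport,
  ]

  provisioner "remote-exec" {
    when = destroy

    # Hypothetical placeholder for the existing node-termination command.
    inline = [
      "echo 'terminate any remaining compute nodes here'",
    ]
  }
}
```

With a dependency like this, the plan should show oci_core_instance.ClusterManagement being destroyed before oci_file_storage_export.ClusterFSExport, rather than both starting at the same time as in the log above.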