Azure / az-hop

The Azure HPC On-Demand Platform provides an HPC Cluster Ready solution
https://azure.github.io/az-hop/
MIT License

Ansible doesn't update slurm queue configs #991

Open matt-chan opened 1 year ago

matt-chan commented 1 year ago

In what area(s)?

/area administration /area ansible /area autoscaling /area configuration /area cyclecloud /area documentation /area image

/area job-scheduling

/area monitoring /area ood /area remote-visualization /area user-management

Expected Behavior

cyclecloud.conf and topology.conf should reflect config.yml.

Actual Behavior

Adding a queue to config.yml and later removing it doesn't remove the corresponding entry from cyclecloud.conf or topology.conf.

Stale queues also accumulate, so the cluster quickly hits the 10-queue limit on cyclecloud, and the leftover entries can lead to conflicts in the conf files.

Steps to Reproduce the Problem

Create a config.yml that defines some queues and uses slurm. Remove some of the queues, then re-run install.sh for cccluster and scheduler. cyclecloud.conf and topology.conf do not reflect the removal.

matt-chan commented 1 year ago

After a bit of debugging, it looks like it might be related to the way that cyclecloud-slurm parses information from the cluster. I'm not sure where the root cause is because I get lost after an API call.

I believe the offending function is here: https://github.com/Azure/cyclecloud-slurm/blob/7b26c3f8bd8180eb0f16fc6c15db17f1fb42ba4d/specs/default/chef/site-cookbooks/slurm/files/default/cyclecloud_slurm.py#L348

Also, an easier way to generate the offending cyclecloud.conf is to run /opt/cycle/slurm/cyclecloud_slurm.sh slurm_conf
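A quick way to check the regenerated file against config.yml is to compare partition names. The sketch below is only an illustration: it assumes the conf file declares partitions on standard Slurm `PartitionName=` lines, and the sample text and the helper names `partitions_in_conf` and `stale_partitions` are hypothetical, not part of az-hop or cyclecloud-slurm.

```python
import re


def partitions_in_conf(conf_text):
    """Return the partition names declared on PartitionName= lines."""
    names = set()
    for line in conf_text.splitlines():
        line = line.strip()
        if line.startswith("#"):
            continue  # skip commented-out entries
        match = re.match(r"PartitionName=(\S+)", line, flags=re.IGNORECASE)
        if match:
            names.add(match.group(1))
    return names


def stale_partitions(conf_text, expected_queues):
    """Partitions present in the conf text but absent from the expected queues."""
    return sorted(partitions_in_conf(conf_text) - set(expected_queues))


# Hypothetical excerpt for illustration, not copied from a real cluster.
sample_conf = """\
# generated partitions
PartitionName=test_d4 Nodes=test_d4-[1-4] Default=YES
PartitionName=nc24-high Nodes=nc24-high-[1-4] Default=NO
"""

# Queue names as they would be listed in config.yml (assumed input).
print(stale_partitions(sample_conf, ["test_d4"]))  # ['nc24-high']
```

In practice the conf text would come from reading the generated cyclecloud.conf, and the expected names from the queues defined in config.yml.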

matt-chan commented 1 year ago

Okay, I did more debugging and I think I found the source of the bug.

The ansible scripts and cyclecloud are both mostly working as expected. The issue is that when the cyclecloud_cluster playbook calls cyclecloud to generate the cluster, it uses import_cluster: https://github.com/Azure/az-hop/blob/a61e60c828102399047a1723cd1ce5dc8e66c540/playbooks/roles/cyclecloud_cluster/tasks/main.yml#L145

This will add any queues into the existing cyclecloud cluster, and will not destroy any of the existing ones. I know the --force help text says it will destroy and recreate, but it isn't happening. The template config files only show the new queues (attached, see below), while the cluster status API (also attached) reports the set-union of all queues ever defined since cccluster was first installed.

@xpillons , would it be possible to change this behavior to simply destroy the cluster and recreate it please? Or ideally just to destroy the queues which are not defined within config.yml anymore (I know this is way harder).
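To make the mismatch concrete, here is a minimal sketch that lists the nodearrays the cluster status reports but the current template no longer defines. It assumes the status response has the shape of the attachment below (a top-level `nodearrays` list whose entries carry a `name`); the function name `stale_nodearrays` is hypothetical, and the status JSON is trimmed to the names only.

```python
import json


def stale_nodearrays(status, configured_queues):
    """Nodearrays reported by the cluster status but not in the configured queues."""
    present = {entry["name"] for entry in status.get("nodearrays", [])}
    return sorted(present - set(configured_queues))


# Trimmed to the nodearray names that appear in the attached status output.
status = json.loads("""
{"nodearrays": [{"name": "ibtest_16"},
                {"name": "nc24-high"},
                {"name": "nc24ads-A100-v4-high"},
                {"name": "test_d4"}]}
""")

# Queue names that appear in the attached azhop-slurm.txt template.
configured = ["test_d4", "nc24ads-A100-v4-high", "ibtest_16"]

print(stale_nodearrays(status, configured))  # ['nc24-high']
```

With the attached data, nc24-high is the queue that was removed from config.yml but still exists in the cluster.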


# azhop-slurm.txt
...
    [[node nodearraybase]]
    Abstract = true
        [[[configuration]]]
        slurm.autoscale = true
        #slurm.node_prefix = ${ifThenElse(NodeNamePrefix=="Cluster Prefix", StrJoin("-", ClusterName, ""), NodeNamePrefix)}
        slurm.use_nodename_as_hostname = true
        slurm.dampen_memory = 8 # Reservation of 8% of the node's memory
        [[[cluster-init cyclecloud/slurm:execute:2.6.2]]]

    [[nodearray test_d4]]
    Extends = nodearraybase
    MachineType = Standard_D4s_v5
    MaxCoreCount = 2400
    EnableAcceleratedNetworking = True
    # Lookup image version for that queue
    ImageName = /subscriptions/e7cca478-89b5-4f94-a081-4ad6ad37b08d/resourceGroups/az-hop-dev2/providers/Microsoft.Compute/galleries/azhop_1lrygvxz/images/azhop-centos79-v2-rdma-gpgpu/versions/7.9.220722214
        [[[configuration]]]
        slurm.partition = test_d4
        slurm.default_partition = true
        slurm.hpc = true
        [[[cluster-init enroot:default:1.0.0]]]

    [[nodearray nc24ads-A100-v4-high]]
    Extends = nodearraybase
    MachineType = Standard_NC24ads_A100_v4
    MaxCoreCount = 2400
    # Lookup image version for that queue
    ImageName = /subscriptions/e7cca478-89b5-4f94-a081-4ad6ad37b08d/resourceGroups/az-hop-dev2/providers/Microsoft.Compute/galleries/azhop_1lrygvxz/images/azhop-centos79-v2-rdma-gpgpu/versions/7.9.220722214
        [[[configuration]]]
        slurm.partition = nc24ads-A100-v4-high
        slurm.hpc = true
        [[[cluster-init enroot:default:1.0.0]]]

    [[nodearray ibtest_16]]
    Extends = nodearraybase
    MachineType = Standard_HB120-16rs_v2
    MaxCoreCount = 12000
    EnableAcceleratedNetworking = True
    # Lookup image version for that queue
    ImageName = /subscriptions/e7cca478-89b5-4f94-a081-4ad6ad37b08d/resourceGroups/az-hop-dev2/providers/Microsoft.Compute/galleries/azhop_1lrygvxz/images/azhop-centos79-v2-rdma-gpgpu/versions/7.9.220722214
        [[[configuration]]]
        slurm.partition = ibtest_16
        slurm.hpc = True
        [[[cluster-init enroot:default:1.0.0]]]

# cluster/slurm1/status API
{
  "state" : "Started",
  "targetState" : "Started",
  "maxCount" : 10000,
  "maxCoreCount" : 1000000,
  "nodearrays" : [ {
    "name" : "ibtest_16",
    "maxCount" : 10000,
    "maxCoreCount" : 12000,
    "nodearray" : {
      "UsePublicNetwork" : false,
      "Extends" : "nodearraybase",
      "Volumes" : {
        "boot" : {
          "StorageAccountType" : "StandardSSD_LRS",
          "Name" : "boot"
        }
      },
      "SubnetId" : "az-hop-dev2/hpcvnet/compute",
      "Region" : "eastus",
      "Status" : "Activated",
      "State" : "Activated",
      "MaxCoreCount" : 12000,
      "MachineType" : "Standard_HB120-16rs_v2",
      "ActivePhases" : [ ],
      "Credentials" : "azure",
      "EnableAcceleratedNetworking" : true,
      "AccountName" : "azure",
      "Configuration" : {
        "cyclecloud" : {
          "exports" : {
            "sched" : {
              "samba" : {
                "enabled" : false
              },
              "disabled" : true
            },
            "shared" : {
              "samba" : {
                "enabled" : false
              },
              "disabled" : true
            },
            "defaults" : {
              "samba" : {
                "enabled" : false
              }
            }
          },
          "hosts" : {
            "standalone_dns" : {
              "enabled" : false
            },
            "simple_vpc_dns" : {
              "enabled" : false
            }
          },
          "mounts" : {
            "sched" : {
              "disabled" : true
            },
            "shared" : {
              "disabled" : true
            },
            "nfs_anfhome" : {
              "type" : "nfs",
              "mountpoint" : "/anfhome",
              "address" : "10.0.2.4",
              "export_path" : "home-1lrygvxz"
            },
            "nfs_sched" : {
              "type" : "nfs",
              "mountpoint" : "/sched",
              "address" : "10.0.2.4",
              "export_path" : "home-1lrygvxz/slurm/config"
            }
          },
          "converge_on_boot" : true
        },
        "keepalive" : {
          "timeout" : 3600
        },
        "cshared" : {
          "server" : {
            "legacy_links_disabled" : true
          }
        },
        "slurm" : {
          "user" : {
            "gid" : 11100,
            "uid" : 11100
          },
          "install" : false,
          "hpc" : true,
          "autoscale" : true,
          "partition" : "ibtest_16",
          "version" : "20.11.9-1",
          "dampen_memory" : 8,
          "use_nodename_as_hostname" : true,
          "accounting" : {
            "enabled" : false
          }
        },
        "munge" : {
          "user" : {
            "gid" : 11101,
            "uid" : 11101
          }
        }
      },
      "Interruptible" : false,
      "KeyPairLocation" : "~/.ssh/cyclecloud.pem",
      "ClusterInitSpecs" : {
        "slurm:default" : {
          "Order" : 1000,
          "Spec" : "default",
          "Name" : "cyclecloud/slurm:default:2.6.2",
          "Project" : "slurm",
          "Version" : "2.6.2",
          "SourceLocker" : "cyclecloud"
        },
        "enroot:default" : {
          "Order" : 1005,
          "Spec" : "default",
          "Name" : "enroot:default:1.0.0",
          "Project" : "enroot",
          "Version" : "1.0.0"
        },
        "slurm:execute" : {
          "Order" : 1002,
          "Spec" : "execute",
          "Name" : "cyclecloud/slurm:execute:2.6.2",
          "Project" : "slurm",
          "Version" : "2.6.2",
          "SourceLocker" : "cyclecloud"
        },
        "common:default" : {
          "Order" : 1001,
          "Spec" : "default",
          "Name" : "common:default:1.0.0",
          "Project" : "common",
          "Version" : "1.0.0"
        }
      },
      "ShutdownPolicy" : "Terminate",
      "ImageName" : "/subscriptions/e7cca478-89b5-4f94-a081-4ad6ad37b08d/resourceGroups/az-hop-dev2/providers/Microsoft.Compute/galleries/azhop_1lrygvxz/images/azhop-centos79-v2-rdma-gpgpu/versions/7.9.220722214",
      "PhaseMap" : {
        "NodeArrayActivation" : {
          "StartTime" : {
            "$date" : "2022-07-22T22:02:47.450+00:00"
          },
          "EndTime" : {
            "$date" : "2022-07-22T22:03:02.126+00:00"
          },
          "Status" : "Completed"
        }
      },
      "TargetState" : "Activated"
    },
    "buckets" : [ {
      "bucketId" : "0502aaf6-0b8b-4bb2-91f1-2727f0ec4699",
      "definition" : {
        "machineType" : "Standard_HB120-16rs_v2"
      },
      "regionalQuotaCount" : 6,
      "regionalQuotaCoreCount" : 770,
      "regionalConsumedCoreCount" : 230,
      "familyQuotaCount" : 2,
      "familyQuotaCoreCount" : 240,
      "familyConsumedCoreCount" : 0,
      "quotaCount" : 2,
      "quotaCoreCount" : 240,
      "consumedCoreCount" : 0,
      "maxCount" : 2,
      "maxCoreCount" : 32,
      "activeCount" : 0,
      "activeCoreCount" : 0,
      "availableCount" : 2,
      "availableCoreCount" : 32,
      "valid" : true,
      "invalidReason" : "",
      "maxPlacementGroupSize" : 100,
      "maxPlacementGroupCoreSize" : 1600,
      "placementGroups" : [ {
        "name" : "ibtest_16-Standard_HB120-16rs_v2-pg0",
        "activeCount" : 2,
        "activeCoreCount" : 32
      } ],
      "virtualMachine" : {
        "vcpuCount" : 16,
        "pcpuCount" : 16,
        "gpuCount" : 0,
        "vcpuQuotaCount" : 120,
        "memory" : 445.31,
        "infiniband" : true
      }
    } ]
  }, {
    "name" : "nc24-high",
    "maxCount" : 10000,
    "maxCoreCount" : 2400,
    "nodearray" : {
      "UsePublicNetwork" : false,
      "Extends" : "nodearraybase",
      "Volumes" : {
        "boot" : {
          "StorageAccountType" : "StandardSSD_LRS",
          "Name" : "boot"
        }
      },
      "SubnetId" : "az-hop-dev2/hpcvnet/compute",
      "Region" : "eastus",
      "Status" : "Activated",
      "State" : "Activated",
      "MaxCoreCount" : 2400,
      "MachineType" : "Standard_NC24",
      "ActivePhases" : [ ],
      "Credentials" : "azure",
      "EnableAcceleratedNetworking" : false,
      "AccountName" : "azure",
      "Configuration" : {
        "cyclecloud" : {
          "exports" : {
            "sched" : {
              "samba" : {
                "enabled" : false
              },
              "disabled" : true
            },
            "shared" : {
              "samba" : {
                "enabled" : false
              },
              "disabled" : true
            },
            "defaults" : {
              "samba" : {
                "enabled" : false
              }
            }
          },
          "hosts" : {
            "standalone_dns" : {
              "enabled" : false
            },
            "simple_vpc_dns" : {
              "enabled" : false
            }
          },
          "mounts" : {
            "sched" : {
              "disabled" : true
            },
            "shared" : {
              "disabled" : true
            },
            "nfs_anfhome" : {
              "type" : "nfs",
              "mountpoint" : "/anfhome",
              "address" : "10.0.2.4",
              "export_path" : "home-1lrygvxz"
            },
            "nfs_sched" : {
              "type" : "nfs",
              "mountpoint" : "/sched",
              "address" : "10.0.2.4",
              "export_path" : "home-1lrygvxz/slurm/config"
            }
          },
          "converge_on_boot" : true
        },
        "cshared" : {
          "server" : {
            "legacy_links_disabled" : true
          }
        },
        "keepalive" : {
          "timeout" : 3600
        },
        "slurm" : {
          "hpc" : true,
          "autoscale" : true,
          "user" : {
            "gid" : 11100,
            "uid" : 11100
          },
          "install" : false,
          "partition" : "nc24-high",
          "version" : "20.11.9-1",
          "dampen_memory" : 8,
          "use_nodename_as_hostname" : true,
          "accounting" : {
            "enabled" : false
          }
        },
        "munge" : {
          "user" : {
            "gid" : 11101,
            "uid" : 11101
          }
        }
      },
      "Interruptible" : false,
      "KeyPairLocation" : "~/.ssh/cyclecloud.pem",
      "ClusterInitSpecs" : {
        "slurm:default" : {
          "Order" : 1000,
          "Spec" : "default",
          "Name" : "cyclecloud/slurm:default:2.6.2",
          "Project" : "slurm",
          "Version" : "2.6.2",
          "SourceLocker" : "cyclecloud"
        },
        "slurm:execute" : {
          "Order" : 1002,
          "Spec" : "execute",
          "Name" : "cyclecloud/slurm:execute:2.6.2",
          "Project" : "slurm",
          "Version" : "2.6.2",
          "SourceLocker" : "cyclecloud"
        },
        "enroot:default" : {
          "Order" : 1004,
          "Spec" : "default",
          "Name" : "enroot:default:1.0.0",
          "Project" : "enroot",
          "Version" : "1.0.0"
        },
        "common:default" : {
          "Order" : 1001,
          "Spec" : "default",
          "Name" : "common:default:1.0.0",
          "Project" : "common",
          "Version" : "1.0.0"
        }
      },
      "ShutdownPolicy" : "Terminate",
      "ImageName" : "/subscriptions/e7cca478-89b5-4f94-a081-4ad6ad37b08d/resourceGroups/az-hop-dev2/providers/Microsoft.Compute/galleries/azhop_1lrygvxz/images/azhop-centos79-v1-rdma-gpgpu/versions/7.9.220722210",
      "PhaseMap" : {
        "NodeArrayActivation" : {
          "StartTime" : {
            "$date" : "2022-07-23T03:40:56.291+00:00"
          },
          "EndTime" : {
            "$date" : "2022-07-23T03:41:11.107+00:00"
          },
          "Status" : "Completed"
        }
      },
      "TargetState" : "Activated"
    },
    "buckets" : [ {
      "bucketId" : "0fe50db5-49b0-4343-b553-945243087743",
      "definition" : {
        "machineType" : "Standard_NC24"
      },
      "regionalQuotaCount" : 32,
      "regionalQuotaCoreCount" : 770,
      "regionalConsumedCoreCount" : 230,
      "familyQuotaCount" : 4,
      "familyQuotaCoreCount" : 100,
      "familyConsumedCoreCount" : 24,
      "quotaCount" : 4,
      "quotaCoreCount" : 100,
      "consumedCoreCount" : 24,
      "maxCount" : 4,
      "maxCoreCount" : 96,
      "activeCount" : 1,
      "activeCoreCount" : 24,
      "availableCount" : 3,
      "availableCoreCount" : 72,
      "valid" : true,
      "invalidReason" : "",
      "maxPlacementGroupSize" : 100,
      "maxPlacementGroupCoreSize" : 2400,
      "placementGroups" : [ {
        "name" : "nc24-high-Standard_NC24-pg0",
        "activeCount" : 4,
        "activeCoreCount" : 96
      } ],
      "virtualMachine" : {
        "vcpuCount" : 24,
        "pcpuCount" : 24,
        "gpuCount" : 4,
        "vcpuQuotaCount" : 24,
        "memory" : 224.0,
        "infiniband" : false
      }
    } ]
  }, {
    "name" : "nc24ads-A100-v4-high",
    "maxCount" : 10000,
    "maxCoreCount" : 2400,
    "nodearray" : {
      "UsePublicNetwork" : false,
      "Extends" : "nodearraybase",
      "Volumes" : {
        "boot" : {
          "StorageAccountType" : "StandardSSD_LRS",
          "Name" : "boot"
        }
      },
      "SubnetId" : "az-hop-dev2/hpcvnet/compute",
      "Region" : "eastus",
      "Status" : "Activated",
      "State" : "Activated",
      "MaxCoreCount" : 2400,
      "MachineType" : "Standard_NC24ads_A100_v4",
      "ActivePhases" : [ ],
      "Credentials" : "azure",
      "EnableAcceleratedNetworking" : false,
      "AccountName" : "azure",
      "Configuration" : {
        "cyclecloud" : {
          "exports" : {
            "sched" : {
              "samba" : {
                "enabled" : false
              },
              "disabled" : true
            },
            "shared" : {
              "samba" : {
                "enabled" : false
              },
              "disabled" : true
            },
            "defaults" : {
              "samba" : {
                "enabled" : false
              }
            }
          },
          "hosts" : {
            "standalone_dns" : {
              "enabled" : false
            },
            "simple_vpc_dns" : {
              "enabled" : false
            }
          },
          "mounts" : {
            "sched" : {
              "disabled" : true
            },
            "shared" : {
              "disabled" : true
            },
            "nfs_anfhome" : {
              "type" : "nfs",
              "mountpoint" : "/anfhome",
              "address" : "10.0.2.4",
              "export_path" : "home-1lrygvxz"
            },
            "nfs_sched" : {
              "type" : "nfs",
              "mountpoint" : "/sched",
              "address" : "10.0.2.4",
              "export_path" : "home-1lrygvxz/slurm/config"
            }
          },
          "converge_on_boot" : true
        },
        "cshared" : {
          "server" : {
            "legacy_links_disabled" : true
          }
        },
        "keepalive" : {
          "timeout" : 3600
        },
        "slurm" : {
          "hpc" : true,
          "autoscale" : true,
          "user" : {
            "gid" : 11100,
            "uid" : 11100
          },
          "install" : false,
          "partition" : "nc24ads-A100-v4-high",
          "version" : "20.11.9-1",
          "dampen_memory" : 8,
          "use_nodename_as_hostname" : true,
          "accounting" : {
            "enabled" : false
          }
        },
        "munge" : {
          "user" : {
            "gid" : 11101,
            "uid" : 11101
          }
        }
      },
      "Interruptible" : false,
      "KeyPairLocation" : "~/.ssh/cyclecloud.pem",
      "ClusterInitSpecs" : {
        "slurm:default" : {
          "Order" : 1000,
          "Spec" : "default",
          "Name" : "cyclecloud/slurm:default:2.6.2",
          "Project" : "slurm",
          "Version" : "2.6.2",
          "SourceLocker" : "cyclecloud"
        },
        "slurm:execute" : {
          "Order" : 1002,
          "Spec" : "execute",
          "Name" : "cyclecloud/slurm:execute:2.6.2",
          "Project" : "slurm",
          "Version" : "2.6.2",
          "SourceLocker" : "cyclecloud"
        },
        "enroot:default" : {
          "Order" : 1004,
          "Spec" : "default",
          "Name" : "enroot:default:1.0.0",
          "Project" : "enroot",
          "Version" : "1.0.0"
        },
        "common:default" : {
          "Order" : 1001,
          "Spec" : "default",
          "Name" : "common:default:1.0.0",
          "Project" : "common",
          "Version" : "1.0.0"
        }
      },
      "ShutdownPolicy" : "Terminate",
      "ImageName" : "/subscriptions/e7cca478-89b5-4f94-a081-4ad6ad37b08d/resourceGroups/az-hop-dev2/providers/Microsoft.Compute/galleries/azhop_1lrygvxz/images/azhop-centos79-v2-rdma-gpgpu/versions/7.9.220722214",
      "PhaseMap" : {
        "NodeArrayActivation" : {
          "StartTime" : {
            "$date" : "2022-07-23T04:00:09.914+00:00"
          },
          "EndTime" : {
            "$date" : "2022-07-23T04:00:25.076+00:00"
          },
          "Status" : "Completed"
        }
      },
      "TargetState" : "Activated"
    },
    "buckets" : [ {
      "bucketId" : "cec4286f-0a26-4001-bf85-fb42ae700d86",
      "definition" : {
        "machineType" : "Standard_NC24ads_A100_v4"
      },
      "regionalQuotaCount" : 32,
      "regionalQuotaCoreCount" : 770,
      "regionalConsumedCoreCount" : 230,
      "familyQuotaCount" : 1,
      "familyQuotaCoreCount" : 24,
      "familyConsumedCoreCount" : 24,
      "quotaCount" : 1,
      "quotaCoreCount" : 24,
      "consumedCoreCount" : 24,
      "maxCount" : 1,
      "maxCoreCount" : 24,
      "activeCount" : 1,
      "activeCoreCount" : 24,
      "availableCount" : 0,
      "availableCoreCount" : 0,
      "valid" : true,
      "invalidReason" : "",
      "maxPlacementGroupSize" : 100,
      "maxPlacementGroupCoreSize" : 2400,
      "placementGroups" : [ {
        "name" : "nc24ads-A100-v4-high-Standard_NC24ads_A100_v4-pg0",
        "activeCount" : 1,
        "activeCoreCount" : 24
      } ],
      "virtualMachine" : {
        "vcpuCount" : 24,
        "pcpuCount" : 24,
        "gpuCount" : 1,
        "vcpuQuotaCount" : 24,
        "memory" : 220.0,
        "infiniband" : false
      }
    } ]
  }, {
    "name" : "test_d4",
    "maxCount" : 10000,
    "maxCoreCount" : 2400,
    "nodearray" : {
      "UsePublicNetwork" : false,
      "Extends" : "nodearraybase",
      "Volumes" : {
        "boot" : {
          "StorageAccountType" : "StandardSSD_LRS",
          "Name" : "boot"
        }
      },
      "SubnetId" : "az-hop-dev2/hpcvnet/compute",
      "Region" : "eastus",
      "Status" : "Activated",
      "State" : "Activated",
      "MaxCoreCount" : 2400,
      "MachineType" : "Standard_D4s_v5",
      "ActivePhases" : [ ],
      "Credentials" : "azure",
      "EnableAcceleratedNetworking" : true,
      "AccountName" : "azure",
      "Configuration" : {
        "cyclecloud" : {
          "exports" : {
            "sched" : {
              "samba" : {
                "enabled" : false
              },
              "disabled" : true
            },
            "shared" : {
              "samba" : {
                "enabled" : false
              },
              "disabled" : true
            },
            "defaults" : {
              "samba" : {
                "enabled" : false
              }
            }
          },
          "hosts" : {
            "standalone_dns" : {
              "enabled" : false
            },
            "simple_vpc_dns" : {
              "enabled" : false
            }
          },
          "mounts" : {
            "sched" : {
              "disabled" : true
            },
            "shared" : {
              "disabled" : true
            },
            "nfs_anfhome" : {
              "type" : "nfs",
              "mountpoint" : "/anfhome",
              "address" : "10.0.2.4",
              "export_path" : "home-1lrygvxz"
            },
            "nfs_sched" : {
              "type" : "nfs",
              "mountpoint" : "/sched",
              "address" : "10.0.2.4",
              "export_path" : "home-1lrygvxz/slurm/config"
            }
          },
          "converge_on_boot" : true
        },
        "keepalive" : {
          "timeout" : 3600
        },
        "cshared" : {
          "server" : {
            "legacy_links_disabled" : true
          }
        },
        "slurm" : {
          "default_partition" : true,
          "user" : {
            "gid" : 11100,
            "uid" : 11100
          },
          "install" : false,
          "hpc" : true,
          "autoscale" : true,
          "partition" : "test_d4",
          "version" : "20.11.9-1",
          "dampen_memory" : 8,
          "use_nodename_as_hostname" : true,
          "accounting" : {
            "enabled" : false
          }
        },
        "munge" : {
          "user" : {
            "gid" : 11101,
            "uid" : 11101
          }
        }
      },
      "Interruptible" : false,
      "KeyPairLocation" : "~/.ssh/cyclecloud.pem",
      "ClusterInitSpecs" : {
        "slurm:default" : {
          "Order" : 1000,
          "Spec" : "default",
          "Name" : "cyclecloud/slurm:default:2.6.2",
          "Project" : "slurm",
          "Version" : "2.6.2",
          "SourceLocker" : "cyclecloud"
        },
        "slurm:execute" : {
          "Order" : 1002,
          "Spec" : "execute",
          "Name" : "cyclecloud/slurm:execute:2.6.2",
          "Project" : "slurm",
          "Version" : "2.6.2",
          "SourceLocker" : "cyclecloud"
        },
        "enroot:default" : {
          "Order" : 1003,
          "Spec" : "default",
          "Name" : "enroot:default:1.0.0",
          "Project" : "enroot",
          "Version" : "1.0.0"
        },
        "common:default" : {
          "Order" : 1001,
          "Spec" : "default",
          "Name" : "common:default:1.0.0",
          "Project" : "common",
          "Version" : "1.0.0"
        }
      },
      "ShutdownPolicy" : "Terminate",
      "ImageName" : "/subscriptions/e7cca478-89b5-4f94-a081-4ad6ad37b08d/resourceGroups/az-hop-dev2/providers/Microsoft.Compute/galleries/azhop_1lrygvxz/images/azhop-centos79-v2-rdma-gpgpu/versions/7.9.220722214",
      "PhaseMap" : {
        "NodeArrayActivation" : {
          "StartTime" : {
            "$date" : "2022-07-22T22:02:47.450+00:00"
          },
          "EndTime" : {
            "$date" : "2022-07-22T22:03:02.126+00:00"
          },
          "Status" : "Completed"
        }
      },
      "TargetState" : "Activated"
    },
    "buckets" : [ {
      "bucketId" : "7bd072fa-22b1-4de9-b0b2-8b6bcab2c08d",
      "definition" : {
        "machineType" : "Standard_D4s_v5"
      },
      "regionalQuotaCount" : 192,
      "regionalQuotaCoreCount" : 770,
      "regionalConsumedCoreCount" : 230,
      "familyQuotaCount" : 25,
      "familyQuotaCoreCount" : 100,
      "familyConsumedCoreCount" : 4,
      "quotaCount" : 25,
      "quotaCoreCount" : 100,
      "consumedCoreCount" : 4,
      "maxCount" : 25,
      "maxCoreCount" : 100,
      "activeCount" : 0,
      "activeCoreCount" : 0,
      "availableCount" : 24,
      "availableCoreCount" : 96,
      "valid" : true,
      "invalidReason" : "",
      "maxPlacementGroupSize" : 100,
      "maxPlacementGroupCoreSize" : 400,
      "placementGroups" : [ {
        "name" : "test_d4-Standard_D4s_v5-pg0",
        "activeCount" : 24,
        "activeCoreCount" : 96
      } ],
      "virtualMachine" : {
        "vcpuCount" : 4,
        "pcpuCount" : 2,
        "gpuCount" : 0,
        "vcpuQuotaCount" : 4,
        "memory" : 16.0,
        "infiniband" : false
      }
    } ]
  } ]
}
xpillons commented 1 year ago

@matt-chan I know it's not ideal today, but deleting a cluster may have an impact on running jobs, so we have to validate this before implementing it. az-hop relies on the CycleCloud implementation as well, so unless they change their implementation we may have to delete the cluster.

As a temporary workaround, you can log in to the ccportal VM and remove the cluster with the cyclecloud CLI.

matt-chan commented 1 year ago

Hi @xpillons, we recently ran into an issue related to this bug. If we want to upload new versions of images, it appears that the only way to make cyclecloud use them is to reload the queues ('latest' doesn't seem to refresh until the cluster is reloaded).

Could you share a bit more information about how you make updates to the queues in your own workflow, so I can formulate a workaround for us please? Do you only add new queues, and never remove/modify old queues?

Right now we're restarting the entire cluster, which is acceptable because it is unlikely we will modify the queues in production, but if we have to do it every time the image changes, it would be a problem for us.

Also, do you have information on where to find the cyclecloud server source code? I can only find the plugins, so I've been reading the live deployment source code with Vim and grep, which is really not fun. Feel free to email me if the repo is internal-only. Thanks for your help!

xpillons commented 1 year ago

Hi @matt-chan, a new image ID can be assigned by running the cccluster and scheduler playbooks. For that new image to be picked up by new nodes, all nodes in the same scaleset need to be torn down. You usually keep the same queue names, as this makes life easier for users submitting jobs. We haven't found an easier way of doing it at the moment, but one could be to :

please contact me internally