Open matt-chan opened 1 year ago
After a bit of debugging, it looks like it might be related to the way that cyclecloud-slurm parses information from the cluster. I'm not sure where the root cause is because I get lost after an API call.
I believe the offending function is here: https://github.com/Azure/cyclecloud-slurm/blob/7b26c3f8bd8180eb0f16fc6c15db17f1fb42ba4d/specs/default/chef/site-cookbooks/slurm/files/default/cyclecloud_slurm.py#L348
Also, an easier way to generate the offending cyclecloud.conf is to run /opt/cycle/slurm/cyclecloud_slurm.sh slurm_conf
Okay, I did more debugging and I think I found the source of the bug.
The ansible scripts and cyclecloud are both mostly working as expected. The issue is that when the cyclecloud_cluster playbook calls cyclecloud to generate the cluster, it uses import_cluster: https://github.com/Azure/az-hop/blob/a61e60c828102399047a1723cd1ce5dc8e66c540/playbooks/roles/cyclecloud_cluster/tasks/main.yml#L145
This adds any new queues to the existing cyclecloud cluster and does not destroy any of the existing ones. I know the --force help text says it will destroy and recreate, but it isn't happening. The template config files only show the new queues (attached, see below), while the cluster status API (also attached) reports the set-union of all queues ever defined since cccluster was first installed.
@xpillons, would it be possible to change this behavior to simply destroy the cluster and recreate it please? Or ideally just to destroy the queues which are no longer defined in config.yml (I know this is way harder).
# azhop-slurm.txt
...
[[node nodearraybase]]
Abstract = true
[[[configuration]]]
slurm.autoscale = true
#slurm.node_prefix = ${ifThenElse(NodeNamePrefix=="Cluster Prefix", StrJoin("-", ClusterName, ""), NodeNamePrefix)}
slurm.use_nodename_as_hostname = true
slurm.dampen_memory = 8 # Reservation of 8% of the node's memory
[[[cluster-init cyclecloud/slurm:execute:2.6.2]]]
[[nodearray test_d4]]
Extends = nodearraybase
MachineType = Standard_D4s_v5
MaxCoreCount = 2400
EnableAcceleratedNetworking = True
# Lookup image version for that queue
ImageName = /subscriptions/e7cca478-89b5-4f94-a081-4ad6ad37b08d/resourceGroups/az-hop-dev2/providers/Microsoft.Compute/galleries/azhop_1lrygvxz/images/azhop-centos79-v2-rdma-gpgpu/versions/7.9.220722214
[[[configuration]]]
slurm.partition = test_d4
slurm.default_partition = true
slurm.hpc = true
[[[cluster-init enroot:default:1.0.0]]]
[[nodearray nc24ads-A100-v4-high]]
Extends = nodearraybase
MachineType = Standard_NC24ads_A100_v4
MaxCoreCount = 2400
# Lookup image version for that queue
ImageName = /subscriptions/e7cca478-89b5-4f94-a081-4ad6ad37b08d/resourceGroups/az-hop-dev2/providers/Microsoft.Compute/galleries/azhop_1lrygvxz/images/azhop-centos79-v2-rdma-gpgpu/versions/7.9.220722214
[[[configuration]]]
slurm.partition = nc24ads-A100-v4-high
slurm.hpc = true
[[[cluster-init enroot:default:1.0.0]]]
[[nodearray ibtest_16]]
Extends = nodearraybase
MachineType = Standard_HB120-16rs_v2
MaxCoreCount = 12000
EnableAcceleratedNetworking = True
# Lookup image version for that queue
ImageName = /subscriptions/e7cca478-89b5-4f94-a081-4ad6ad37b08d/resourceGroups/az-hop-dev2/providers/Microsoft.Compute/galleries/azhop_1lrygvxz/images/azhop-centos79-v2-rdma-gpgpu/versions/7.9.220722214
[[[configuration]]]
slurm.partition = ibtest_16
slurm.hpc = True
[[[cluster-init enroot:default:1.0.0]]]
# cluster/slurm1/status API
{
"state" : "Started",
"targetState" : "Started",
"maxCount" : 10000,
"maxCoreCount" : 1000000,
"nodearrays" : [ {
"name" : "ibtest_16",
"maxCount" : 10000,
"maxCoreCount" : 12000,
"nodearray" : {
"UsePublicNetwork" : false,
"Extends" : "nodearraybase",
"Volumes" : {
"boot" : {
"StorageAccountType" : "StandardSSD_LRS",
"Name" : "boot"
}
},
"SubnetId" : "az-hop-dev2/hpcvnet/compute",
"Region" : "eastus",
"Status" : "Activated",
"State" : "Activated",
"MaxCoreCount" : 12000,
"MachineType" : "Standard_HB120-16rs_v2",
"ActivePhases" : [ ],
"Credentials" : "azure",
"EnableAcceleratedNetworking" : true,
"AccountName" : "azure",
"Configuration" : {
"cyclecloud" : {
"exports" : {
"sched" : {
"samba" : {
"enabled" : false
},
"disabled" : true
},
"shared" : {
"samba" : {
"enabled" : false
},
"disabled" : true
},
"defaults" : {
"samba" : {
"enabled" : false
}
}
},
"hosts" : {
"standalone_dns" : {
"enabled" : false
},
"simple_vpc_dns" : {
"enabled" : false
}
},
"mounts" : {
"sched" : {
"disabled" : true
},
"shared" : {
"disabled" : true
},
"nfs_anfhome" : {
"type" : "nfs",
"mountpoint" : "/anfhome",
"address" : "10.0.2.4",
"export_path" : "home-1lrygvxz"
},
"nfs_sched" : {
"type" : "nfs",
"mountpoint" : "/sched",
"address" : "10.0.2.4",
"export_path" : "home-1lrygvxz/slurm/config"
}
},
"converge_on_boot" : true
},
"keepalive" : {
"timeout" : 3600
},
"cshared" : {
"server" : {
"legacy_links_disabled" : true
}
},
"slurm" : {
"user" : {
"gid" : 11100,
"uid" : 11100
},
"install" : false,
"hpc" : true,
"autoscale" : true,
"partition" : "ibtest_16",
"version" : "20.11.9-1",
"dampen_memory" : 8,
"use_nodename_as_hostname" : true,
"accounting" : {
"enabled" : false
}
},
"munge" : {
"user" : {
"gid" : 11101,
"uid" : 11101
}
}
},
"Interruptible" : false,
"KeyPairLocation" : "~/.ssh/cyclecloud.pem",
"ClusterInitSpecs" : {
"slurm:default" : {
"Order" : 1000,
"Spec" : "default",
"Name" : "cyclecloud/slurm:default:2.6.2",
"Project" : "slurm",
"Version" : "2.6.2",
"SourceLocker" : "cyclecloud"
},
"enroot:default" : {
"Order" : 1005,
"Spec" : "default",
"Name" : "enroot:default:1.0.0",
"Project" : "enroot",
"Version" : "1.0.0"
},
"slurm:execute" : {
"Order" : 1002,
"Spec" : "execute",
"Name" : "cyclecloud/slurm:execute:2.6.2",
"Project" : "slurm",
"Version" : "2.6.2",
"SourceLocker" : "cyclecloud"
},
"common:default" : {
"Order" : 1001,
"Spec" : "default",
"Name" : "common:default:1.0.0",
"Project" : "common",
"Version" : "1.0.0"
}
},
"ShutdownPolicy" : "Terminate",
"ImageName" : "/subscriptions/e7cca478-89b5-4f94-a081-4ad6ad37b08d/resourceGroups/az-hop-dev2/providers/Microsoft.Compute/galleries/azhop_1lrygvxz/images/azhop-centos79-v2-rdma-gpgpu/versions/7.9.220722214",
"PhaseMap" : {
"NodeArrayActivation" : {
"StartTime" : {
"$date" : "2022-07-22T22:02:47.450+00:00"
},
"EndTime" : {
"$date" : "2022-07-22T22:03:02.126+00:00"
},
"Status" : "Completed"
}
},
"TargetState" : "Activated"
},
"buckets" : [ {
"bucketId" : "0502aaf6-0b8b-4bb2-91f1-2727f0ec4699",
"definition" : {
"machineType" : "Standard_HB120-16rs_v2"
},
"regionalQuotaCount" : 6,
"regionalQuotaCoreCount" : 770,
"regionalConsumedCoreCount" : 230,
"familyQuotaCount" : 2,
"familyQuotaCoreCount" : 240,
"familyConsumedCoreCount" : 0,
"quotaCount" : 2,
"quotaCoreCount" : 240,
"consumedCoreCount" : 0,
"maxCount" : 2,
"maxCoreCount" : 32,
"activeCount" : 0,
"activeCoreCount" : 0,
"availableCount" : 2,
"availableCoreCount" : 32,
"valid" : true,
"invalidReason" : "",
"maxPlacementGroupSize" : 100,
"maxPlacementGroupCoreSize" : 1600,
"placementGroups" : [ {
"name" : "ibtest_16-Standard_HB120-16rs_v2-pg0",
"activeCount" : 2,
"activeCoreCount" : 32
} ],
"virtualMachine" : {
"vcpuCount" : 16,
"pcpuCount" : 16,
"gpuCount" : 0,
"vcpuQuotaCount" : 120,
"memory" : 445.31,
"infiniband" : true
}
} ]
}, {
"name" : "nc24-high",
"maxCount" : 10000,
"maxCoreCount" : 2400,
"nodearray" : {
"UsePublicNetwork" : false,
"Extends" : "nodearraybase",
"Volumes" : {
"boot" : {
"StorageAccountType" : "StandardSSD_LRS",
"Name" : "boot"
}
},
"SubnetId" : "az-hop-dev2/hpcvnet/compute",
"Region" : "eastus",
"Status" : "Activated",
"State" : "Activated",
"MaxCoreCount" : 2400,
"MachineType" : "Standard_NC24",
"ActivePhases" : [ ],
"Credentials" : "azure",
"EnableAcceleratedNetworking" : false,
"AccountName" : "azure",
"Configuration" : {
"cyclecloud" : {
"exports" : {
"sched" : {
"samba" : {
"enabled" : false
},
"disabled" : true
},
"shared" : {
"samba" : {
"enabled" : false
},
"disabled" : true
},
"defaults" : {
"samba" : {
"enabled" : false
}
}
},
"hosts" : {
"standalone_dns" : {
"enabled" : false
},
"simple_vpc_dns" : {
"enabled" : false
}
},
"mounts" : {
"sched" : {
"disabled" : true
},
"shared" : {
"disabled" : true
},
"nfs_anfhome" : {
"type" : "nfs",
"mountpoint" : "/anfhome",
"address" : "10.0.2.4",
"export_path" : "home-1lrygvxz"
},
"nfs_sched" : {
"type" : "nfs",
"mountpoint" : "/sched",
"address" : "10.0.2.4",
"export_path" : "home-1lrygvxz/slurm/config"
}
},
"converge_on_boot" : true
},
"cshared" : {
"server" : {
"legacy_links_disabled" : true
}
},
"keepalive" : {
"timeout" : 3600
},
"slurm" : {
"hpc" : true,
"autoscale" : true,
"user" : {
"gid" : 11100,
"uid" : 11100
},
"install" : false,
"partition" : "nc24-high",
"version" : "20.11.9-1",
"dampen_memory" : 8,
"use_nodename_as_hostname" : true,
"accounting" : {
"enabled" : false
}
},
"munge" : {
"user" : {
"gid" : 11101,
"uid" : 11101
}
}
},
"Interruptible" : false,
"KeyPairLocation" : "~/.ssh/cyclecloud.pem",
"ClusterInitSpecs" : {
"slurm:default" : {
"Order" : 1000,
"Spec" : "default",
"Name" : "cyclecloud/slurm:default:2.6.2",
"Project" : "slurm",
"Version" : "2.6.2",
"SourceLocker" : "cyclecloud"
},
"slurm:execute" : {
"Order" : 1002,
"Spec" : "execute",
"Name" : "cyclecloud/slurm:execute:2.6.2",
"Project" : "slurm",
"Version" : "2.6.2",
"SourceLocker" : "cyclecloud"
},
"enroot:default" : {
"Order" : 1004,
"Spec" : "default",
"Name" : "enroot:default:1.0.0",
"Project" : "enroot",
"Version" : "1.0.0"
},
"common:default" : {
"Order" : 1001,
"Spec" : "default",
"Name" : "common:default:1.0.0",
"Project" : "common",
"Version" : "1.0.0"
}
},
"ShutdownPolicy" : "Terminate",
"ImageName" : "/subscriptions/e7cca478-89b5-4f94-a081-4ad6ad37b08d/resourceGroups/az-hop-dev2/providers/Microsoft.Compute/galleries/azhop_1lrygvxz/images/azhop-centos79-v1-rdma-gpgpu/versions/7.9.220722210",
"PhaseMap" : {
"NodeArrayActivation" : {
"StartTime" : {
"$date" : "2022-07-23T03:40:56.291+00:00"
},
"EndTime" : {
"$date" : "2022-07-23T03:41:11.107+00:00"
},
"Status" : "Completed"
}
},
"TargetState" : "Activated"
},
"buckets" : [ {
"bucketId" : "0fe50db5-49b0-4343-b553-945243087743",
"definition" : {
"machineType" : "Standard_NC24"
},
"regionalQuotaCount" : 32,
"regionalQuotaCoreCount" : 770,
"regionalConsumedCoreCount" : 230,
"familyQuotaCount" : 4,
"familyQuotaCoreCount" : 100,
"familyConsumedCoreCount" : 24,
"quotaCount" : 4,
"quotaCoreCount" : 100,
"consumedCoreCount" : 24,
"maxCount" : 4,
"maxCoreCount" : 96,
"activeCount" : 1,
"activeCoreCount" : 24,
"availableCount" : 3,
"availableCoreCount" : 72,
"valid" : true,
"invalidReason" : "",
"maxPlacementGroupSize" : 100,
"maxPlacementGroupCoreSize" : 2400,
"placementGroups" : [ {
"name" : "nc24-high-Standard_NC24-pg0",
"activeCount" : 4,
"activeCoreCount" : 96
} ],
"virtualMachine" : {
"vcpuCount" : 24,
"pcpuCount" : 24,
"gpuCount" : 4,
"vcpuQuotaCount" : 24,
"memory" : 224.0,
"infiniband" : false
}
} ]
}, {
"name" : "nc24ads-A100-v4-high",
"maxCount" : 10000,
"maxCoreCount" : 2400,
"nodearray" : {
"UsePublicNetwork" : false,
"Extends" : "nodearraybase",
"Volumes" : {
"boot" : {
"StorageAccountType" : "StandardSSD_LRS",
"Name" : "boot"
}
},
"SubnetId" : "az-hop-dev2/hpcvnet/compute",
"Region" : "eastus",
"Status" : "Activated",
"State" : "Activated",
"MaxCoreCount" : 2400,
"MachineType" : "Standard_NC24ads_A100_v4",
"ActivePhases" : [ ],
"Credentials" : "azure",
"EnableAcceleratedNetworking" : false,
"AccountName" : "azure",
"Configuration" : {
"cyclecloud" : {
"exports" : {
"sched" : {
"samba" : {
"enabled" : false
},
"disabled" : true
},
"shared" : {
"samba" : {
"enabled" : false
},
"disabled" : true
},
"defaults" : {
"samba" : {
"enabled" : false
}
}
},
"hosts" : {
"standalone_dns" : {
"enabled" : false
},
"simple_vpc_dns" : {
"enabled" : false
}
},
"mounts" : {
"sched" : {
"disabled" : true
},
"shared" : {
"disabled" : true
},
"nfs_anfhome" : {
"type" : "nfs",
"mountpoint" : "/anfhome",
"address" : "10.0.2.4",
"export_path" : "home-1lrygvxz"
},
"nfs_sched" : {
"type" : "nfs",
"mountpoint" : "/sched",
"address" : "10.0.2.4",
"export_path" : "home-1lrygvxz/slurm/config"
}
},
"converge_on_boot" : true
},
"cshared" : {
"server" : {
"legacy_links_disabled" : true
}
},
"keepalive" : {
"timeout" : 3600
},
"slurm" : {
"hpc" : true,
"autoscale" : true,
"user" : {
"gid" : 11100,
"uid" : 11100
},
"install" : false,
"partition" : "nc24ads-A100-v4-high",
"version" : "20.11.9-1",
"dampen_memory" : 8,
"use_nodename_as_hostname" : true,
"accounting" : {
"enabled" : false
}
},
"munge" : {
"user" : {
"gid" : 11101,
"uid" : 11101
}
}
},
"Interruptible" : false,
"KeyPairLocation" : "~/.ssh/cyclecloud.pem",
"ClusterInitSpecs" : {
"slurm:default" : {
"Order" : 1000,
"Spec" : "default",
"Name" : "cyclecloud/slurm:default:2.6.2",
"Project" : "slurm",
"Version" : "2.6.2",
"SourceLocker" : "cyclecloud"
},
"slurm:execute" : {
"Order" : 1002,
"Spec" : "execute",
"Name" : "cyclecloud/slurm:execute:2.6.2",
"Project" : "slurm",
"Version" : "2.6.2",
"SourceLocker" : "cyclecloud"
},
"enroot:default" : {
"Order" : 1004,
"Spec" : "default",
"Name" : "enroot:default:1.0.0",
"Project" : "enroot",
"Version" : "1.0.0"
},
"common:default" : {
"Order" : 1001,
"Spec" : "default",
"Name" : "common:default:1.0.0",
"Project" : "common",
"Version" : "1.0.0"
}
},
"ShutdownPolicy" : "Terminate",
"ImageName" : "/subscriptions/e7cca478-89b5-4f94-a081-4ad6ad37b08d/resourceGroups/az-hop-dev2/providers/Microsoft.Compute/galleries/azhop_1lrygvxz/images/azhop-centos79-v2-rdma-gpgpu/versions/7.9.220722214",
"PhaseMap" : {
"NodeArrayActivation" : {
"StartTime" : {
"$date" : "2022-07-23T04:00:09.914+00:00"
},
"EndTime" : {
"$date" : "2022-07-23T04:00:25.076+00:00"
},
"Status" : "Completed"
}
},
"TargetState" : "Activated"
},
"buckets" : [ {
"bucketId" : "cec4286f-0a26-4001-bf85-fb42ae700d86",
"definition" : {
"machineType" : "Standard_NC24ads_A100_v4"
},
"regionalQuotaCount" : 32,
"regionalQuotaCoreCount" : 770,
"regionalConsumedCoreCount" : 230,
"familyQuotaCount" : 1,
"familyQuotaCoreCount" : 24,
"familyConsumedCoreCount" : 24,
"quotaCount" : 1,
"quotaCoreCount" : 24,
"consumedCoreCount" : 24,
"maxCount" : 1,
"maxCoreCount" : 24,
"activeCount" : 1,
"activeCoreCount" : 24,
"availableCount" : 0,
"availableCoreCount" : 0,
"valid" : true,
"invalidReason" : "",
"maxPlacementGroupSize" : 100,
"maxPlacementGroupCoreSize" : 2400,
"placementGroups" : [ {
"name" : "nc24ads-A100-v4-high-Standard_NC24ads_A100_v4-pg0",
"activeCount" : 1,
"activeCoreCount" : 24
} ],
"virtualMachine" : {
"vcpuCount" : 24,
"pcpuCount" : 24,
"gpuCount" : 1,
"vcpuQuotaCount" : 24,
"memory" : 220.0,
"infiniband" : false
}
} ]
}, {
"name" : "test_d4",
"maxCount" : 10000,
"maxCoreCount" : 2400,
"nodearray" : {
"UsePublicNetwork" : false,
"Extends" : "nodearraybase",
"Volumes" : {
"boot" : {
"StorageAccountType" : "StandardSSD_LRS",
"Name" : "boot"
}
},
"SubnetId" : "az-hop-dev2/hpcvnet/compute",
"Region" : "eastus",
"Status" : "Activated",
"State" : "Activated",
"MaxCoreCount" : 2400,
"MachineType" : "Standard_D4s_v5",
"ActivePhases" : [ ],
"Credentials" : "azure",
"EnableAcceleratedNetworking" : true,
"AccountName" : "azure",
"Configuration" : {
"cyclecloud" : {
"exports" : {
"sched" : {
"samba" : {
"enabled" : false
},
"disabled" : true
},
"shared" : {
"samba" : {
"enabled" : false
},
"disabled" : true
},
"defaults" : {
"samba" : {
"enabled" : false
}
}
},
"hosts" : {
"standalone_dns" : {
"enabled" : false
},
"simple_vpc_dns" : {
"enabled" : false
}
},
"mounts" : {
"sched" : {
"disabled" : true
},
"shared" : {
"disabled" : true
},
"nfs_anfhome" : {
"type" : "nfs",
"mountpoint" : "/anfhome",
"address" : "10.0.2.4",
"export_path" : "home-1lrygvxz"
},
"nfs_sched" : {
"type" : "nfs",
"mountpoint" : "/sched",
"address" : "10.0.2.4",
"export_path" : "home-1lrygvxz/slurm/config"
}
},
"converge_on_boot" : true
},
"keepalive" : {
"timeout" : 3600
},
"cshared" : {
"server" : {
"legacy_links_disabled" : true
}
},
"slurm" : {
"default_partition" : true,
"user" : {
"gid" : 11100,
"uid" : 11100
},
"install" : false,
"hpc" : true,
"autoscale" : true,
"partition" : "test_d4",
"version" : "20.11.9-1",
"dampen_memory" : 8,
"use_nodename_as_hostname" : true,
"accounting" : {
"enabled" : false
}
},
"munge" : {
"user" : {
"gid" : 11101,
"uid" : 11101
}
}
},
"Interruptible" : false,
"KeyPairLocation" : "~/.ssh/cyclecloud.pem",
"ClusterInitSpecs" : {
"slurm:default" : {
"Order" : 1000,
"Spec" : "default",
"Name" : "cyclecloud/slurm:default:2.6.2",
"Project" : "slurm",
"Version" : "2.6.2",
"SourceLocker" : "cyclecloud"
},
"slurm:execute" : {
"Order" : 1002,
"Spec" : "execute",
"Name" : "cyclecloud/slurm:execute:2.6.2",
"Project" : "slurm",
"Version" : "2.6.2",
"SourceLocker" : "cyclecloud"
},
"enroot:default" : {
"Order" : 1003,
"Spec" : "default",
"Name" : "enroot:default:1.0.0",
"Project" : "enroot",
"Version" : "1.0.0"
},
"common:default" : {
"Order" : 1001,
"Spec" : "default",
"Name" : "common:default:1.0.0",
"Project" : "common",
"Version" : "1.0.0"
}
},
"ShutdownPolicy" : "Terminate",
"ImageName" : "/subscriptions/e7cca478-89b5-4f94-a081-4ad6ad37b08d/resourceGroups/az-hop-dev2/providers/Microsoft.Compute/galleries/azhop_1lrygvxz/images/azhop-centos79-v2-rdma-gpgpu/versions/7.9.220722214",
"PhaseMap" : {
"NodeArrayActivation" : {
"StartTime" : {
"$date" : "2022-07-22T22:02:47.450+00:00"
},
"EndTime" : {
"$date" : "2022-07-22T22:03:02.126+00:00"
},
"Status" : "Completed"
}
},
"TargetState" : "Activated"
},
"buckets" : [ {
"bucketId" : "7bd072fa-22b1-4de9-b0b2-8b6bcab2c08d",
"definition" : {
"machineType" : "Standard_D4s_v5"
},
"regionalQuotaCount" : 192,
"regionalQuotaCoreCount" : 770,
"regionalConsumedCoreCount" : 230,
"familyQuotaCount" : 25,
"familyQuotaCoreCount" : 100,
"familyConsumedCoreCount" : 4,
"quotaCount" : 25,
"quotaCoreCount" : 100,
"consumedCoreCount" : 4,
"maxCount" : 25,
"maxCoreCount" : 100,
"activeCount" : 0,
"activeCoreCount" : 0,
"availableCount" : 24,
"availableCoreCount" : 96,
"valid" : true,
"invalidReason" : "",
"maxPlacementGroupSize" : 100,
"maxPlacementGroupCoreSize" : 400,
"placementGroups" : [ {
"name" : "test_d4-Standard_D4s_v5-pg0",
"activeCount" : 24,
"activeCoreCount" : 96
} ],
"virtualMachine" : {
"vcpuCount" : 4,
"pcpuCount" : 2,
"gpuCount" : 0,
"vcpuQuotaCount" : 4,
"memory" : 16.0,
"infiniband" : false
}
} ]
} ]
}
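The mismatch between the two attachments can be checked mechanically. The following is a minimal sketch, not part of az-hop or cyclecloud-slurm: it assumes the template text and the status API response are already available as strings, and the helper names (`template_nodearrays`, `status_nodearrays`, `stale_nodearrays`) are made up for illustration. The sample data only reuses the nodearray names visible in the attachments above.

```python
import json
import re

# Matches template section headers such as "[[nodearray test_d4]]".
# "[[node nodearraybase]]" and "[[[configuration]]]" do not match.
NODEARRAY_HEADER = re.compile(
    r"^\s*\[\[nodearray\s+([^\]\s]+)\s*\]\]\s*$", re.MULTILINE
)


def template_nodearrays(template_text):
    """Return the nodearray names declared in a cluster template."""
    return set(NODEARRAY_HEADER.findall(template_text))


def status_nodearrays(status_json):
    """Return the nodearray names reported by the cluster status API."""
    status = json.loads(status_json)
    return {entry["name"] for entry in status.get("nodearrays", [])}


def stale_nodearrays(template_text, status_json):
    """Names the cluster still reports but the template no longer defines."""
    return sorted(
        status_nodearrays(status_json) - template_nodearrays(template_text)
    )


# Reduced copies of the two attachments: section headers and names only.
template = """
[[node nodearraybase]]
[[nodearray test_d4]]
[[nodearray nc24ads-A100-v4-high]]
[[nodearray ibtest_16]]
"""
status = json.dumps({"nodearrays": [
    {"name": "ibtest_16"},
    {"name": "nc24-high"},
    {"name": "nc24ads-A100-v4-high"},
    {"name": "test_d4"},
]})

print(stale_nodearrays(template, status))  # ['nc24-high']
```

With the attachments above, the only leftover is `nc24-high`, which the status API still reports even though the template no longer defines it.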
@matt-chan I know it's not ideal today, but deleting a cluster may have an impact on running jobs, so we have to validate this before implementing it. azhop also relies on the CycleCloud implementation, so unless they change their implementation we may have to delete the cluster.
As a temporary workaround, you can log in to the ccportal VM and remove the cluster with the cyclecloud CLI.
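A rough sketch of that workaround, wrapped in Python so it can be dry-run first. It assumes the `cyclecloud` CLI on the ccportal VM is already initialized and that the `terminate_cluster` and `delete_cluster` subcommands behave as described in the CycleCloud CLI reference; the cluster name `slurm1` is taken from the status API path above, and the helper names are hypothetical. Termination is not instantaneous, so the delete step may need to wait until the cluster has fully terminated. Running jobs will be lost.

```python
import shlex
import subprocess


def removal_commands(cluster_name):
    """Build the CLI calls that terminate and then delete one cluster."""
    return [
        ["cyclecloud", "terminate_cluster", cluster_name],
        ["cyclecloud", "delete_cluster", cluster_name],
    ]


def remove_cluster(cluster_name, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in removal_commands(cluster_name):
        print(shlex.join(command))
        if not dry_run:
            # check=True raises CalledProcessError if the CLI call fails.
            subprocess.run(command, check=True)


# Dry run: only prints the commands that would be executed.
remove_cluster("slurm1")
```

After the cluster is gone, re-running the cccluster and scheduler playbooks should recreate it from the current template.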
Hi @xpillons, we recently ran into an issue related to this bug. If we want to upload new versions of images, it appears that the only way to make cyclecloud use them is to reload the queues ('latest' doesn't seem to refresh until the cluster is reloaded).
Could you share a bit more information about how you make updates to the queues in your own workflow so I can formulate a workaround for us please? Do you only add new queues, and never remove/modify old queues?
Right now we're restarting the entire cluster, which is acceptable because it is unlikely we will modify the queues in production, but if we have to do it every time the image changes, it would be a problem for us.
Also, do you have information on where to find the cyclecloud server source code? I can only find the plugins, so I've been reading the live deployment source code with Vim and grep, which is really not fun. Feel free to email me if the repo is internal-only. Thanks for your help!
Hi @matt-chan, a new image ID can be assigned by running the cccluster and scheduler playbooks. For that new image to be picked up by new nodes, all nodes in the same scaleset need to be torn down. You usually keep the same queue names, as this makes life easier for users submitting jobs. We haven't found an easier way of doing it at the moment, but one could be to:
please contact me internally
In what area(s)?
/area job-scheduling
Expected Behavior
cyclecloud.conf and topology.conf should reflect config.yml
Actual Behavior
Adding a queue into config.yml and then later removing it doesn't remove the entry in cyclecloud.conf or topology.conf.
This also quickly hits the 10-queue limit on cyclecloud and can lead to conflicts in the conf files.
Steps to Reproduce the Problem
Create config.yml with some queues and slurm. Remove some queues and re-run install.sh for cccluster and scheduler. cyclecloud.conf and topology.conf don't reflect the removal.
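To confirm the reproduction, the partitions in the generated conf can be compared against the queue names expected from config.yml. This is a minimal sketch that relies only on standard Slurm `PartitionName=` syntax. The conf excerpt is a hypothetical sample, not real cyclecloud.conf output, and `conf_partitions` and `diff_partitions` are illustrative names.

```python
import re

# Slurm partition definitions start with "PartitionName=<name>".
# Keywords are case-insensitive in Slurm, so match accordingly.
PARTITION_LINE = re.compile(
    r"^\s*PartitionName=(\S+)", re.IGNORECASE | re.MULTILINE
)


def conf_partitions(conf_text):
    """Return the partition names defined in a Slurm conf fragment."""
    return set(PARTITION_LINE.findall(conf_text))


def diff_partitions(expected_queues, conf_text):
    """Compare expected queue names with the partitions found in the conf."""
    found = conf_partitions(conf_text)
    expected = set(expected_queues)
    return {
        "unexpected": sorted(found - expected),  # stale entries in the conf
        "missing": sorted(expected - found),     # queues not yet generated
    }


# Hypothetical excerpt; in practice, read the generated cyclecloud.conf.
sample_conf = """\
# hypothetical excerpt
PartitionName=test_d4 Nodes=test_d4-[1-25] Default=YES
PartitionName=nc24-high Nodes=nc24-high-[1-4]
"""

print(diff_partitions(["test_d4", "ibtest_16"], sample_conf))
```

A non-empty `unexpected` list after removing a queue and re-running the playbooks would indicate the behavior described in this issue.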