Version
v1.0.40, slurm: 22.05.3, cyclecloud: 2.7.2
In what area(s)?
/area ansible /area autoscaling /area configuration /area cyclecloud
Expected Behavior
Nodes (Standard_NC24ads_A100_v4) should autoscale and be configured properly for workloads.
Actual Behavior
Nodes spawn and report that they are being configured for workloads, but they terminate before the job starts. I can log into a node via SSH and see that it is active, but something isn't connecting correctly. I will post the specific error message (it was a Python error from jetpack) when I get a chance to try again. These nodes were working until about a week ago.
Steps to Reproduce the Problem
Deploy a GPU cluster (Standard_NC24ads_A100_v4) using the versions above.