Azure / az-hop

The Azure HPC On-Demand Platform provides an HPC Cluster Ready solution
https://azure.github.io/az-hop/
MIT License

Nodes (Standard_NC24ads_A100_v4) not configuring anymore for v1.0.40 #1919

Open jlphillipsphd opened 2 weeks ago

jlphillipsphd commented 2 weeks ago

Version

v1.0.40, slurm: 22.05.3, cyclecloud: 2.7.2

In what area(s)?

/area ansible /area autoscaling /area configuration /area cyclecloud

Expected Behavior

Nodes (Standard_NC24ads_A100_v4) should autoscale and be configured properly for workloads.

Actual Behavior

Nodes spawn and report that they are being configured for workloads, but ultimately terminate before the job starts. I can log into a node via SSH and see that it is active, but something isn't connecting correctly. I will post back with the specific error message (it was a Python error from jetpack) when I get a chance to try again. These nodes were working until a week or so ago.

Steps to Reproduce the Problem

Deploy a GPU cluster (Standard_NC24ads_A100_v4) using the versions above.

jlphillipsphd commented 2 weeks ago

Wow, phishing in the first few seconds, not suspect at all...

xpillons commented 2 weeks ago

Can you please check the Slurm logs on the node when this occurs?