Azure / cyclecloud-slurm

Azure CycleCloud project to enable users to create, configure, and use Slurm HPC clusters.
MIT License

Compute nodes don't get terminated despite Slurm indicating idle~ status #267

Open jhrmnn opened 4 months ago

jhrmnn commented 4 months ago

CycleCloud version: 8.6.2-3276
Slurm version: 22.05.11

Scaling down after the job queue empties had worked for me many times, until it failed after the cluster had been fully occupied for several days. All jobs were then killed and the compute nodes transitioned to idle~, but CycleCloud didn't deprovision the VMs. How can I investigate the cause of this behavior?
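One way to pin down the mismatch is to list exactly which nodes Slurm reports as idle~ and compare that list with the VMs still shown in CycleCloud. The sketch below is only an illustration: it assumes `sinfo` is on the PATH of the scheduler node, and the helper names `powered_down_idle_nodes` and `query_sinfo` are made up for this example.

```python
import subprocess


def powered_down_idle_nodes(sinfo_output):
    """Return the sorted, de-duplicated node names whose compact state is 'idle~'.

    Expects one "<node> <state>" pair per line, as printed by
    `sinfo -N -h -o "%N %t"`.
    """
    nodes = set()
    for line in sinfo_output.splitlines():
        parts = line.split()
        # A node in several partitions is listed once per partition,
        # so collect names in a set.
        if len(parts) == 2 and parts[1] == "idle~":
            nodes.add(parts[0])
    return sorted(nodes)


def query_sinfo():
    """Run sinfo and return its raw text output.

    -N: one line per node, -h: no header,
    -o "%N %t": node name and compact state.
    """
    result = subprocess.run(
        ["sinfo", "-N", "-h", "-o", "%N %t"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Calling `powered_down_idle_nodes(query_sinfo())` on the scheduler gives the node names to look up in the CycleCloud node list.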

aditigaur4 commented 4 months ago

Can you check for any messages in /opt/azurehpc/slurm/logs/autoscale.log and shutdown.log in the same directory?
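A rough sketch for pulling the suspicious lines out of those two files, in case they are long. The keyword list is an assumption about how problems are worded in the logs, not something taken from the project, and `find_problem_lines` and `scan_logs` are hypothetical helper names.

```python
import re
from pathlib import Path

# Directory and file names from the comment above; adjust if your install differs.
LOG_DIR = Path("/opt/azurehpc/slurm/logs")
LOG_NAMES = ("autoscale.log", "shutdown.log")

# Assumption: problem lines contain one of these keywords (case-insensitive).
PROBLEM = re.compile(r"\b(error|warning|exception|traceback|failed)\b", re.IGNORECASE)


def find_problem_lines(lines):
    """Return (line_number, text) pairs for lines that match PROBLEM."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if PROBLEM.search(line)
    ]


def scan_logs(log_dir=LOG_DIR, names=LOG_NAMES):
    """Map each log name to its problem lines, or to None if the file is missing."""
    results = {}
    for name in names:
        path = Path(log_dir) / name
        if not path.is_file():
            results[name] = None
            continue
        # errors="replace" keeps the scan going if a log has undecodable bytes.
        with path.open(errors="replace") as handle:
            results[name] = find_problem_lines(handle)
    return results
```

Running `scan_logs()` on the scheduler node returns a dict keyed by log name; a value of `None` means the file was not found at that path.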

aditigaur4 commented 4 months ago

Also, are the VMs showing up in CycleCloud? Can you verify whether KeepAlive is set on them in the CycleCloud UI?