Azure / cyclecloud-slurm

Azure CycleCloud project to enable users to create, configure, and use Slurm HPC clusters.
MIT License

GPU allocation stuck on "Waiting for resource configuration" #114

Open maharjun opened 1 year ago

maharjun commented 1 year ago

I've uploaded version 2.7.1 to the cluster (after downloading the blobs from GitHub). I've modified the template to include a GPU partition along with the corresponding parameter definitions, selected NC6 as the GPU machine type, and started the cluster. Everything starts fine and I'm able to allocate the F32_vs nodes that correspond to the hpc partition. When I allocate the GPU node, however, the node starts without any error reported in the console, but Slurm does not appear to recognize this and the allocation is stuck on:

[hpcadmin@ip-0A030006 ~]$ salloc -p gpu -n 1
salloc: Granted job allocation 3
salloc: Waiting for resource configuration
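While salloc sits at "Waiting for resource configuration", the node may briefly appear DOWN in Slurm before being torn down, which is easy to miss. A sketch of catching that transient state by polling `scontrol` (the node name and the Reason text below are illustrative samples, not output from this cluster):

```shell
# On a live cluster you would poll:  scontrol show node <nodename>
# Here we parse a sample line of that output to show what to look for.
sample='NodeName=gpu-1 State=DOWN+CLOUD Reason=gres/gpu count reported lower than configured'
state=$(printf '%s\n' "$sample" | grep -Eo 'State=[^ ]+')
reason=${sample#*Reason=}
echo "$state"    # the State field is what tells you the node was marked unhealthy
echo "$reason"
```

A Reason mentioning a GPU/gres count mismatch is the usual signature of missing or broken GPU drivers on the image.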

This happens with both the CentOS 7 and AlmaLinux 8 operating systems. I use the following cloud-init scripts to install Singularity in each case. These don't appear to be the problem, since any error in a script typically gets reported in the web console.

CentOS 7:

#!/usr/bin/bash
wget https://github.com/sylabs/singularity/releases/download/v3.9.9/singularity-ce-3.9.9-1.el7.x86_64.rpm
sudo yum localinstall -y ./singularity-ce-3.9.9-1.el7.x86_64.rpm
rm ./singularity-ce-3.9.9-1.el7.x86_64.rpm

AlmaLinux 8:

#!/usr/bin/bash
wget https://github.com/sylabs/singularity/releases/download/v3.9.9/singularity-ce-3.9.9-1.el8.x86_64.rpm
sudo yum localinstall -y ./singularity-ce-3.9.9-1.el8.x86_64.rpm
rm ./singularity-ce-3.9.9-1.el8.x86_64.rpm
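The two scripts above differ only in the RPM's `el7`/`el8` suffix. A hedged sketch of a single cloud-init script for both OSes, detecting the major release with the `%{rhel}` RPM macro; `set -euo pipefail` additionally makes any download or install failure abort the script so it surfaces in the console (this is a consolidation sketch, not a tested replacement):

```shell
#!/usr/bin/env bash
# Config/provisioning fragment: install singularity-ce on CentOS 7 or
# AlmaLinux 8 by picking the matching RPM for the detected major release.
set -euo pipefail
rel=$(rpm -E '%{rhel}')        # expands to 7 on CentOS 7, 8 on AlmaLinux 8
rpm_name="singularity-ce-3.9.9-1.el${rel}.x86_64.rpm"
wget "https://github.com/sylabs/singularity/releases/download/v3.9.9/${rpm_name}"
sudo yum localinstall -y "./${rpm_name}"
rm "./${rpm_name}"
```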

anhoward commented 1 year ago

Which image are you using? Does it have the GPU drivers installed? If not, it's probably not reporting the right number of GPUs so Slurm doesn't consider the node to be "healthy". You won't see this error in the console, and would have to catch it at just the right time to see the node in a DOWN state in Slurm. If you can get the slurmd logs from the node and the slurmctld logs from the scheduler, that would tell for sure. I would suggest opening a support ticket in the Azure portal so one of our engineers can help.
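A quick way to check the "wrong GPU count" theory directly on the compute node is to compare what Slurm is configured for with what the driver actually reports. A sketch, assuming the default Slurm gres.conf path (`/etc/slurm/gres.conf`), which may differ on a CycleCloud install:

```shell
# Compare the GPU count Slurm expects (gres.conf entries) with what the
# NVIDIA driver sees. Path is the Slurm default; adjust for your install.
expected=$(grep -c 'Name=gpu' /etc/slurm/gres.conf 2>/dev/null) || expected=0
if command -v nvidia-smi >/dev/null 2>&1; then
  actual=$(nvidia-smi --list-gpus | wc -l)
else
  actual=0   # no driver installed: slurmd reports 0 GPUs and the node is marked DOWN
fi
echo "expected=$expected actual=$actual"
```

If `actual` is lower than `expected` (typically 0 when the image ships without GPU drivers), Slurm will refuse to bring the node into service, matching the behavior described above.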

maharjun commented 1 year ago

I figured this out pretty much exactly as you described. However, a bigger problem is that AlmaLinux 8 - HPC (the default image for cyclecloud-slurm) doesn't support NC-series instances (I confirmed this by creating a standalone NC6 VM with the almalinux-hpc image), probably due to a driver incompatibility. It does work with NV-series nodes, though.

I just wish these sorts of things were well documented (either in the cyclecloud-slurm README or elsewhere). I have contacted the support channels.