NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

Support for Mariner (Azure Linux) #113

Closed LiquidPT closed 10 months ago

LiquidPT commented 10 months ago

I'm trying to create an official Microsoft Azure HPC/AI VM image based on Azure Linux (CBL-Mariner 2.0) to run on Azure virtual machines (Azure VMSS) and I can't find a version of DCGM that will install on it.

  1. I tried the latest Fedora DCGM installation (since many people have suggested Mariner is similar to Fedora) from https://developer.download.nvidia.com/compute/cuda/repos/fedora37/x86_64/cuda-fedora37.repo, and got back "No valid Platform ID detected". I also tried the RHEL8 repo, since the two seem to have some similarity as well.
  2. I tried cloning the GitHub repo to the VM and running /dcgmbuilds/build.sh, but the version of Docker that's available for Mariner (20.10.25) errors out on many of the options for docker build, starting with "--compress", and the errors continue from there if I delete that one.

Are there any plans for supporting Mariner and/or are there any workarounds to enable using it?

nikkon-dev commented 10 months ago

@LiquidPT,

Are you able to install the https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/datacenter-gpu-manager-3.2.6-1-x86_64.rpm directly? Please reach out to Kalpesh Patel at Microsoft to avoid duplicating efforts. There's already a communication channel regarding Mariner.

LiquidPT commented 10 months ago

That appears to have worked. I'll reach out to Kalpesh though.
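
For anyone landing here later, the workaround above can be sketched roughly as follows. This is only an illustration: the repo name, architecture, and version come from the URL quoted in the thread, while the helper names (`dcgm_rpm_url`, `install_command`) and the choice of `rpm -ivh` are assumptions, since the thread does not say which command was actually run on the VM.

```python
import shlex

# Values taken from the direct-download URL quoted in the thread.
BASE_URL = "https://developer.download.nvidia.com/compute/cuda/repos"
REPO = "rhel9"
ARCH = "x86_64"
DCGM_VERSION = "3.2.6-1"


def dcgm_rpm_url(repo=REPO, arch=ARCH, version=DCGM_VERSION):
    """Build the direct download URL for the DCGM RPM (hypothetical helper)."""
    return f"{BASE_URL}/{repo}/{arch}/datacenter-gpu-manager-{version}-{arch}.rpm"


def install_command(url):
    """Return the argv for installing the RPM directly, bypassing the repo file.

    Assumption: plain `rpm -ivh <url>` is used; the thread does not name the tool.
    """
    return ["sudo", "rpm", "-ivh", url]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(install_command(dcgm_rpm_url())))
```

Installing the package file directly skips the repo-level platform detection that produced "No valid Platform ID detected" with the Fedora .repo file.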