NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

Support for Amazon Linux 2023 (AL2023) #181

Open mbacchi opened 2 months ago

mbacchi commented 2 months ago

Your DCGM documentation's list of supported platforms does not include Amazon Linux 2023 (AL2023) [0].

But you do support AL2023 in your CUDA toolkit [1], and you have many packages for AL2023 in your repository [2].

We primarily use dcgm-exporter. In the past it was possible to build dcgm-exporter for the previous release, Amazon Linux 2 (AL2), by installing the RHEL version of the DCGM package from your package repositories. But AL2023 is not compatible with RHEL, so those packages don't work on AL2023. Ideally I wouldn't have to build dcgm-exporter at all and could simply install it from your package repository, as users of other Linux distributions can.

I've tried building DCGM from source on AL2023 and unfortunately got an error. I'm not going to document that error here because the more pertinent question is: why do you provide DCGM packages for other Linux distributions but not for AL2023?

I'm officially requesting that the following packages be built for AL2023 and provided in your amzn2023 package repository [2]:

Thanks for your support!

-Matt