NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA datacenter GPUs in cluster environments. It includes active health monitoring, comprehensive diagnostics, system alerts and governance policies including power and clock management. It can be used standalone by infrastructure teams and easily integrates into cluster management tools, resource scheduling and monitoring products from NVIDIA partners.
DCGM simplifies GPU administration in the data center, improves resource reliability and uptime, automates administrative tasks, and helps drive overall infrastructure efficiency. DCGM supports Linux operating systems on x86_64, Arm and POWER (ppc64le) platforms. The installer packages include libraries, binaries, NVIDIA Validation Suite (NVVS) and source examples for using the API (C, Python and Go).
DCGM integrates into the Kubernetes ecosystem by allowing users to gather GPU telemetry using dcgm-exporter.
More information is available on DCGM's official page.
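As an illustration of the telemetry path, dcgm-exporter exposes DCGM metrics in Prometheus text format (by default on port 9400). A minimal parsing sketch; DCGM_FI_DEV_GPU_UTIL is a standard dcgm-exporter field, but the port and label set depend on your deployment:

```python
import re

def parse_gpu_util(metrics_text):
    """Map each GPU's 'gpu' label to its DCGM_FI_DEV_GPU_UTIL value.

    Parses Prometheus text-format lines such as dcgm-exporter produces:
      DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-..."} 93
    """
    pattern = re.compile(r'^DCGM_FI_DEV_GPU_UTIL\{([^}]*)\}\s+([0-9.]+)')
    util = {}
    for line in metrics_text.splitlines():
        m = pattern.match(line)
        if m:
            labels = dict(re.findall(r'(\w+)="([^"]*)"', m.group(1)))
            util[labels.get("gpu", "?")] = float(m.group(2))
    return util

# To scrape a live exporter (assumes the default dcgm-exporter port 9400):
#   import urllib.request
#   text = urllib.request.urlopen("http://localhost:9400/metrics").read().decode()
#   print(parse_gpu_util(text))
```

In practice Prometheus itself does this scraping; the sketch just shows the shape of the data dcgm-exporter publishes.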
DCGM installer packages are available on the CUDA network repository and DCGM can be easily installed using Linux package managers.
On Ubuntu LTS, set up the CUDA network repository metadata and GPG key:
$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
$ sudo dpkg -i cuda-keyring_1.0-1_all.deb
$ sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
Install DCGM:
$ sudo apt-get update \
&& sudo apt-get install -y datacenter-gpu-manager
On Red Hat / CentOS, set up the CUDA network repository metadata and GPG key:
$ sudo dnf config-manager \
--add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
Install DCGM:
$ sudo dnf clean expire-cache \
&& sudo dnf install -y datacenter-gpu-manager
Enable the DCGM systemd service (on reboot) and start it now:
$ sudo systemctl --now enable nvidia-dcgm
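After enabling the service, a quick sanity check is to list the GPUs DCGM can see. A minimal sketch, assuming dcgmi (installed by the datacenter-gpu-manager package) is on PATH and the nvidia-dcgm service, or a manually started nv-hostengine, is running:

```python
import shutil
import subprocess

def dcgm_gpu_list():
    """Return the output of 'dcgmi discovery -l', or a diagnostic message.

    'dcgmi discovery -l' asks the running DCGM host engine to list the
    GPUs it manages; it fails if the nvidia-dcgm service is not active.
    """
    if shutil.which("dcgmi") is None:
        return "dcgmi not found -- is datacenter-gpu-manager installed?"
    try:
        result = subprocess.run(["dcgmi", "discovery", "-l"],
                                capture_output=True, text=True, timeout=30)
    except subprocess.TimeoutExpired:
        return "dcgmi timed out -- is the nvidia-dcgm service running?"
    return result.stdout or result.stderr or "dcgmi produced no output"

print(dcgm_gpu_list())
```

On a healthy install this prints one entry per GPU; the fallback messages point at the most common misconfigurations.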
For information on platform support, getting started and using DCGM APIs, visit the official documentation repository.
Once this repository is cloned, DCGM can be built by first creating the Docker build image and then running the build script inside it.
The Docker build image provides two benefits:
New dependencies can be added by adding a script in the "scripts" directory similar to the existing scripts.
As DCGM needs to support some older Linux distributions on various CPU architectures, the image provides custom builds of GCC that produce binaries depending only on older versions of the glibc libraries. The build image also contains all third-party libraries precompiled with those custom GCC builds.
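The glibc constraint above can be checked directly on a produced binary: every versioned symbol a binary imports leaves a literal GLIBC_x.y string in its ELF symbol version tables, and the highest version present is the minimum glibc required to run it. A rough sketch (a byte-level scan rather than a proper ELF parse, so treat the result as indicative only):

```python
import os
import re

def glibc_versions(path):
    """Return the sorted set of GLIBC_x.y version strings found in a binary.

    Versioned glibc symbols embed literal 'GLIBC_x.y' strings in the ELF
    version tables, so a simple byte scan finds the versions a binary
    references without needing readelf or other binutils tools.
    """
    with open(path, "rb") as f:
        data = f.read()
    found = re.findall(rb"GLIBC_\d+\.\d+(?:\.\d+)?", data)
    return sorted({m.decode() for m in found})

# Example: inspect a system binary. On Linux this typically prints
# several GLIBC_x.y entries; the maximum is the minimum glibc required.
target = "/bin/ls"
if os.path.exists(target):
    print(glibc_versions(target))
```

Running this against a freshly built DCGM binary is a quick way to confirm that it does not require a glibc newer than the oldest supported distribution provides.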
In order to create the build image and to then generate a DCGM build, you will need to have the following installed and configured:
The build.sh script was tested on Linux, Windows (WSL2), and macOS, though macOS may need some minor changes in the script (such as s/awk/gawk/), as macOS is not an officially supported development environment.
The build image is stored in ./dcgmbuild. The image can be built by:

cd ./dcgmbuild
./build.sh
Note that if your user does not have permission to access the Docker socket, you will need to run sudo ./build.sh instead.
The build process may take several hours to create the image as the image is building 3 versions of GCC toolset for all supported platforms. Once the image has been built, it can be reused to build DCGM.
Once the build image is created, you can run build.sh to produce builds. A simple Debian package build of release (non-debug) code for an x86_64 system can be made with:

./build.sh -r --deb

The .deb package will be placed in _out/Linux-amd64-release/datacenter-gpu-manager_2.1.4_amd64.deb; it can then be installed as needed. The script also includes options for building just the binaries (the default), tarballs (--packages), or an RPM (--rpm). A complete list of options can be seen using ./build.sh -h.
DCGM includes an extensive test suite that can be run on any system with one or more supported GPUs. After successfully building DCGM, a datacenter-gpu-manager-tests package is created alongside the normal DCGM package. There are multiple ways to run the tests, but the most straightforward steps are the following:

1. Install the datacenter-gpu-manager-tests package.
2. cd to usr/share/dcgm_tests.
3. Execute run_tests.sh.
Notes:
If DCGM was installed via the .deb or .rpm file, then the location is /usr/share/dcgm_tests. If the package was a .tar.gz, then the location is relative to where it was uncompressed.

Issues in DCGM can be reported by opening an issue in GitHub. Please include in reporting an issue:
- If the problem is reproducible with nv-hostengine, you can start it with -f /tmp/hostengine.log --log-level ERROR to generate a log file with all error messages in /tmp/hostengine.log.
- If the issue is with the DCGM diagnostic, add --debugLogFile /tmp/diag.log -d ERROR to your command line in order to generate /tmp/diag.log with all error messages.
- The output of nvidia-smi and nvidia-smi -q.
- The output of dcgmi -v.

The following template may be helpful:
GPU SKU(s):
OS:
DRIVER:
GPU power settings (nvidia-smi -q -d POWER):
CPU(s):
RAM:
Topology (nvidia-smi topo -m):
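Most of the template above can be filled in automatically. A hedged sketch covering some of the fields (the nvidia-smi query flags used are standard, but command availability and output depend on your system; CPU and RAM details are left out for brevity):

```python
import subprocess

def run(cmd):
    """Run a command, returning trimmed output or a placeholder."""
    try:
        out = subprocess.run(cmd, capture_output=True, text=True,
                             timeout=30).stdout.strip()
        return out or "(no output)"
    except (OSError, subprocess.TimeoutExpired):
        return "(unavailable)"

def build_report():
    """Fill in issue-template fields from the local system."""
    fields = [
        ("GPU SKU(s)", run(["nvidia-smi", "--query-gpu=name",
                            "--format=csv,noheader"])),
        ("OS", run(["uname", "-sr"])),
        ("DRIVER", run(["nvidia-smi", "--query-gpu=driver_version",
                        "--format=csv,noheader"])),
        ("GPU power settings", run(["nvidia-smi", "-q", "-d", "POWER"])),
        ("Topology", run(["nvidia-smi", "topo", "-m"])),
    ]
    return "\n".join(f"{name}: {value}" for name, value in fields)

print(build_report())
```

Fields whose commands are missing or fail are reported as "(unavailable)" rather than aborting, so the partial report is still usable when filing an issue.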
We ask that all community members and users of DCGM follow the standard NVIDIA process for reporting security vulnerabilities. This process is documented at the NVIDIA Product Security website. Following the process will result in any needed CVE being created, as well as appropriate notifications being communicated to the entire DCGM community.
Please refer to the policies listed there to answer questions related to reporting security issues.
DCGM releases will be tagged once the release is finalized. The last commit will be the one that sets the release version, and we will then tag the release. Release tags will be the release version prepended with a v, for example v2.0.13.
The source code for DCGM in this repository is licensed under Apache 2.0. Binary installer packages for DCGM are available for download from the product page and are licensed under the NVIDIA DCGM SLA.