NVIDIA Data Center GPU Manager

Data Center GPU Manager (DCGM) is a daemon that allows users to monitor NVIDIA data-center GPUs. You can find out more about DCGM by visiting DCGM's official page.

Introduction

NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA datacenter GPUs in cluster environments. It includes active health monitoring, comprehensive diagnostics, system alerts and governance policies including power and clock management. It can be used standalone by infrastructure teams and easily integrates into cluster management tools, resource scheduling and monitoring products from NVIDIA partners.

DCGM simplifies GPU administration in the data center, improves resource reliability and uptime, automates administrative tasks, and helps drive overall infrastructure efficiency. DCGM supports Linux operating systems on x86_64, Arm and POWER (ppc64le) platforms. The installer packages include libraries, binaries, NVIDIA Validation Suite (NVVS) and source examples for using the API (C, Python and Go).

DCGM integrates into the Kubernetes ecosystem by allowing users to gather GPU telemetry using dcgm-exporter.

More information is available on DCGM's official page.

Quickstart

DCGM installer packages are available on the CUDA network repository and DCGM can be easily installed using Linux package managers.

Ubuntu LTS

Set up the CUDA network repository metadata and GPG key:

$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
$ sudo dpkg -i cuda-keyring_1.0-1_all.deb
$ sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"

Install DCGM

$ sudo apt-get update \
    && sudo apt-get install -y datacenter-gpu-manager
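
The Ubuntu commands above hard-code ubuntu2004 in the repository URL. As a minimal sketch, assuming the repository names its directories as the distribution ID followed by the version without dots (an assumption to verify by checking that the directory exists), that path component can be derived from /etc/os-release. The function name cuda_repo_distro is hypothetical.

```shell
# Hypothetical helper: print the distro path component (e.g. "ubuntu2004")
# used in the CUDA network repository URL, read from an os-release style file.
# Assumption: directories are named ID + VERSION_ID with the dots removed.
cuda_repo_distro() (
    os_release="${1:-/etc/os-release}"
    # Source the file inside this subshell so ID and VERSION_ID do not leak out.
    . "$os_release"
    printf '%s%s\n' "$ID" "$(printf '%s' "$VERSION_ID" | tr -d '.')"
)
```

The result can be substituted for ubuntu2004 in the URLs above, for example https://developer.download.nvidia.com/compute/cuda/repos/$(cuda_repo_distro)/x86_64/.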

Red Hat

Set up the CUDA network repository metadata and GPG key:

$ sudo dnf config-manager \
    --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo

Install DCGM

$ sudo dnf clean expire-cache \
    && sudo dnf install -y datacenter-gpu-manager

Start the DCGM service

$ sudo systemctl --now enable nvidia-dcgm
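
After enabling the service, it can be useful to confirm that it is running and that DCGM can see the GPUs. The following is a minimal sketch, assuming a systemd host and the dcgmi command-line tool from the datacenter-gpu-manager package; the function name check_dcgm is hypothetical.

```shell
# Sketch: report whether the nvidia-dcgm service is active, then list GPUs.
# Returns 0 if active, 1 if not active, 2 if systemctl is unavailable.
check_dcgm() {
    if ! command -v systemctl >/dev/null 2>&1; then
        echo "systemctl not found; cannot check nvidia-dcgm" >&2
        return 2
    fi
    if systemctl is-active --quiet nvidia-dcgm; then
        echo "nvidia-dcgm is active"
        # dcgmi discovery -l lists the GPUs that DCGM can see.
        if command -v dcgmi >/dev/null 2>&1; then
            dcgmi discovery -l
        fi
    else
        echo "nvidia-dcgm is not active" >&2
        return 1
    fi
}
```

Running check_dcgm right after the systemctl command above gives a quick pass/fail signal before moving on to the APIs.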

Product Documentation

For information on platform support, getting started and using DCGM APIs, visit the official documentation repository.

Building DCGM

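Once this repo is cloned, DCGM can be built in two steps: create the Docker build image, then use that image to generate a DCGM build. Both steps are described in the sections that follow.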

Why a Docker build image

The Docker build image provides two benefits

To add a new dependency, add a script to the “scripts” directory, modeled on the existing scripts.

As DCGM needs to support some older Linux distributions on various CPU architectures, the image provides custom builds of GCC compilers that produce binaries which depend on older versions of the glibc libraries. The DCGM build image also contains all third-party libraries, precompiled using those custom GCC builds.

Prerequisites

In order to create the build image and to then generate a DCGM build, you will need to have the following installed and configured:

The build.sh script was tested on Linux, Windows (WSL2) and macOS. macOS is not an officially supported development environment, so the script may need minor changes there (such as s/awk/gawk/).

Creating the build image

The build image is stored in ./dcgmbuild.

The image can be built by running the build.sh script in ./dcgmbuild while Docker is installed and running.

Note that if your user does not have permission to access the Docker socket, you will need to run sudo ./build.sh
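
As a minimal sketch of the permission check described in the note above, the following function prints the command to use for the image build. It treats a failing docker info as a sign that the current user cannot reach the Docker socket, which is an assumption: docker info also fails when the daemon is not running, and sudo will not help in that case. The function name image_build_cmd is hypothetical.

```shell
# Sketch: print the image build command, adding sudo only when the
# current user cannot talk to the Docker daemon.
image_build_cmd() {
    if docker info >/dev/null 2>&1; then
        echo "./build.sh"
    else
        echo "sudo ./build.sh"
    fi
}
```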

Creating the image may take several hours, because it builds three versions of the GCC toolset for all supported platforms. Once the image has been built, it can be reused to build DCGM.

Generating a DCGM build

Once the build image is created, you can run build.sh to produce builds. A simple Debian package build of release (non-debug) code for an x86_64 system can be made with:

./build.sh -r --deb

The .deb package will be placed in _out/Linux-amd64-release/datacenter-gpu-manager_2.1.4_amd64.deb and can then be installed as needed. The script also includes options for building just the binaries (default), tarballs (--packages), or an RPM (--rpm). A complete list of options can be seen using ./build.sh -h.
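
Because the version and architecture in the file name change between builds, it can help to search the output directory rather than type the path by hand. This is a minimal sketch; the file name pattern is an assumption based on the single example path given here, and the function name find_dcgm_debs is hypothetical.

```shell
# Sketch: list DCGM .deb packages produced under an output directory.
# Defaults to _out, the directory used in the example path.
find_dcgm_debs() {
    find "${1:-_out}" -type f -name 'datacenter-gpu-manager_*.deb' | sort
}
```

A package found this way can then be installed with, for example, sudo dpkg -i followed by the printed path.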

Running the Test Framework

DCGM includes an extensive test suite that can be run on any system with one or more supported GPUs. After successfully building DCGM, a datacenter-gpu-manager-tests package is created alongside the normal DCGM package. There are multiple ways to run the tests, but the most straightforward steps are the following:

  1. Install or extract the datacenter-gpu-manager-tests package
  2. Navigate to usr/share/dcgm_tests
  3. Execute run_tests.sh
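
The steps above can be sketched as a small shell function. It assumes the tests package has already been installed or extracted, and takes the prefix that contains usr/share/dcgm_tests as its argument ("/" by default, which is an assumption for an installed package). The function name run_dcgm_tests is hypothetical.

```shell
# Sketch: run the DCGM test suite from an installed or extracted
# datacenter-gpu-manager-tests package.
run_dcgm_tests() (
    prefix="${1:-/}"
    # Strip one trailing slash so "/" and "/some/dir/" both join cleanly.
    test_dir="${prefix%/}/usr/share/dcgm_tests"
    if [ ! -f "$test_dir/run_tests.sh" ]; then
        echo "run_tests.sh not found in $test_dir" >&2
        exit 1
    fi
    # Run from the test directory, as in steps 2 and 3.
    cd "$test_dir" && ./run_tests.sh
)
```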

Notes:

Reporting An Issue

Issues in DCGM can be reported by opening an issue on GitHub. When reporting an issue, please include:

The following template may be helpful:

GPU SKU(s):
OS:
DRIVER:
GPU power settings (nvidia-smi -q -d POWER):
CPU(s):
RAM:
Topology (nvidia-smi topo -m):
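
Most of the template fields can be collected from the affected system with standard tools. The following is a minimal sketch for a Linux host, not an official reporting tool: the helper names run_or_na and dcgm_issue_report are hypothetical, and the choice of /etc/os-release, /proc/cpuinfo and /proc/meminfo as data sources is an assumption. Fields whose command is missing or fails are reported as "unavailable".

```shell
# Sketch: print the issue template filled in from the local system.
run_or_na() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>/dev/null || echo "unavailable"
    else
        echo "unavailable"
    fi
}

dcgm_issue_report() (
    # First "model name" entry in /proc/cpuinfo; absent on some architectures.
    cpu=$(awk -F': *' '/^model name/ {print $2; exit}' /proc/cpuinfo 2>/dev/null) || true
    ram=$(awk '/^MemTotal/ {print $2, $3}' /proc/meminfo 2>/dev/null) || true
    os=$( (. /etc/os-release && echo "$PRETTY_NAME") 2>/dev/null ) || true

    echo "GPU SKU(s): $(run_or_na nvidia-smi --query-gpu=name --format=csv,noheader | sort -u)"
    echo "OS: ${os:-$(uname -sr)}"
    echo "DRIVER: $(run_or_na nvidia-smi --query-gpu=driver_version --format=csv,noheader | sort -u)"
    echo "GPU power settings (nvidia-smi -q -d POWER):"
    run_or_na nvidia-smi -q -d POWER
    echo "CPU(s): ${cpu:-$(uname -m)}"
    echo "RAM: ${ram:-unavailable}"
    echo "Topology (nvidia-smi topo -m):"
    run_or_na nvidia-smi topo -m
)
```

Redirecting the output to a file, for example dcgm_issue_report > report.txt, gives a block that can be pasted into the issue and edited by hand.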

Reporting Security Issues

We ask that all community members and users of DCGM follow the standard NVIDIA process for reporting security vulnerabilities. This process is documented at the NVIDIA Product Security website. Following the process will result in any needed CVE being created as well as appropriate notifications being communicated to the entire DCGM community.

Please refer to the policies listed there to answer questions related to reporting security issues.

Tagging DCGM Releases

DCGM releases will be tagged once the release is finalized. The last commit will be the one that sets the release version, and we will then tag the release. Release tags will be the release version prefixed with a v, for example v2.0.13.
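
The naming convention can be sketched as a small helper that turns a release version into its tag name. The check that the version is dotted and numeric is an assumption based on the single example given here, and the function name release_tag is hypothetical.

```shell
# Sketch: build the tag name for a release version ("v" + version).
# Rejects empty input and anything other than digits and dots.
release_tag() {
    case "$1" in
        ''|*[!0-9.]*)
            echo "not a dotted numeric version: $1" >&2
            return 1
            ;;
    esac
    printf 'v%s\n' "$1"
}
```

In a clone of the repository, git tag --list 'v*' shows the tags that exist, and git checkout "$(release_tag 2.0.13)" would check out the release used in the example.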

License

The source code for DCGM in this repository is licensed under Apache 2.0. Binary installer packages for DCGM are available for download from the product page and are licensed under the NVIDIA DCGM SLA.

Additional Topics