NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

dcgmi version and dcgm-exporter version #319

Closed · nghtm closed this issue 4 months ago

nghtm commented 4 months ago

Ask your question

Hi,

I am hoping to understand the difference between the version reported by dcgmi -v and the version of dcgm-exporter that should be used.

I want to understand what version of dcgm-exporter I should specify for my Docker container. When I run the following, I see dcgmi version = 3.3.5:

ubuntu@ip-10-1-22-213:~$ dcgmi -v
Version : 3.3.5
Build ID : 14
Build Date : 2024-02-24
Build Type : Release
Commit ID : 93088b0e1286c6e7723af1930251298870e26c19
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : 08a0d9624b562a1342bf5f8828939294

When I create my Docker container, what version should I specify?

    # Set DCGM Exporter version
    DCGM_EXPORTER_VERSION=3.3.5-3.4.0-ubuntu22.04

    # Run the DCGM Exporter Docker container
    sudo docker run -d --restart always \
       --gpus all \
       --net host \
       --cap-add SYS_ADMIN \
       -v /opt/dcgm-exporter/dcgm-golden-metrics.csv:/etc/dcgm-exporter/dcgm-golden-metrics.csv \
       nvcr.io/nvidia/k8s/dcgm-exporter:${DCGM_EXPORTER_VERSION} \
       -f /etc/dcgm-exporter/dcgm-golden-metrics.csv || { echo "Failed to run DCGM Exporter Docker container"; exit 1; }
glowkey commented 4 months ago

The first set of numbers in the DCGM-Exporter version corresponds to the DCGM library version used in the container and in testing (3.3.5 in your case). The second set of numbers (3.4.0) corresponds to the DCGM-Exporter version itself. However, DCGM follows semver compatibility guidelines, so any 3.x version should be compatible.
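
To make that concrete, here is a small sketch (not an official check) that compares the host DCGM version reported by dcgmi -v against the DCGM component of the image tag; the parsing assumes the output format shown above:

    # Sketch only: extract the host DCGM version and the first component of the
    # image tag, then compare major versions (DCGM follows semver).
    HOST_DCGM=$(dcgmi -v | awk '/^Version/ {print $3}')   # e.g. 3.3.5
    TAG=3.3.5-3.4.0-ubuntu22.04
    IMAGE_DCGM=${TAG%%-*}                                 # e.g. 3.3.5
    if [ "${HOST_DCGM%%.*}" = "${IMAGE_DCGM%%.*}" ]; then
        echo "Host DCGM ${HOST_DCGM} and image DCGM ${IMAGE_DCGM} share a major version"
    fi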

nghtm commented 4 months ago

Thank you for the response, helpful info on versions. :-)

When I try running this container with DCGM_EXPORTER_VERSION=3.3.5-3.4.0-ubuntu22.04 and dcgmi -v = 3.3.5, it fails and causes nvidia-smi to report errors on GPU 0. Prior to running the container, nvidia-smi showed all GPUs as healthy. I examined the nvidia-bug-report output and found the following message:

Apr 30 21:13:04 ip-10-1-5-148 dockerd[10261]: time="2024-04-30T21:13:04.829815111Z" level=error msg="restartmanger wait error: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: detection error: nvml error: unknown error: unknown"

For GPU 0, which shows ERR!, the NVSMI log shows:

==============NVSMI LOG==============

Timestamp                                 : Tue Apr 30 21:16:21 2024
Driver Version                            : 535.161.08
CUDA Version                              : 12.2

Attached GPUs                             : 8
GPU 00000000:00:16.0
    Product Name                          : NVIDIA A10G
    Product Brand                         : Unknown Error
    Product Architecture                  : Ampere
    Display Mode                          : N/A
    Display Active                        : N/A
    Persistence Mode                      : Enabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : Unknown Error
        Pending                           : Unknown Error
    Accounting Mode                       : N/A
    Accounting Mode Buffer Size           : N/A
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1652222014738
    GPU UUID                              : Unknown Error
    Minor Number                          : 0
    VBIOS Version                         : Unknown Error
    MultiGPU Board                        : N/A
    Board ID                              : N/A
    Board Part Number                     : 900-2G133-A840-100
    GPU Part Number                       : 2237-892-A1
    FRU Part Number                       : N/A
    Module ID                             : Unknown Error
    Inforom Version
        Image Version                     : N/A
        OEM Object                        : N/A
        ECC Object                        : N/A
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : 535.161.08
    GPU Virtualization Mode
        Virtualization Mode               : N/A
        Host VGPU Mode                    : N/A
    GPU Reset Status
        Reset Required                    : N/A
        Drain and Reset Recommended       : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x00
        Device                            : 0x16
        Domain                            : 0x0000
        Device Id                         : 0x223710DE
        Bus Id                            : 00000000:00:16.0
        Sub System Id                     : 0x152F10DE
        GPU Link Info
            PCIe Generation
                Max                       : N/A
                Current                   : N/A
                Device Current            : N/A
                Device Max                : N/A
                Host Max                  : N/A
            Link Width
                Max                       : N/A
                Current                   : N/A
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : Unknown Error
        Replay Number Rollovers           : Unknown Error
        Tx Throughput                     : Unknown Error
        Rx Throughput                     : Unknown Error
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : Unknown Error
    Performance State                     : Unknown Error
    Clocks Event Reasons                  : N/A
    Sparse Operation Mode                 : Unknown Error
    FB Memory Usage
        Total                             : 23028 MiB
        Reserved                          : 512 MiB
        Used                              : 0 MiB
        Free                              : 22515 MiB
    BAR1 Memory Usage
        Total                             : N/A
        Used                              : N/A
        Free                              : N/A
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : N/A
        Memory                            : N/A
        Encoder                           : N/A
        Decoder                           : N/A
        JPEG                              : N/A
        OFA                               : N/A
    Encoder Stats
        Active Sessions                   : N/A
        Average FPS                       : N/A
        Average Latency                   : N/A
    FBC Stats
        Active Sessions                   : N/A
        Average FPS                       : N/A
        Average Latency                   : N/A
    ECC Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable Parity     : N/A
            SRAM Uncorrectable SEC-DED    : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable Parity     : N/A
            SRAM Uncorrectable SEC-DED    : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
            SRAM Threshold Exceeded       : N/A
        Aggregate Uncorrectable SRAM Sources
            SRAM L2                       : N/A
            SRAM SM                       : N/A
            SRAM Microcontroller          : N/A
            SRAM PCIE                     : N/A
            SRAM Other                    : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : Unknown Error
    Temperature
        GPU Current Temp                  : Unknown Error
        GPU T.Limit Temp                  : Unknown Error
        GPU Shutdown T.Limit Temp         : Unknown Error
        GPU Slowdown T.Limit Temp         : Unknown Error
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating T.Limit Temp : Unknown Error
    GPU Power Readings
        Power Draw                        : N/A
        Current Power Limit               : 670166.31 W
        Requested Power Limit             : 0.00 W
        Default Power Limit               : Unknown Error
        Min Power Limit                   : Unknown Error
        Max Power Limit                   : Unknown Error
    Module Power Readings
        Power Draw                        : Unknown Error
        Current Power Limit               : Unknown Error
        Requested Power Limit             : 0.00 W
        Default Power Limit               : Unknown Error
        Min Power Limit                   : Unknown Error
        Max Power Limit                   : Unknown Error
    Clocks
        Graphics                          : N/A
        SM                                : N/A
        Memory                            : N/A
        Video                             : N/A
    Applications Clocks
        Graphics                          : Unknown Error
        Memory                            : Unknown Error
    Default Applications Clocks
        Graphics                          : Unknown Error
        Memory                            : Unknown Error
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : N/A
        SM                                : N/A
        Memory                            : N/A
        Video                             : N/A
    Max Customer Boost Clocks
        Graphics                          : Unknown Error
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : Unknown Error
    Fabric
        State                             : N/A
        Status                            : N/A
    Processes                             : None
nvvfedorov commented 4 months ago

You need to install and configure the NVIDIA Container Toolkit. It seems that it is not configured correctly, which is why you see this error:

Apr 30 21:13:04 ip-10-1-5-148 dockerd[10261]: time="2024-04-30T21:13:04.829815111Z" level=error msg="restartmanger wait error: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: detection error: nvml error: unknown error: unknown"
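
In case it helps, the steps from the install guide to point Docker at the NVIDIA runtime are roughly the following (see the NVIDIA Container Toolkit documentation for the authoritative version):

    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker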
nghtm commented 4 months ago

Thanks for the response.

nvidia-container-toolkit is installed.

ubuntu@ip-10-1-5-148:/var/log$ dpkg -l | grep nvidia-container-toolkit
ii  nvidia-container-toolkit               1.15.0-1                              amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base          1.15.0-1                              amd64        NVIDIA Container Toolkit Base
ubuntu@ip-10-1-5-148:/var/log$ cat /etc/nvidia-container-runtime/config.toml
#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc"]

[nvidia-container-runtime.modes]

Sounds like I will need to debug this further. I will report back if I determine a root cause.

nvvfedorov commented 4 months ago

@nghtm, try running the sample workload as suggested here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/sample-workload.html#running-a-sample-workload-with-docker. This will tell us whether the NVIDIA runtime is configured correctly.
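
For reference, the sample workload from that page boils down to:

    sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi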

nghtm commented 4 months ago

We are installing nvidia-container-toolkit on the node via this script:

The docker configuration defaults to:

{
    "data-root": "/opt/dlami/nvme/docker/data-root"
}

But I can typically run GPU containers via Docker with this configuration. For example, sudo docker run --rm --gpus all ubuntu nvidia-smi works.

However, when I launch the dcgm-exporter container and follow the docker logs, it fails after about 1 minute:

docker logs 92c05c0f81ba
time="2024-04-30T22:16:32Z" level=info msg="Starting dcgm-exporter"
time="2024-04-30T22:16:32Z" level=info msg="DCGM successfully initialized!"
time="2024-04-30T22:16:33Z" level=info msg="Collecting DCP Metrics"
time="2024-04-30T22:16:33Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcgm-golden-metrics.csv'"
time="2024-04-30T22:16:33Z" level=info msg="Initializing system entities of type: GPU"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-04-30T22:17:15Z" level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"
nghtm commented 4 months ago

Trying to go back to the base dcgm-exporter container, which uses /etc/dcgm-exporter/dcp-metrics-included.csv instead of the custom CSV file I have written, to see if that fixes the container.

    sudo docker run -d --rm \
       --gpus all \
       --net host \
       --cap-add SYS_ADMIN \
       nvcr.io/nvidia/k8s/dcgm-exporter:${DCGM_EXPORTER_VERSION}-ubuntu20.04 \
       -f /etc/dcgm-exporter/dcp-metrics-included.csv 
nghtm commented 4 months ago

For reference, this is the install script for dcgm-exporter that has been causing the container failures on g5.48xlarge (A10G GPUs):

https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/install_dcgm_exporter.sh

nghtm commented 4 months ago

It seems to be working without issues on the H100s, so perhaps some of the custom metrics are not available on the A10Gs (just a hypothesis).
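
One way to test that hypothesis would be to point the exporter at a stripped-down CSV that contains only plain device (DCGM_FI_DEV_*) fields and none of the DCGM_FI_PROF_* profiling fields. A sketch, with a made-up file name; the field names and format follow the stock dcp-metrics-included.csv:

    # minimal-metrics.csv (hypothetical name) - only non-profiling device fields.
    # Format: <DCGM field>, <Prometheus type>, <help text>
    DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).
    DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
    DCGM_FI_DEV_GPU_UTIL,    gauge, GPU utilization (in %).
    DCGM_FI_DEV_FB_USED,     gauge, Framebuffer memory used (in MiB).

Mount it the same way as the golden-metrics file and pass it with -f; if the exporter stays up with this file but dies once the profiling fields are included, that points at the DCP module on the A10Gs.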

nghtm commented 4 months ago

The error repeats when trying to run the container on A10G GPUs, but it works on H100 GPUs.

On the A10Gs, the docker logs show:

ubuntu@ip-10-1-5-148:~$ docker logs ca88122482d5
time="2024-04-30T23:14:28Z" level=info msg="Starting dcgm-exporter"
time="2024-04-30T23:14:28Z" level=info msg="DCGM successfully initialized!"
time="2024-04-30T23:14:29Z" level=info msg="Collecting DCP Metrics"
time="2024-04-30T23:14:29Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-04-30T23:14:29Z" level=info msg="Initializing system entities of type: GPU"
time="2024-04-30T23:14:29Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-04-30T23:14:29Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-04-30T23:14:29Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-04-30T23:14:29Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-04-30T23:15:06Z" level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"

On the H100s, the docker logs show:

ubuntu@ip-10-1-22-213:~$ docker logs 01a9236f1495
time="2024-04-30T23:05:43Z" level=info msg="Starting dcgm-exporter"
time="2024-04-30T23:05:43Z" level=info msg="DCGM successfully initialized!"
time="2024-04-30T23:05:43Z" level=info msg="Collecting DCP Metrics"
time="2024-04-30T23:05:43Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcgm-golden-metrics.csv'"
time="2024-04-30T23:05:43Z" level=info msg="Initializing system entities of type: GPU"
time="2024-04-30T23:05:45Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-04-30T23:05:45Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-04-30T23:05:45Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-04-30T23:05:45Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-04-30T23:05:46Z" level=info msg="Pipeline starting"
time="2024-04-30T23:05:46Z" level=info msg="Starting webserver"
level=info ts=2024-04-30T23:05:46.033Z caller=tls_config.go:313 msg="Listening on" address=[::]:9400
level=info ts=2024-04-30T23:05:46.034Z caller=tls_config.go:316 msg="TLS is disabled." http2=false address=[::]:9400
nghtm commented 4 months ago

Reporting findings from today:

h100 nodes (8x GPU): no issues; all versions of DCGM exporter appear to be working.
a10 nodes (8x GPU): the older version 2.1.4-2.3.1-ubuntu20.04 works.

All versions above 3.1.6-3.1.3-ubuntu20.04 are failing; the docker logs show the following:

level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"
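
A sweep over image tags can be scripted roughly like this (a sketch, not the exact commands used; the tags are the ones mentioned above and the 90-second wait covers the ~1 minute window after which the fatal error appeared):

    for TAG in 2.1.4-2.3.1-ubuntu20.04 3.1.6-3.1.3-ubuntu20.04 3.3.5-3.4.0-ubuntu22.04; do
        echo "=== ${TAG} ==="
        CID=$(sudo docker run -d --gpus all --net host --cap-add SYS_ADMIN \
            nvcr.io/nvidia/k8s/dcgm-exporter:${TAG})
        sleep 90
        sudo docker logs --tail 3 "${CID}"      # look for the level=fatal line
        sudo docker rm -f "${CID}" > /dev/null
    done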
nghtm commented 4 months ago

Root cause determined: it is an issue with the OS (open-source) build of NVIDIA driver 535.161.08 on the g5.48xlarge (8x A10G) instances in combination with NVIDIA DCGM 3.3.5-3.4.0-ubuntu22.04.

We were able to run DCGM-Exporter by installing the proprietary driver 535.161.08, or by using 2.1.4-2.3.1-ubuntu20.04, but 3.3.5-3.4.0-ubuntu22.04 failed consistently with the OS driver on g5.48xlarge, which showed up as GSP errors in dmesg.
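
Two quick checks (not from the thread) to confirm which driver flavor is loaded and to surface the GSP errors; the exact strings reported can vary by driver release:

    cat /proc/driver/nvidia/version   # the open kernel modules identify themselves as "Open Kernel Module"
    modinfo -F license nvidia         # "Dual MIT/GPL" for the open modules, "NVIDIA" for the proprietary ones
    sudo dmesg | grep -i gsp          # GSP-RM timeouts/errors reported by the kernel driver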

Similar to the issue reported here: https://github.com/awslabs/amazon-eks-ami/issues/1523

Anyways, thanks for the help and quick responses

nvvfedorov commented 4 months ago

@nghtm Thank you for the update. I am closing the issue as solved.