The first set of numbers in the DCGM-Exporter version corresponds to the DCGM library version used in the container and in testing (3.3.5 in your case). The second set of numbers (3.4.0) corresponds to the DCGM-Exporter version itself. However, DCGM follows semver compatibility guidelines, so any 3.x version should be compatible.
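For example, with dcgmi reporting 3.3.5 on the host, a tag like the one below pairs DCGM 3.3.5 with dcgm-exporter 3.4.0 (image path as published on NGC; any other 3.x DCGM on the host should remain compatible):
nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04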
Thank you for the response, helpful info on versions. :-)
When I try running this container with DCGM_EXPORTER_VERSION=3.3.5-3.4.0-ubuntu22.04
and dcgmi -v reporting 3.3.5,
it fails and causes nvidia-smi to throw errors on GPU 0. Prior to running the container, nvidia-smi showed all GPUs as healthy. I examined the nvidia-bug-report output and found the following message:
Apr 30 21:13:04 ip-10-1-5-148 dockerd[10261]: time="2024-04-30T21:13:04.829815111Z" level=error msg="restartmanger wait error: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: detection error: nvml error: unknown error: unknown"
For GPU 0, which shows ERR! in nvidia-smi, the NVSMI log shows:
==============NVSMI LOG==============
Timestamp : Tue Apr 30 21:16:21 2024
Driver Version : 535.161.08
CUDA Version : 12.2
Attached GPUs : 8
GPU 00000000:00:16.0
Product Name : NVIDIA A10G
Product Brand : Unknown Error
Product Architecture : Ampere
Display Mode : N/A
Display Active : N/A
Persistence Mode : Enabled
Addressing Mode : None
MIG Mode
Current : Unknown Error
Pending : Unknown Error
Accounting Mode : N/A
Accounting Mode Buffer Size : N/A
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1652222014738
GPU UUID : Unknown Error
Minor Number : 0
VBIOS Version : Unknown Error
MultiGPU Board : N/A
Board ID : N/A
Board Part Number : 900-2G133-A840-100
GPU Part Number : 2237-892-A1
FRU Part Number : N/A
Module ID : Unknown Error
Inforom Version
Image Version : N/A
OEM Object : N/A
ECC Object : N/A
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 535.161.08
GPU Virtualization Mode
Virtualization Mode : N/A
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : N/A
Drain and Reset Recommended : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x00
Device : 0x16
Domain : 0x0000
Device Id : 0x223710DE
Bus Id : 00000000:00:16.0
Sub System Id : 0x152F10DE
GPU Link Info
PCIe Generation
Max : N/A
Current : N/A
Device Current : N/A
Device Max : N/A
Host Max : N/A
Link Width
Max : N/A
Current : N/A
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : Unknown Error
Replay Number Rollovers : Unknown Error
Tx Throughput : Unknown Error
Rx Throughput : Unknown Error
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : Unknown Error
Performance State : Unknown Error
Clocks Event Reasons : N/A
Sparse Operation Mode : Unknown Error
FB Memory Usage
Total : 23028 MiB
Reserved : 512 MiB
Used : 0 MiB
Free : 22515 MiB
BAR1 Memory Usage
Total : N/A
Used : N/A
Free : N/A
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : N/A
Memory : N/A
Encoder : N/A
Decoder : N/A
JPEG : N/A
OFA : N/A
Encoder Stats
Active Sessions : N/A
Average FPS : N/A
Average Latency : N/A
FBC Stats
Active Sessions : N/A
Average FPS : N/A
Average Latency : N/A
ECC Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable Parity : N/A
SRAM Uncorrectable SEC-DED : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable Parity : N/A
SRAM Uncorrectable SEC-DED : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
SRAM Threshold Exceeded : N/A
Aggregate Uncorrectable SRAM Sources
SRAM L2 : N/A
SRAM SM : N/A
SRAM Microcontroller : N/A
SRAM PCIE : N/A
SRAM Other : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : Unknown Error
Temperature
GPU Current Temp : Unknown Error
GPU T.Limit Temp : Unknown Error
GPU Shutdown T.Limit Temp : Unknown Error
GPU Slowdown T.Limit Temp : Unknown Error
GPU Max Operating T.Limit Temp : 0 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating T.Limit Temp : Unknown Error
GPU Power Readings
Power Draw : N/A
Current Power Limit : 670166.31 W
Requested Power Limit : 0.00 W
Default Power Limit : Unknown Error
Min Power Limit : Unknown Error
Max Power Limit : Unknown Error
Module Power Readings
Power Draw : Unknown Error
Current Power Limit : Unknown Error
Requested Power Limit : 0.00 W
Default Power Limit : Unknown Error
Min Power Limit : Unknown Error
Max Power Limit : Unknown Error
Clocks
Graphics : N/A
SM : N/A
Memory : N/A
Video : N/A
Applications Clocks
Graphics : Unknown Error
Memory : Unknown Error
Default Applications Clocks
Graphics : Unknown Error
Memory : Unknown Error
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : N/A
SM : N/A
Memory : N/A
Video : N/A
Max Customer Boost Clocks
Graphics : Unknown Error
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : Unknown Error
Fabric
State : N/A
Status : N/A
Processes : None
You need to install and configure the NVIDIA Container Toolkit. It seems that it is not configured correctly, which is why you see the error:
Apr 30 21:13:04 ip-10-1-5-148 dockerd[10261]: time="2024-04-30T21:13:04.829815111Z" level=error msg="restartmanger wait error: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: detection error: nvml error: unknown error: unknown"
Thanks for the response.
nvidia-container-toolkit is installed.
ubuntu@ip-10-1-5-148:/var/log$ dpkg -l | grep nvidia-container-toolkit
ii nvidia-container-toolkit 1.15.0-1 amd64 NVIDIA Container toolkit
ii nvidia-container-toolkit-base 1.15.0-1 amd64 NVIDIA Container Toolkit Base
ubuntu@ip-10-1-5-148:/var/log$ cat /etc/nvidia-container-runtime/config.toml
#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"
[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc"]
[nvidia-container-runtime.modes]
Sounds like I will need to debug this further. I will report back if I determine a root cause.
@nghtm, try to run the sample workload as suggested here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/sample-workload.html#running-a-sample-workload-with-docker. This will tell us whether the NVIDIA runtime is configured correctly or not.
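For reference, the sample workload on that page is essentially running nvidia-smi through the NVIDIA runtime, along these lines (adapted from the linked documentation):
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
If this prints the usual nvidia-smi table, the runtime hook is wired up correctly.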
We are installing nvidia-container-toolkit on the node via this script:
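(The script itself is not reproduced above. For reference, a typical installation on Ubuntu follows the official toolkit steps, roughly:
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
The exact script in use here may differ.)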
The docker configuration defaults to:
{
"data-root": "/opt/dlami/nvme/docker/data-root"
}
But I can typically run NVIDIA commands via Docker with this configuration. For example, sudo docker run --rm --gpus all ubuntu nvidia-smi
works.
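Another quick sanity check (a generic Docker command, not specific to this setup) is to confirm that the nvidia runtime is registered with the Docker daemon:
docker info | grep -i runtimes
The output should list nvidia alongside runc when the toolkit has configured Docker.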
However, when I try launching the dcgm-exporter container and following the docker logs, it fails after about 1 minute:
docker logs 92c05c0f81ba
time="2024-04-30T22:16:32Z" level=info msg="Starting dcgm-exporter"
time="2024-04-30T22:16:32Z" level=info msg="DCGM successfully initialized!"
time="2024-04-30T22:16:33Z" level=info msg="Collecting DCP Metrics"
time="2024-04-30T22:16:33Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcgm-golden-metrics.csv'"
time="2024-04-30T22:16:33Z" level=info msg="Initializing system entities of type: GPU"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-04-30T22:17:15Z" level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"
Trying to go back to the base dcgm-exporter container, which uses /etc/dcgm-exporter/dcp-metrics-included.csv
instead of the custom CSV file I have written, to see if that fixes the container.
sudo docker run -d --rm \
--gpus all \
--net host \
--cap-add SYS_ADMIN \
nvcr.io/nvidia/k8s/dcgm-exporter:${DCGM_EXPORTER_VERSION}-ubuntu20.04 \
-f /etc/dcgm-exporter/dcp-metrics-included.csv
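If the container stays up, the exporter's default scrape endpoint (port 9400, as visible later in the H100 logs) can be verified with something like:
curl localhost:9400/metrics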
For reference, this is the install script for dcgm-exporter which has been causing the container failures on g5.48xlarge (A10 GPUs).
It seems to work without issues on H100s, so perhaps some of the custom metrics are not available on A10s (just a hypothesis).
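One way to test that hypothesis (assuming dcgmi is available on the host) is to list the profiling/DCP fields the GPU actually supports and compare them with the custom CSV, with something along the lines of:
dcgmi profile --list -i 0
Any metric in the CSV that is not in this list would have to be dropped for those GPUs.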
Repeated the error trying to run the container on A10 GPUs, but it works on H100 GPUs.
On the A10s, the docker logs show:
ubuntu@ip-10-1-5-148:~$ docker logs ca88122482d5
time="2024-04-30T23:14:28Z" level=info msg="Starting dcgm-exporter"
time="2024-04-30T23:14:28Z" level=info msg="DCGM successfully initialized!"
time="2024-04-30T23:14:29Z" level=info msg="Collecting DCP Metrics"
time="2024-04-30T23:14:29Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-04-30T23:14:29Z" level=info msg="Initializing system entities of type: GPU"
time="2024-04-30T23:14:29Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-04-30T23:14:29Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-04-30T23:14:29Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-04-30T23:14:29Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-04-30T23:15:06Z" level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"
On the H100s, the docker logs show:
ubuntu@ip-10-1-22-213:~$ docker logs 01a9236f1495
time="2024-04-30T23:05:43Z" level=info msg="Starting dcgm-exporter"
time="2024-04-30T23:05:43Z" level=info msg="DCGM successfully initialized!"
time="2024-04-30T23:05:43Z" level=info msg="Collecting DCP Metrics"
time="2024-04-30T23:05:43Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcgm-golden-metrics.csv'"
time="2024-04-30T23:05:43Z" level=info msg="Initializing system entities of type: GPU"
time="2024-04-30T23:05:45Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-04-30T23:05:45Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-04-30T23:05:45Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-04-30T23:05:45Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-04-30T23:05:46Z" level=info msg="Pipeline starting"
time="2024-04-30T23:05:46Z" level=info msg="Starting webserver"
level=info ts=2024-04-30T23:05:46.033Z caller=tls_config.go:313 msg="Listening on" address=[::]:9400
level=info ts=2024-04-30T23:05:46.034Z caller=tls_config.go:316 msg="TLS is disabled." http2=false address=[::]:9400
Reporting findings from today:
- H100 nodes (8x GPU): no issues; all versions of DCGM-Exporter appear to be working
- A10 nodes (8x GPU): the older version 2.1.4-2.3.1-ubuntu20.04 works, but all versions above 3.1.6-3.1.3-ubuntu20.04 are failing; docker logs show the following:
level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"
Root cause determined: it is an issue between the OS-packaged NVIDIA driver 535.161.08 on g5.48xlarge (8x A10) instances and NVIDIA DCGM 3.3.5-3.4.0-ubuntu22.04.
We were able to run DCGM-Exporter by installing the proprietary driver 535.161.08 or by using the 2.1.4-2.3.1-ubuntu20.04 image, but 3.3.5-3.4.0-ubuntu22.04 failed consistently with the OS driver on g5.48xlarge, which showed up as GSP errors in dmesg.
Similar to the issue reported here: https://github.com/awslabs/amazon-eks-ami/issues/1523
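For anyone hitting the same failure, the GSP errors mentioned above can be spotted on the host with a generic dmesg filter such as:
sudo dmesg | grep -iE 'gsp|xid|nvrm'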
Anyways, thanks for the help and quick responses
@nghtm Thank you for the update. I am closing the issue as solved.
Hi,
I am hoping to understand the difference between the dcgmi -v version and the version of dcgm-exporter which should be used. I want to understand what version of dcgm-exporter I should specify for my Docker container. When I run the following, I see dcgmi version = 3.3.5.
When I create my Docker container, what version should I specify?