NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

dcgmi policy action "Reset GPU" has no effect #185

Open mr-j-1992 opened 1 month ago

mr-j-1992 commented 1 month ago

A description of the problem.

The dcgmi policy action "Reset GPU" has no effect.

Steps to reproduce the issue.

root@dcgm-image-4090:~# dcgmi policy -g 0 --get -v
Policy information
+-----------------------------+------------------------------------------------+
| Policy Information                                                           |
| GPU ID: 0                                                                    |
+=============================+================================================+
| Violation conditions        | XID error detected.                            |
| Isolation mode              | Manual                                         |
| Action on violation         | Reset GPU                                      |
| Validation after action     | System Validation (Short)                      |
| Validation failure action   | None                                           |
+-----------------------------+------------------------------------------------+
root@dcgm-image-4090:~#
root@dcgm-image-4090:~# dcgmi test --inject --gpuid 0 -f 230 -v 1
Successfully injected field info.
root@dcgm-image-4090:~#
root@dcgm-image-4090:~#
root@dcgm-image-4090:~# dcgmi test --inject --gpuid 0 -f 230 -v 1
Successfully injected field info.
root@dcgm-image-4090:~#

However, dmesg shows no evidence that the GPU was reset.
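When repeating the steps above, it can help to confirm programmatically that the policy still lists "Reset GPU" as its action before each injection. The following is a minimal sketch, not part of DCGM: `parse_policy_table` and `read_policy` are hypothetical helper names, the sample text is copied from the transcript in this report, and the parsing assumes the table layout shown there.

```python
import subprocess

# Sample copied from the `dcgmi policy -g 0 --get -v` output in this report.
SAMPLE = """\
Policy information
+-----------------------------+------------------------------------------------+
| Policy Information                                                           |
| GPU ID: 0                                                                    |
+=============================+================================================+
| Violation conditions        | XID error detected.                            |
| Isolation mode              | Manual                                         |
| Action on violation         | Reset GPU                                      |
| Validation after action     | System Validation (Short)                      |
| Validation failure action   | None                                           |
+-----------------------------+------------------------------------------------+
"""


def parse_policy_table(text):
    """Return {label: value} for every two-column row of the policy table."""
    rows = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders, titles, and shell prompts
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Full-width rows such as "GPU ID: 0" yield a single cell and are ignored.
        if len(cells) == 2 and cells[0] and cells[1]:
            rows[cells[0]] = cells[1]
    return rows


def read_policy(group_id=0):
    """Run the same query used in this report (assumes dcgmi is on PATH)."""
    result = subprocess.run(
        ["dcgmi", "policy", "-g", str(group_id), "--get", "-v"],
        capture_output=True, text=True, check=True,
    )
    return parse_policy_table(result.stdout)


if __name__ == "__main__":
    policy = parse_policy_table(SAMPLE)
    print(policy["Action on violation"])
```

On a host with DCGM installed, `read_policy(0)` would replace the sample text with live output; the sketch only verifies the configured action and does not detect whether a reset actually occurred.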

Relevant configuration information: bare-metal environment.

root@dcgm-image-4090:~# nvidia-smi
Wed Aug 28 03:45:20 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:00:05.0 Off |                  Off |
| 37%   29C    P8     6W / 450W |      0MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@dcgm-image-4090:~#

root@dcgm-image-4090:~# dcgmi -v
Version : 3.3.7
Build ID : 26
Build Date : 2024-07-09
Build Type : Release
Commit ID : 105620196e46a7ef2f99a1ce3e69a5d12af1e845
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : c1b74febf52d45d29ae956b78c091857

Hostengine build info:
Version : 3.3.7
Build ID : 26
Build Date : 2024-07-09
Build Type : Release
Commit ID : 105620196e46a7ef2f99a1ce3e69a5d12af1e845
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : c1b74febf52d45d29ae956b78c091857
root@dcgm-image-4090:~#