A description of the problem.
The dcgmi policy action "Reset GPU" has no effect: the GPU is not reset when the violation condition is triggered.
Steps to reproduce the issue.
root@dcgm-image-4090:~# dcgmi policy -g 0 --get -v
Policy information
+-----------------------------+------------------------------------------------+
| Policy Information |
| GPU ID: 0 |
+=============================+================================================+
| Violation conditions | XID error detected. |
| Isolation mode | Manual |
| Action on violation | Reset GPU |
| Validation after action | System Validation (Short) |
| Validation failure action | None |
+-----------------------------+------------------------------------------------+
root@dcgm-image-4090:~#
root@dcgm-image-4090:~# dcgmi test --inject --gpuid 0 -f 230 -v 1
Successfully injected field info.
root@dcgm-image-4090:~#
root@dcgm-image-4090:~#
root@dcgm-image-4090:~# dcgmi test --inject --gpuid 0 -f 230 -v 1
Successfully injected field info.
root@dcgm-image-4090:~#
However, dmesg shows no sign that a GPU reset took place.
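For anyone trying to reproduce this, the check above can be sketched as a small script. It is only a sketch under assumptions: `gpu_kernel_events` is a hypothetical helper (not part of DCGM), the filter assumes the NVIDIA kernel driver tags its messages with `NVRM`, and the 5-second wait is an arbitrary guess at how long the policy might need to act. The injection command is the same one used in the steps above.

```shell
#!/bin/sh
# Hypothetical helper (not part of DCGM): keep only kernel log lines
# written by the NVIDIA driver ("NVRM") or mentioning an XID.
gpu_kernel_events() {
    grep -i -E 'NVRM|Xid' || true   # "|| true": finding no match is not an error
}

# Guarded so the sketch does nothing on machines without DCGM installed.
if command -v dcgmi >/dev/null 2>&1; then
    # Same injection as in the report: field 230, value 1, on GPU 0.
    dcgmi test --inject --gpuid 0 -f 230 -v 1
    sleep 5   # arbitrary wait, assumed long enough for the policy to react
    dmesg | gpu_kernel_events | tail -n 20
fi
```

If the last command prints nothing new after the injection, that matches the behavior described in this report.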
Relevant configuration information
Bare-metal environment.
root@dcgm-image-4090:~# nvidia-smi
Wed Aug 28 03:45:20 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:00:05.0 Off | Off |
| 37% 29C P8 6W / 450W | 0MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
root@dcgm-image-4090:~#
root@dcgm-image-4090:~# dcgmi -v
Version : 3.3.7
Build ID : 26
Build Date : 2024-07-09
Build Type : Release
Commit ID : 105620196e46a7ef2f99a1ce3e69a5d12af1e845
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : c1b74febf52d45d29ae956b78c091857
Hostengine build info:
Version : 3.3.7
Build ID : 26
Build Date : 2024-07-09
Build Type : Release
Commit ID : 105620196e46a7ef2f99a1ce3e69a5d12af1e845
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : c1b74febf52d45d29ae956b78c091857
root@dcgm-image-4090:~#