Open corrtia opened 3 weeks ago
I ran a dcgm container using nvcr.io/nvidia/cloud-native/dcgm:3.3.6-1-ubuntu22.04.
docker run --gpus all -p 5554:5555 nvcr.io/nvidia/cloud-native/dcgm:3.3.6-1-ubuntu22.04
I then ran the following command in the container, and the following error occurred:
dcgmi health --check -g 1
Error: Health watches not enabled. Please enable watches.
The GPU environment:
nvidia-smi
Fri Jun 28 09:14:18 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.4     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-PCIE-32GB           Off | 00000000:1A:00.0 Off |                    0 |
| N/A   32C    P0              23W / 250W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE-32GB           Off | 00000000:1E:00.0 Off |                    0 |
| N/A   32C    P0              24W / 250W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE-32GB           Off | 00000000:3D:00.0 Off |                    0 |
| N/A   32C    P0              24W / 250W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE-32GB           Off | 00000000:42:00.0 Off |                    0 |
| N/A   32C    P0              24W / 250W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+