NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

Profiling module failed to load #328

Open hkominos opened 1 month ago

hkominos commented 1 month ago

What is the version?

3.3.6-3.4.2-ubuntu22.04

What happened?

On my host, which has a number of MIG-backed GPUs, I have tried to make dcgm-exporter work, with only partial success.

I have tried multiple docker tags as well as multiple arguments.

On the host, the following packages are installed:

nvidia-mig-manager-0.5.1-1.x86_64
nvidia-container-toolkit-base-1.15.0-1.x86_64
libnvidia-container1-1.15.0-1.x86_64
libnvidia-container-tools-1.15.0-1.x86_64
nvidia-container-toolkit-1.15.0-1.x86_64
datacenter-gpu-manager-3.3.6-1.x86_64

Once the Docker container is spawned, it connects to the nv-hostengine running on the host.

However, on the host I can see the devices with nvidia-smi (https://pastebin.com/4rjWHMG4), but from within the container I cannot (https://pastebin.com/4rjWHMG4).

In the nv-hostengine logs I see a number of errors but few pointers as to where to start. Have I installed the correct binaries? When using MIG GPUs, is the --all flag enough? I would appreciate some input. Perhaps I am just spawning it incorrectly.

2024-05-21 14:34:39.463 ERROR [3218460:3218461] GetLatestSample returned No data is available for entityId 7 groupId 4 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_3-postmerge@2/dcgmlib/src/DcgmCacheManager.cpp:4528] [DcgmCacheManager::GetMultipleLatestSamples]                                                                                                                      
2024-05-21 14:34:39.463 ERROR [3218460:3218461] GetLatestSample returned Feature not supported for entityId 7 groupId 4 fieldId 449 [/workspaces/dcgm-rel_dcgm_3_3-postmerge@2/dcgmlib/src/DcgmCacheManager.cpp:4528] [DcgmCacheManager::GetMultipleLatestSamples]   

The full log is https://pastebin.com/heqXkeV2
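For anyone triaging a long hostengine log like this one, a minimal Python sketch follows that groups the `GetLatestSample` errors by `fieldId`, so it is easier to see which fields report "No data is available" and which report "Feature not supported". The regular expression is an assumption based only on the two lines quoted above, and `errors_by_field` is a hypothetical helper name, not part of DCGM.

```python
import re
from collections import defaultdict

# Two lines shortened from the log quoted above (source-file suffix dropped).
SAMPLE = """\
2024-05-21 14:34:39.463 ERROR [3218460:3218461] GetLatestSample returned No data is available for entityId 7 groupId 4 fieldId 230
2024-05-21 14:34:39.463 ERROR [3218460:3218461] GetLatestSample returned Feature not supported for entityId 7 groupId 4 fieldId 449
"""

# Assumed line shape, inferred from the quoted errors only.
PATTERN = re.compile(
    r"GetLatestSample returned (.+?) for entityId (\d+) groupId (\d+) fieldId (\d+)"
)


def errors_by_field(log_text):
    """Map each fieldId to the sorted list of distinct error reasons seen for it."""
    grouped = defaultdict(set)
    for reason, _entity, _group, field in PATTERN.findall(log_text):
        grouped[int(field)].add(reason)
    return {field: sorted(reasons) for field, reasons in grouped.items()}


if __name__ == "__main__":
    for field, reasons in sorted(errors_by_field(SAMPLE).items()):
        print(field, reasons)
```

The resulting field IDs can then be compared against the metrics listed in the CSV file passed with `-f`.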

Finally, I also see:

root@computegpu002:/# dcgmi modules -l  
+-----------+--------------------+--------------------------------------------------+
| List Modules                                                                      |
| Status: Success                                                                   |
+===========+====================+==================================================+
| Module ID | Name               | State                                            |
+-----------+--------------------+--------------------------------------------------+
| 0         | Core               | Loaded                                           |
| 1         | NvSwitch           | Loaded                                           |
| 2         | VGPU               | Not loaded                                       |
| 3         | Introspection      | Not loaded                                       |
| 4         | Health             | Not loaded                                       |
| 5         | Policy             | Not loaded                                       |
| 6         | Config             | Not loaded                                       |
| 7         | Diag               | Not loaded                                       |
| 8         | Profiling          | Failed to load                                   |
| 9         | SysMon             | Failed to load                                   |
+-----------+--------------------+--------------------------------------------------+
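To check this state from a script, a minimal Python sketch follows that picks the modules marked "Failed to load" out of the `dcgmi modules -l` table. It assumes the table layout shown above (pipe-separated rows with a numeric module ID); `failed_modules` is a hypothetical helper name, and the sample rows are copied from the output above.

```python
import re

# Rows copied from the `dcgmi modules -l` output above.
SAMPLE = """\
| Module ID | Name               | State                                            |
| 0         | Core               | Loaded                                           |
| 2         | VGPU               | Not loaded                                       |
| 8         | Profiling          | Failed to load                                   |
| 9         | SysMon             | Failed to load                                   |
"""

# A data row: numeric ID, single-word name, free-text state.
ROW = re.compile(r"^\|\s*(\d+)\s*\|\s*(\S+)\s*\|\s*(.*?)\s*\|\s*$")


def failed_modules(table_text):
    """Return (module_id, name) for every row whose state is 'Failed to load'."""
    failed = []
    for line in table_text.splitlines():
        match = ROW.match(line)
        if match and match.group(3) == "Failed to load":
            failed.append((int(match.group(1)), match.group(2)))
    return failed


if __name__ == "__main__":
    print(failed_modules(SAMPLE))
```

In practice the text would come from running `dcgmi modules -l` on the host and capturing its standard output.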

I would appreciate some pointers. Thank you

What did you expect to happen?

To be able to view my MIG-backed GPUs from within the Docker container, in order to get some details about their usage.

What is the GPU model?

An A100 80GB, which is split into multiple MIG-backed vGPUs.

What is the environment?

I am using the Docker container found at https://hub.docker.com/r/nvidia/dcgm-exporter/ on a Rocky Linux host.

How did you deploy the dcgm-exporter and what is the configuration?

Pull from Docker and run:

docker run -d --privileged --gpus all --net host -p 9400:9400 --name my-gpus --cap-add SYS_ADMIN -e NVIDIA_VISIBLE_DEVICES=all -e NVIDIA_MIG_CONFIG_DEVICES=all -e NVIDIA_MIG_MONITOR_DEVICES=all --mount type=bind,source=/root/gpuconfig,target=/test 10.10.10.10:80/nvidia/dcgm_exporter:3.3.6-3.4.2-ubuntu22.04 -r localhost:5555 -f /test/dcp-test.csv

How to reproduce the issue?

Install the NVIDIA driver, the toolkit, and then the gpu-manager binary. Spawn the container and run nvidia-smi on a host with GPUs.

Anything else we need to know?

No GPUs are seen from within the container: https://pastebin.com/EJNa5vRv

docker logs:

2024/05/21 13:47:39 maxprocs: Leaving GOMAXPROCS=128: CPU quota undefined
time="2024-05-21T13:47:39Z" level=info msg="Starting dcgm-exporter"
time="2024-05-21T13:47:39Z" level=info msg="Attemping to connect to remote hostengine at localhost:5555"
time="2024-05-21T13:47:39Z" level=info msg="DCGM successfully initialized!"
time="2024-05-21T13:47:39Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-05-21T13:47:39Z" level=info msg="Falling back to metric file '/test/dcp-ecmwf.csv'"
time="2024-05-21T13:47:39Z" level=warning msg="Skipping line 25 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2024-05-21T13:47:39Z" level=warning msg="Skipping line 26 ('DCGM_FI_PROF_SM_ACTIVE'): metric not enabled"
time="2024-05-21T13:47:39Z" level=warning msg="Skipping line 27 ('DCGM_FI_PROF_SM_OCCUPANCY'): metric not enabled"
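The "Skipping line" warnings name exactly which metrics the exporter dropped. A minimal Python sketch follows that extracts them from the `docker logs` output; it assumes the message wording shown above, and `skipped_metrics` is a hypothetical helper name, not part of dcgm-exporter.

```python
import re

# Warning lines copied from the docker logs above.
SAMPLE = """\
time="2024-05-21T13:47:39Z" level=warning msg="Skipping line 25 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2024-05-21T13:47:39Z" level=warning msg="Skipping line 26 ('DCGM_FI_PROF_SM_ACTIVE'): metric not enabled"
time="2024-05-21T13:47:39Z" level=warning msg="Skipping line 27 ('DCGM_FI_PROF_SM_OCCUPANCY'): metric not enabled"
"""

# Assumed message shape, inferred from the quoted warnings only.
SKIPPED = re.compile(r"Skipping line (\d+) \('([A-Z0-9_]+)'\): metric not enabled")


def skipped_metrics(log_text):
    """Return (csv_line_number, metric_name) for each skipped metric, in log order."""
    return [(int(number), name) for number, name in SKIPPED.findall(log_text)]


if __name__ == "__main__":
    for number, name in skipped_metrics(SAMPLE):
        print(number, name)
```

All three skipped metrics here carry the `DCGM_FI_PROF_` prefix, which matches the "Not collecting DCP metrics" message and the Profiling module's "Failed to load" state.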

nvvfedorov commented 1 month ago

@hkominos, thank you for reporting the issue. Please use a GitHub Gist instead of pastebin.com; unfortunately, links shared on Pastebin don't work for us.

hkominos commented 1 month ago

Thank you for your input @nvvfedorov . I have created a large GIST with all the logs here: https://gist.github.com/hkominos/adbbdf1501e4b65c309877d308e71214

nvvfedorov commented 1 month ago

@hkominos, can you show us the contents of /test/dcp-test.csv or /test/dcp-ecmwf.csv? We need to know the list of metrics that were enabled.

hkominos commented 1 month ago

Of course @nvvfedorov . I just commented out some lines from the dcp-metrics-included.csv https://gist.github.com/hkominos/b32957473070cc0e9c29fb7a1baf2613

xiaoyu1095 commented 1 month ago

I encountered the same issue in two different versions.

version 1:
chart: https://github.com/nvidia/dcgm-exporter 3.4.2
docker: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04

version 2:
chart: https://github.com/nvidia/dcgm-exporter 3.1.7
docker: nvcr.io/nvidia/k8s/dcgm-exporter:3.2.5-3.1.7-ubuntu20.04
