BugRoger / nvidia-exporter

Prometheus Exporter for NVIDIA GPUs using NVML
Apache License 2.0

Failed to collect metrics: nvml: Not Supported #3

Open Cherishty opened 5 years ago

Cherishty commented 5 years ago

Hi @BugRoger

When starting the exporter in k8s, the log always says:

Failed to collect metrics: nvml: Not Supported

Below is the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.59                 Driver Version: 390.59                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000460E:00:00.0 Off |                    0 |
| N/A   37C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00006180:00:00.0 Off |                    0 |
| N/A   33C    P8    33W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

This error does not occur on another GPU machine that uses a GTX 1080.

Any clues or suggestions?

Cherishty commented 5 years ago

Additionally, I compared this with a similar gpu-exporter: it hits the same issue on Tesla cards, yet it still works.

It also seems NVIDIA has acknowledged this officially:

https://github.com/NVIDIA/nvidia-docker/issues/40 https://github.com/ComputationalRadiationPhysics/cuda_memtest/issues/16

So can we unblock this?

jackpgao commented 4 years ago

+1

nvml: Not Supported

auto-testing commented 4 years ago

+1 Tesla: 2019/10/25 11:24:19 Failed to collect metrics: nvml: Not Supported

| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|

GTX 1060 and 1070 work fine.

ashleyprimo commented 4 years ago

I will be submitting a PR shortly; to quickly explain the issue: it looks like not all metrics are supported via NVML on Tesla graphics cards (and likely other GPUs as well), but when a metric is unsupported the exporter returns early instead of continuing to collect the remaining metrics.

Example:

        fanSpeed, err := device.FanSpeed()
        if err != nil {
            // any single unsupported metric aborts the whole collection here
            return nil, err
        }

Instead of return nil, err we should just log (catch) the event and do something that does not interrupt the collection routine.
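
A minimal sketch of that "log and continue" idea, for illustration only: the nvmlDevice interface, deviceMetrics struct, and teslaLike fake below are stand-ins I made up, not the exporter's actual types or the real NVML binding.

    package main

    import (
    	"errors"
    	"log"
    )

    // nvmlDevice is a stand-in for whatever NVML binding the exporter uses;
    // only the error-handling pattern matters here.
    type nvmlDevice interface {
    	FanSpeed() (uint, error)
    	Temperature() (uint, error)
    }

    // deviceMetrics uses pointers so an unsupported reading can stay nil
    // instead of being exported as a misleading zero.
    type deviceMetrics struct {
    	FanSpeed    *uint
    	Temperature *uint
    }

    func collect(d nvmlDevice) deviceMetrics {
    	var m deviceMetrics

    	if v, err := d.FanSpeed(); err != nil {
    		// e.g. a Tesla K80 reports "nvml: Not Supported" here; log and
    		// keep going instead of aborting the whole scrape.
    		log.Printf("skipping fan speed: %v", err)
    	} else {
    		m.FanSpeed = &v
    	}

    	if v, err := d.Temperature(); err != nil {
    		log.Printf("skipping temperature: %v", err)
    	} else {
    		m.Temperature = &v
    	}

    	return m
    }

    // teslaLike simulates a GPU that does not report fan speed.
    type teslaLike struct{}

    func (teslaLike) FanSpeed() (uint, error)    { return 0, errors.New("nvml: Not Supported") }
    func (teslaLike) Temperature() (uint, error) { return 37, nil }

    func main() {
    	m := collect(teslaLike{})
    	if m.Temperature != nil {
    		log.Printf("temperature: %d", *m.Temperature)
    	}
    	if m.FanSpeed == nil {
    		log.Printf("fan speed not exported for this device")
    	}
    }

Leaving an unsupported reading nil (and simply not exporting that series) seems preferable to exporting a fake zero, since a zero fan speed or utilization would be indistinguishable from a real reading.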

So currently, as seen in your nvidia-smi output, anything that shows N/A would cause the above error and interrupt collection.