BugRoger / nvidia-exporter

Prometheus Exporter for NVIDIA GPUs using NVML
Apache License 2.0

Failed to collect metrics: could not load NVML library #1

Open zh168654 opened 6 years ago

zh168654 commented 6 years ago

This is my deployment:

apiVersion: apps/v1beta1
kind: Deployment

metadata:
  name: nvidia-exporter
  namespace: monitoring
spec:
  replicas: 1
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: nvidia-exporter
    spec:
      containers:
        - name: nvidia-exporter
          securityContext:
            privileged: true
          image: bugroger/nvidia-exporter:latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 9401
          volumeMounts:
            - mountPath: /usr/local/nvidia
              name: nvidia
      volumes:
        - name: nvidia
          hostPath:
            path: /home/zy/cuda

When I exec into the nvidia-exporter container and run

ls /usr/local/nvidia/lib64

libnvidia-ml.so.1 is there. But the container logs always show:

Failed to collect metrics: could not load NVML library
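The exporter's message does not say why the load failed. One way to see the dynamic loader's own error is to try opening the library directly. The snippet below is a minimal sketch, assuming Python is available in a container with the same base image and volume mounts; `try_load` is a hypothetical helper name, not part of the exporter.

```python
import ctypes


def try_load(name="libnvidia-ml.so.1"):
    """Try to dlopen a shared library; return (loaded, detail).

    Hypothetical diagnostic helper, not part of nvidia-exporter.
    """
    try:
        ctypes.CDLL(name)
    except OSError as exc:
        # The loader's own message, e.g. file not found or an unresolved symbol.
        return False, str(exc)
    return True, "loaded"


if __name__ == "__main__":
    # A bare name is resolved through the loader's search path; pass a full
    # path to test one specific file instead.
    for candidate in ("libnvidia-ml.so.1",
                      "/usr/local/nvidia/lib64/libnvidia-ml.so.1"):
        ok, detail = try_load(candidate)
        print(f"{candidate}: {'OK' if ok else 'FAILED'} ({detail})")
```

If the bare name fails but the full path loads, the directory is probably missing from the loader's search path. If both fail, the detail string should indicate whether the file is missing or cannot be linked.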

Cherishty commented 5 years ago

@zh168654 have you found any workaround or clues? I am facing a similar error, which says:

Failed to collect metrics: nvml: Not Supported

My driver version is 390.59 and the GPU is a Tesla K80.

This error does NOT occur in another environment whose GPU is a GTX 1080.

SjhZju commented 5 years ago

Hi, I have the same problem. I think this is why the exporter cannot get metrics. My driver version is 390.48, with two GTX 980 cards. The server OS is Ubuntu 16.04.

bmerry commented 2 years ago

I'm running into the same problem. I suspect it's because the Docker image is built with Alpine (and hence musl libc) while NVIDIA's NVML library (libnvidia-ml.so) depends on glibc.
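A quick way to check that suspicion is to ask which C library the container environment uses. This is a best-effort sketch, assuming Python can be run in an image built from the same base; `detect_libc` is a hypothetical helper name, and it reports the libc that the Python interpreter itself is linked against.

```python
import glob
import platform


def detect_libc():
    """Best-effort guess at the C library of the current environment.

    Hypothetical helper: returns "glibc", "musl", or "unknown".
    """
    name, _version = platform.libc_ver()
    if name == "glibc":
        return "glibc"
    # musl installs its dynamic loader as /lib/ld-musl-<arch>.so.1
    if glob.glob("/lib/ld-musl-*.so.1"):
        return "musl"
    return "unknown"


if __name__ == "__main__":
    print("C library:", detect_libc())
```

A result of "musl" inside the exporter's base image would be consistent with the explanation above, though it does not by itself prove that this is why NVML fails to load.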