XuehaiPan / nvitop

An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
https://nvitop.readthedocs.io
Apache License 2.0

[BUG] Prometheus connection refused #121

Closed FernandezR closed 6 months ago

FernandezR commented 6 months ago

Required prerequisites

What version of nvitop are you using?

1.3.2

Operating system and version

Ubuntu 22.04

NVIDIA driver version

535.154.05

NVIDIA-SMI

Tue Feb 20 20:20:41 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2060        Off | 00000000:01:00.0 Off |                  N/A |
| N/A   49C    P0              37W / 115W |    431MiB /  6144MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1315      G   /usr/lib/xorg/Xorg                            4MiB |
|    0   N/A  N/A   2126770      C   ...lib/plexmediaserver/Plex Transcoder      422MiB |
+---------------------------------------------------------------------------------------+

Python environment

3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] linux
nvidia-ml-py==12.535.133
nvitop==1.3.2
nvitop-exporter==1.3.2

Problem description

I am trying to use nvitop-exporter with Prometheus, but Prometheus keeps getting "connection refused" when it tries to connect to the exporter.

Steps to Reproduce

nvitop-exporter

Prometheus config

  - job_name: 'nvitop_exporter'
    static_configs:
      - targets: ['localhost:8000']

Traceback

No response

Logs

No response

Expected behavior

I expected prometheus to be able to connect on port 8000 to nvitop-exporter.

Additional context

nvitop-exporter shows no errors and says it has connected on port 8000.

XuehaiPan commented 6 months ago

Hi @FernandezR, have you ever tried to use another bind address?

nvitop-exporter --bind-address 0.0.0.0 --port 8000

FernandezR commented 6 months ago

I tried it, but it still doesn't work. Does it work for you?

Is my scrape config for prometheus incorrect?

Here is the command and output:

nvitop-exporter --bind-address 0.0.0.0 --port 8008
INFO: Found 1 device(s).
INFO: GPU 0: NVIDIA GeForce RTX 2060
INFO: Start the exporter on [host_ip] at http://0.0.0.0:8008/metrics.

Scrape config:

This is the error Prometheus shows:

Get "http://0.0.0.0:8008/metrics": dial tcp 0.0.0.0:8008: connect: connection refused
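
For readers hitting the same error: the `dial tcp 0.0.0.0:8008` part suggests the scrape target itself was set to 0.0.0.0. That address is a bind wildcard ("listen on all interfaces"), not a destination that identifies the host. With Docker's default bridge networking, `localhost` inside the Prometheus container also refers to the container, not the machine running the exporter. A quick way to see which address actually reaches the exporter is a plain TCP connect, run from wherever Prometheus runs. The sketch below is only an illustration; `can_connect` is a hypothetical helper, not part of nvitop or Prometheus.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds.

    Hypothetical helper for illustration only. Any failure (connection
    refused, timeout, name resolution error) is reported as False.
    """
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # ConnectionRefusedError, socket.timeout and socket.gaierror
        # are all subclasses of OSError.
        return False
```

For example, calling `can_connect("127.0.0.1", 8008)` on the host and then again from inside the Prometheus container (with the host's address in place of 127.0.0.1) would show whether the exporter is reachable from each side. The port 8008 is taken from the command above; the host address to try depends on your Docker network setup.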

FernandezR commented 6 months ago

Also, I have Prometheus running in Docker, but it has no issue collecting stats from other exporters.

I can curl the metrics when the exporter runs locally, but it doesn't work when I wrap it in a Docker image: https://github.com/FernandezR/nvitop-Exporter-Docker

FernandezR commented 6 months ago

It looks like the bind address was causing the issue. I think the requests from Prometheus are coming from the Docker IP address, which is being refused. I need to test a few things later to confirm that I have resolved the issue.
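
The diagnosis above comes down to what a given bind address allows: a server bound to a loopback address accepts connections only from the same network namespace, so a request arriving from a Docker bridge IP is refused, while a wildcard bind accepts connections on every interface. A minimal sketch of that distinction, using Python's standard `ipaddress` module; `bind_scope` and its return labels are hypothetical and only illustrate the idea, they are not part of nvitop-exporter.

```python
import ipaddress


def bind_scope(bind_address: str) -> str:
    """Classify who can reach a server bound to `bind_address`.

    Hypothetical helper for illustration; the labels are made up.
    """
    ip = ipaddress.ip_address(bind_address)
    if ip.is_loopback:
        # e.g. 127.0.0.1 or ::1: reachable only from the same host/namespace.
        return "loopback-only"
    if ip.is_unspecified:
        # e.g. 0.0.0.0 or :: : the wildcard, listens on all interfaces.
        return "all-interfaces"
    # Any other address: listens only on the interface that owns it.
    return "single-interface"
```

Under this classification, `--bind-address 0.0.0.0` falls in the "all-interfaces" case, which is why switching the exporter to it can let a containerized Prometheus connect, provided the scrape target points at an address of the host rather than at 0.0.0.0 or the container's own localhost.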