XuehaiPan / nvitop

An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
https://nvitop.readthedocs.io
Apache License 2.0

[BUG][exporter] Process metrics still exist when the process is gone #106

Closed: caotangdaiduong closed this issue 9 months ago

caotangdaiduong commented 9 months ago

Required prerequisites

What version of nvitop are you using?

1.3.1

Operating system and version

Ubuntu 20.04.4 LTS

NVIDIA driver version

510.47.03

NVIDIA-SMI

Wed Nov 22 16:23:39 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+

Python environment

3.8.10 (default, May 26 2023, 14:05:08) [GCC 9.4.0] linux
nvidia-ml-py==12.535.133
nvitop==1.3.1
nvitop-exporter==1.3.1

Problem description

nvitop-exporter caches metric values

Metric values are retained and not refreshed

Steps to Reproduce

The Python snippets (if any):

Command lines:

Traceback

No response

Logs

No response

Expected behavior

No response

Additional context

No response

XuehaiPan commented 9 months ago

> Metric values are retained and not refreshed

Hi @caotangdaiduong, did you set up a Prometheus service to scrape the latest metrics automatically?

caotangdaiduong commented 9 months ago

Currently I'm using cron to restart the service every minute. This may sound crazy, but it keeps the metrics completely accurate.

XuehaiPan commented 9 months ago

> I know nvitop's default interval is 1s, but I have set the interval option to different values such as 15s and 30s, and the result is still the same.

@caotangdaiduong I can see the metrics are updating on my side. I'm running watch --differences:

watch --differences 'curl -s http://127.0.0.1:8000/metrics'
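For a check that is easier to script than watching the terminal, two scrapes can be compared by series key. The following is a minimal sketch, not part of nvitop-exporter: it assumes the exporter serves the Prometheus text format at the URL used above, and the helper names (`scrape`, `parse_samples`, `diff_scrapes`) are hypothetical.

```python
import re
from urllib.request import urlopen

# One sample line of the Prometheus text format: name{labels} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)


def scrape(url='http://127.0.0.1:8000/metrics'):
    """Fetch the raw metrics text (URL taken from the curl command above)."""
    with urlopen(url) as response:
        return response.read().decode('utf-8')


def parse_samples(text):
    """Map each series key (metric name, raw label string) to its value string."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):  # skip blanks and HELP/TYPE comments
            continue
        match = SAMPLE_RE.match(line)
        if match:
            samples[(match['name'], match['labels'] or '')] = match['value']
    return samples


def diff_scrapes(before, after):
    """Report series that appeared, disappeared, or changed value between scrapes."""
    old, new = parse_samples(before), parse_samples(after)
    return {
        'added': sorted(new.keys() - old.keys()),
        'removed': sorted(old.keys() - new.keys()),
        'changed': sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

Calling `diff_scrapes(scrape(), scrape())` with a pause between the two scrapes separates the two questions in this thread: values for live processes show up under `'changed'`, while a series for an exited process that never shows up under `'removed'` is the stale key being reported.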

> This is similar to Pushgateway: it only updates the value of a key that already exists, and a new key gets its own new value. I think the same thing happens here with many different label values (in my case, every time the PID or index changes, a new key is created and the key for the old PID and index is still there).

The metrics for GPU processes are actively updated on my side.

I can confirm that when a GPU process is gone, its gauge keys still exist. Do you want these keys to be removed once the corresponding processes are gone?
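To illustrate why the keys linger, here is a minimal standard-library sketch of a labeled gauge. It is a toy model, not the nvitop-exporter code and not necessarily how the fix is implemented; `LabeledGauge` and `update` are hypothetical names. The `prometheus_client` library offers a comparable `remove()` method on labeled metrics.

```python
class LabeledGauge:
    """Toy stand-in for a labeled gauge: one stored value per label tuple."""

    def __init__(self):
        self._values = {}

    def set(self, labels, value):
        # Setting a value creates the series if it does not exist yet.
        self._values[labels] = value

    def remove(self, labels):
        # Dropping a series has to be requested explicitly.
        self._values.pop(labels, None)

    def series(self):
        return dict(self._values)


def update(gauge, live, prune):
    """Record one sample per live process; optionally drop series of gone PIDs.

    `live` maps PID -> current value (hypothetical input shape).
    """
    for pid, value in live.items():
        gauge.set((pid,), value)
    if prune:
        for labels in list(gauge.series()):
            if labels[0] not in live:
                gauge.remove(labels)


gauge = LabeledGauge()
update(gauge, {100: 1.0, 200: 2.0}, prune=False)
update(gauge, {200: 2.5, 300: 3.0}, prune=False)  # PID 100 exited, PID 300 started
print(sorted(gauge.series()))  # PID 100 is still listed with its last value

update(gauge, {200: 2.5, 300: 3.0}, prune=True)
print(sorted(gauge.series()))  # PID 100 is gone after pruning
```

Without the pruning step, a series is only ever created or overwritten, so the last value of an exited process is exported indefinitely. That matches the observation that both the old and the new PID appear in the output.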

XuehaiPan commented 9 months ago

>   • You will see that both the old and new PIDs exist when calling curl to the exporter

@caotangdaiduong I can confirm this and have opened PR #107 to resolve it. You can try it via:

python3 -m pip install "git+https://github.com/XuehaiPan/nvitop.git@exporter-remove-gone-process#egg=nvitop-exporter&subdirectory=nvitop-exporter"

caotangdaiduong commented 9 months ago

Hi @XuehaiPan

Thanks for your efforts. I tested it and it works as expected.