Closed BlueskyFR closed 1 year ago
Duplicate of #32, #63.
> Could it be something like 0.1 sec so that we can get a more accurate overview of what is happening on the GPU?

Hi @BlueskyFR, the latency of the NVML API calls is relatively high. I think it's meaningless to support small intervals like 0.1 second. If you want a fine-grained report of resource usage, maybe you should use a profiler instead.

You can watch the metrics for a specific process with the `<Enter>` key ("Watch metrics for a specific process (shortcut: Enter / Return)"). The metrics on the top row will refresh every 1/4 sec.

There is also `nvitop.ResourceMetricCollector`; see Resource Metric Collector for more information.

Thanks for your reply. Why are calls to NVML so slow? `nvidia-smi` supports a resolution up to a 10ms refresh rate, for instance.
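The idea behind a resource metric collector can be illustrated with a stdlib-only sketch (the names below are illustrative stand-ins, not nvitop's actual `ResourceMetricCollector` API): a background thread polls a metric at a fixed interval and aggregates the samples over the collection window.

```python
import statistics
import threading
import time


class MetricSampler:
    """Illustrative sampler: polls a metric callable at a fixed interval on a
    background thread and aggregates the samples. A hypothetical stand-in for
    the idea behind nvitop.ResourceMetricCollector, not its real interface."""

    def __init__(self, read_metric, interval=0.25):
        self.read_metric = read_metric  # e.g. an NVML utilization query
        self.interval = interval        # 1/4 s, like nvitop's top-row refresh
        self._samples = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self._samples.append(self.read_metric())
            self._stop.wait(self.interval)  # sleep, but wake early on stop

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()

    def mean(self):
        return statistics.fmean(self._samples)


# Fake "GPU utilization" source for demonstration purposes.
with MetricSampler(lambda: 42.0, interval=0.05) as sampler:
    time.sleep(0.2)
print(round(sampler.mean(), 1))  # every sample is 42.0, so the mean is 42.0
```

Averaging over the window is what lets a collector report meaningful numbers even though each individual NVML query is slow.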
> Why are calls to NVML so slow? `nvidia-smi` supports a resolution up to a 10ms refresh rate, for instance

@BlueskyFR `nvidia-smi` cannot achieve this.

- It depends on how many GPU devices are on board.
- If the persistence mode is disabled, the `nvidia-smi` command will take much more time (up to several seconds, e.g., 3 s) to do a single query.

We could "refresh" the "fake" results every 10 ms, but those results may have been queried seconds ago. They would not be accurate.

Here are some benchmark results from my side. You can try `hyperfine` on your machine to see the latency.

- Single NVIDIA 3090 GPU on WSL (persistence mode enabled):

  ```console
  $ hyperfine --warmup 50 --runs 200 nvidia-smi
  Benchmark 1: nvidia-smi
    Time (mean ± σ):     113.6 ms ±   8.4 ms    [User: 5.3 ms, System: 3.9 ms]
    Range (min … max):    98.4 ms … 141.4 ms    200 runs
  ```

- 8 x NVIDIA A100 GPUs on native Ubuntu (persistence mode enabled):

  ```console
  $ hyperfine --warmup 50 --runs 200 nvidia-smi
  Benchmark 1: nvidia-smi
    Time (mean ± σ):      1.920 s ±  0.417 s    [User: 0.007 s, System: 1.298 s]
    Range (min … max):    1.314 s …  4.250 s    200 runs
  ```

It takes about 2 seconds to do a single query. It cannot run under 10 ms.
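The measurement `hyperfine` performs can be approximated with a small stdlib-only helper that times repeated runs of any callable and reports mean ± σ (a sketch of the idea, not a replacement for `hyperfine`; the dummy workload is only a placeholder):

```python
import statistics
import time


def benchmark(func, warmup=5, runs=50):
    """Time `func` the way hyperfine does: discard warmup runs, then report
    the mean and standard deviation over the measured runs, in milliseconds."""
    for _ in range(warmup):
        func()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append((time.perf_counter() - start) * 1e3)
    return statistics.fmean(timings), statistics.stdev(timings)


# Dummy workload for illustration; substitute e.g.
# subprocess.run(["nvidia-smi"], stdout=subprocess.DEVNULL)
# to reproduce the nvidia-smi measurements above.
mean_ms, sigma_ms = benchmark(lambda: sum(range(10_000)))
print(f"Time (mean ± σ): {mean_ms:.3f} ms ± {sigma_ms:.3f} ms")
```

The warmup runs matter here for the same reason `--warmup 50` does: they let caches and driver state settle before measurement.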
You are maybe using it wrong 😊
You can see my post here for more details -> https://github.com/influxdata/telegraf/issues/8534#issue-761112264
Thanks for the reference. `nvitop` already uses sparse queries with `nvidia-ml-py` instead of a full query using `nvidia-smi`. But some operations are still slow, such as gathering process information, especially when the number of processes is relatively large (up to hundreds). Also, as I mentioned above, if you don't enable the persistence mode, your `nvidia-smi` queries will take much longer.
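To see why process gathering dominates, note that per-process lookups add up roughly linearly: with a fixed per-call latency, hundreds of processes multiply the query cost. A toy stdlib model (the millisecond figures are made-up assumptions for illustration, not measured NVML latencies):

```python
# Toy model: total query time grows linearly with the number of processes
# when each per-process lookup has a fixed cost.
PER_CALL_MS = 2.0   # assumed cost of one per-process lookup (illustrative)
BASE_MS = 10.0      # assumed cost of the per-device query (illustrative)


def total_query_ms(num_processes):
    """One base device query plus one lookup per running process."""
    return BASE_MS + PER_CALL_MS * num_processes


for n in (1, 10, 100, 500):
    print(f"{n:>3} processes -> {total_query_ms(n):.0f} ms per refresh")
```

Under these assumptions, a host with hundreds of GPU processes already spends on the order of a second per refresh, far above a 10 ms budget.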
So I think maybe it is more of a design problem? Maybe the same quantity of information cannot be achieved with nvidia-smi, but I doubt it.
In your example, you are not querying process information, which is the key feature of `nvitop`. If you want accurate metrics, I still think you should use a profiler instead. A day-to-day monitor should not run at a high sample frequency 24/7; that would lead to high power consumption. If you only want to monitor a process for a few minutes, why not use a profiler? It should be the more appropriate tool for your use case.
Could be a solution, what profiler do you have in mind for instance?
@BlueskyFR That depends on your use case, because profilers need in-process injection to add hooks that record kernel times. This may require users to update their code. If you are using PyTorch, you may try `torch.profiler.profile` (pytorch/kineto). It can collect fine-grained metrics and also comes with a web-based GUI integration. You may also try NVIDIA Nsight Systems, a profiling tool from NVIDIA.
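The "in-process injection" point can be shown with Python's stdlib `cProfile`, which hooks every function call inside the process, unlike an external monitor that only polls from outside. This is a generic analogue of the idea, not the PyTorch/kineto profiler (which additionally records GPU kernel times):

```python
import cProfile
import io
import pstats


def workload():
    """Stand-in for a training step; a GPU profiler would also record kernels."""
    return sum(i * i for i in range(100_000))


profiler = cProfile.Profile()
profiler.enable()   # in-process hook: every Python call is now recorded
workload()
profiler.disable()

# Render the top entries of the call report into a string.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print("workload" in report)  # the profiled function appears in the report
```

Because the hook lives inside the process, the profiler sees every call as it happens; no polling interval is involved, which is why profilers rather than monitors are the right tool for fine-grained measurement.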
### Required prerequisites

### Motivation

I see the current minimum refresh rate is 1 second. Could it be something like 0.1 sec so that we can get a more accurate overview of what is happening on the GPU?

### Solution

-

### Alternatives

-

### Additional context

-