Cldfire / nvml-wrapper

Safe Rust wrapper for the NVIDIA Management Library
Apache License 2.0

process_utilization_stats failed with NOT_FOUND error, Ubuntu 22.04 #56

Open tubzby opened 8 months ago

tubzby commented 8 months ago
use nvml_wrapper::Nvml;

fn main() {
    let nvml = Nvml::init().unwrap();
    let device = nvml.device_by_index(0).unwrap();

    let st = device.process_utilization_stats(None).unwrap();
}

Running cargo run fails with this error:

thread 'main' panicked at src/main.rs:7:53:
called `Result::unwrap()` on an `Err` value: NotFound
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

My device:

Fri Mar 15 07:01:16 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080 Ti     Off |   00000000:01:00.0 Off |                  N/A |
|  0%   43C    P8             24W /  350W |       1MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

What's strange is that the first call to nvmlDeviceGetProcessUtilization, the one that retrieves the process count, returned 79 in my case, when it should have been 0.

Baughn commented 6 months ago

Some observations:

Baughn commented 6 months ago

Here's the relevant nvtop code. Looks pretty different: https://github.com/Syllo/nvtop/blob/0316ce19581c3d8543cf6aa312d1569c56ca754f/src/extract_gpuinfo_nvidia.c#L761

Baughn commented 6 months ago

Another observation: Processes appear to only be returned if they are running. An idle process doesn't end up in the array, unless it was non-idle very recently. This accounts for what happens if I set the timestamp -- it reduces the horizon.

It also means that swallowing the NotFound error (and returning []) should be a valid workaround.
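
A minimal sketch of that workaround, written against a stand-in error type so it compiles without a GPU or the crate. `StatsError` and `samples_or_empty` are hypothetical names introduced for illustration; with nvml-wrapper the variant to match on would presumably be `NvmlError::NotFound`, and the `Result` would come from `device.process_utilization_stats(None)`.

```rust
/// Stand-in for the wrapper's error type, for illustration only.
/// Assumption: the real code would match on `nvml_wrapper::error::NvmlError::NotFound`.
#[derive(Debug, PartialEq)]
enum StatsError {
    NotFound,
    Other(String),
}

/// Treats `NotFound` as "no recently active processes" and returns an empty
/// list. Every other error is passed through unchanged, so real failures
/// are still visible to the caller.
fn samples_or_empty<T>(result: Result<Vec<T>, StatsError>) -> Result<Vec<T>, StatsError> {
    match result {
        Err(StatsError::NotFound) => Ok(Vec::new()),
        other => other,
    }
}
```

Only the NotFound case is swallowed here. Matching on that one variant, rather than calling something like `unwrap_or_default()`, keeps driver or permission errors from being silently hidden.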