Syllo / nvtop

GPU & Accelerator process monitoring for AMD, Apple, Huawei, Intel, NVIDIA and Qualcomm

The GPU usage shows huge % usage #110

Closed shmalex closed 3 years ago

shmalex commented 3 years ago

Hi, found this issue today.

The GPU usage jumps to huge % number (screenshot below).

My System: Ubuntu 18.04.5 LTS nvtop version 1.2.0

Screenshot from 2021-06-07 17-56-04

Please let me know what additional info would be helpful.

P.S. Thanks for the tool!

XuehaiPan commented 3 years ago

It may be caused by illegal memory access (#107). What's your driver version (nvidia-smi)? Maybe you can try out the latest version (built from source).

XuehaiPan commented 3 years ago

As an alternative, you can try nvitop (written in Python):

sudo apt-get update
sudo apt-get install python3-dev python3-pip
pip3 install --user nvitop  # Or `sudo pip3 install nvitop`
nvitop -m
shmalex commented 3 years ago
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:03:00.0  On |                  N/A |
|  0%   44C    P8    21W / 300W |    733MiB / 11018MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2113      G   /usr/lib/xorg/Xorg                 18MiB |
|    0   N/A  N/A      2284      G   /usr/bin/gnome-shell               72MiB |
|    0   N/A  N/A      2811      G   /usr/lib/xorg/Xorg                364MiB |
|    0   N/A  N/A      2947      G   /usr/bin/gnome-shell               64MiB |
|    0   N/A  N/A      3411      G   ...AAAAAAAA== --shared-files       29MiB |
|    0   N/A  N/A      3692      G   ...AAAAAAAAA= --shared-files       60MiB |
|    0   N/A  N/A      4158      G   ...AAAAAAAA== --shared-files       14MiB |
|    0   N/A  N/A     23484      G   whatsie                            14MiB |
|    0   N/A  N/A     26046      G   ...AAAAAAAAA= --shared-files       84MiB |
+-----------------------------------------------------------------------------+
Syllo commented 3 years ago

Hi,

I couldn't find any reason in my code why such a huge value would be displayed. I checked as far back as driver 390.87 from the year 2018 and the function had the same prototype, so it doesn't seem to be the same problem as the one encountered in #107.

The value is suspicious though: it is the maximum unsigned value UINT_MAX, which makes me think that either a computation in the driver returns -1 and wraps around to the max, or the information is not available and defaults to the maximum to indicate an error.
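For illustration, a tiny standalone check shows the wrap-around (assuming the usual 32-bit unsigned int, so UINT_MAX is 4294967295):

#include <limits.h>
#include <stdio.h>

int main(void) {
  // A -1 stored into an unsigned field wraps around to UINT_MAX,
  // which is the kind of huge percentage seen in the screenshot.
  unsigned int u = (unsigned int)-1;
  printf("%u (== UINT_MAX? %d)\n", u, u == UINT_MAX);
  return 0;
}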

Could you please check if the patch in the branch fix_gpu_rate does the trick?

To do so:

git clone https://github.com/Syllo/nvtop.git
mkdir -p nvtop/build && cd nvtop/build
git checkout fix_gpu_rate
cmake ..
make
XuehaiPan commented 3 years ago

I have the same issue when a zombie process is consuming GPU memory. And nvtop works as expected when I filter out those GPUs. But I think my situation is slightly different from @shmalex's: as the image below shows, nvtop extracts all process names correctly.

Screenshot from 2021-06-07 17-56-04

On my machine, there are some zombie processes on the GPU caused by some issues of PyTorch:

Screenshot 1

Screenshot 2

I added the following lines to the function gpuinfo_nvidia_get_process_utilization:

fprintf(stderr, "Count=%u\n", samples_count);
for (unsigned i = 0; i < samples_count; ++i) {
  fprintf(stderr, "PID=%u %%SM=%u %%ENC=%u %%DEC=%u TS=%llu\n", samples[i].pid, samples[i].smUtil, samples[i].encUtil, samples[i].decUtil, samples[i].timeStamp);
}
fprintf(stderr, "\n");

Then I ran it as nvtop -s 7 2> debug.txt.

I get the following outputs:

Count=100
PID=50554 %SM=37100 %ENC=50583 %DEC=50624 TS=217144956601725
PID=50021 %SM=38305 %ENC=49270 %DEC=50571 TS=158819300671603
PID=50017 %SM=0 %ENC=3313 %DEC=0 TS=49936
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
... # 94 lines same as above

The output from pynvml:

$ pip3 install nvidia-ml-py==11.450.51  # the official NVML Python Bindings (http://pypi.python.org/pypi/nvidia-ml-py/)
$ ipython3
Python 3.9.5 (default, May  3 2021, 15:11:33) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.24.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from pynvml import *

In [2]: nvmlInit()

In [3]: handle = nvmlDeviceGetHandleByIndex(7)

In [4]: for sample in nvmlDeviceGetProcessUtilization(handle, 0):
   ...:     print(sample)
   ...:     
c_nvmlProcessUtilizationSample_t(pid: 0, timeStamp: 0, smUtil: 0, memUtil: 0, encUtil: 0, decUtil: 0)
c_nvmlProcessUtilizationSample_t(pid: 0, timeStamp: 0, smUtil: 0, memUtil: 0, encUtil: 0, decUtil: 0)
c_nvmlProcessUtilizationSample_t(pid: 0, timeStamp: 0, smUtil: 0, memUtil: 0, encUtil: 0, decUtil: 0)
... # 97 lines same as above

The output of nvtop is different from the output of pynvml on the same GPU.
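
For reference, the raw NVML call that both tools wrap can also be exercised directly from C. This is only a rough sketch of the usual two-call pattern (query the sample count, then fetch the samples), using device index 7 as in the ipython session above; link against -lnvidia-ml:

#include <nvml.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  nvmlInit();
  nvmlDevice_t dev;
  nvmlDeviceGetHandleByIndex(7, &dev);

  // First call with a NULL buffer just reports how many samples are buffered.
  unsigned int count = 0;
  nvmlDeviceGetProcessUtilization(dev, NULL, &count, 0);

  if (count > 0) {
    nvmlProcessUtilizationSample_t *samples = calloc(count, sizeof(*samples));
    nvmlDeviceGetProcessUtilization(dev, samples, &count, 0);
    for (unsigned int i = 0; i < count; ++i)
      printf("PID=%u %%SM=%u %%ENC=%u %%DEC=%u TS=%llu\n",
             samples[i].pid, samples[i].smUtil, samples[i].encUtil,
             samples[i].decUtil, samples[i].timeStamp);
    free(samples);
  }

  nvmlShutdown();
  return 0;
}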

XuehaiPan commented 3 years ago

For me, this issue disappears when I run nvtop inside a docker container.

docker build --tag nvtop .
docker run --interactive --tty --rm --runtime=nvidia --gpus all --pid=host nvtop

Screenshot 3

Syllo commented 3 years ago

Interesting, so indeed the function nvmlDeviceGetProcessUtilization can be a bit finicky.

@XuehaiPan What is the timestamp that is passed to the function in the variable internal->last_utilization_timestamp? If it is larger than or equal to the ones returned (the TS=...), this might be the way to filter the results. Although this parameter is supposed to tell the driver to return only utilization samples more recent than it!

XuehaiPan commented 3 years ago

What is the timestamp that is passed to the function in the variable internal->last_utilization_timestamp?

The first value is 0, and it gets a seemingly random value on each refresh (since internal->last_utilization_timestamp = samples[0].timeStamp is set at the end of the previous update):

Count=100 TS=0
PID=50554 %SM=37100 %ENC=50583 %DEC=50624 TS=217144956601725
PID=50021 %SM=38305 %ENC=49270 %DEC=50571 TS=158819300671603
PID=50017 %SM=0 %ENC=93761 %DEC=0 TS=49936
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
# ...

Count=100 TS=217144956601725
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
# ...
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
PID=0 %SM=0 %ENC=82417 %DEC=0 TS=0
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
# ...

# ...

Count=100 TS=1047527424
PID=38576 %SM=47961 %ENC=1047527424 %DEC=0 TS=1047527424
PID=875831345 %SM=909586487 %ENC=926037305 %DEC=840970297 TS=3617010763600568368
PID=540094517 %SM=875573536 %ENC=874524960 %DEC=540029472 TS=4048798948548490784
PID=842145840 %SM=540024880 %ENC=875837238 %DEC=926103344 TS=2320532713153574197
PID=892680496 %SM=840970272 %ENC=909189170 %DEC=858863671 TS=3467824627879784480
PID=824193585 %SM=842342455 %ENC=825570848 %DEC=807415840 TS=3539882221917712416
PID=807415840 %SM=2609 %ENC=1178666720 %DEC=32532 TS=3539882221917712416
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
PID=0 %SM=0 %ENC=80353 %DEC=0 TS=0
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
PID=1178665568 %SM=926234675 %ENC=1869116537 %DEC=1914712942 TS=80305
PID=1651076204 %SM=943071286 %ENC=825040944 %DEC=842150944 TS=3978147638086279258
PID=909586487 %SM=540095029 %ENC=842215732 %DEC=540226359 TS=3611941001401742391
PID=858993206 %SM=807417138 %ENC=842217015 %DEC=540024882 TS=2319389199467295520
PID=942743600 %SM=909194549 %ENC=807415840 %DEC=807415840 TS=4120854352152769588
PID=909455904 %SM=909586720 %ENC=807415840 %DEC=540487968 TS=3616728262244775473
PID=807416119 %SM=540024880 %ENC=540024880 %DEC=667953 TS=2319389199166349369
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
Syllo commented 3 years ago

I'm confused! Or maybe I just don't get the same behavior with my driver version... I get this:

Samples 3, last TS 0
PID 18429, GPU% 1, ENC% 0, DEC% 0, TS 1623157629331552
PID 7044, GPU% 12, ENC% 0, DEC% 0, TS 1623157633175692
PID 7180, GPU% 7, ENC% 0, DEC% 0, TS 1623157632005826
Samples 2, last TS 1623157629331552
PID 7044, GPU% 11, ENC% 0, DEC% 0, TS 1623157633175692
PID 7180, GPU% 9, ENC% 0, DEC% 0, TS 1623157632005826
Samples 2, last TS 1623157633175692
PID 7044, GPU% 4, ENC% 0, DEC% 0, TS 1623157637687768
PID 7180, GPU% 16, ENC% 0, DEC% 0, TS 1623157639191912
Samples 2, last TS 1623157637687768
PID 7044, GPU% 1, ENC% 0, DEC% 0, TS 1623157639526261
PID 7180, GPU% 19, ENC% 0, DEC% 0, TS 1623157640194710

Did you insert your code after the second call to nvmlDeviceGetProcessUtilization? Just before

if (samples_count) {
  internal->last_utilization_timestamp = samples[0].timeStamp;
}
XuehaiPan commented 3 years ago

Yes, replying to this:

Did you insert your code after the second call to nvmlDeviceGetProcessUtilization? Just before


if (samples_count) {
  internal->last_utilization_timestamp = samples[0].timeStamp;
}

Normally, samples_count exactly equals the number of processes on the GPU. For me, there are 2 zombie processes but samples_count=100, both for nvtop and for pynvml.

I think it may be an upstream driver issue triggered by PyTorch (pytorch/pytorch#4293), or by TeamViewer in @shmalex's case (I'm not so sure). We could set the utilization rates to zero when NVML returns illegal values, rather than clamping them into [0, 100]. Going further (optionally), nvtop could print a message on exit asking the user to reset the GPU or reboot the machine. (We should distinguish between broken processes and normal processes that are just exiting, i.e. zombies that linger for only 1-2 seconds.)
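
A rough sketch of that zeroing, reusing the samples/samples_count variables from gpuinfo_nvidia_get_process_utilization above (just an illustration, not a tested patch):

for (unsigned i = 0; i < samples_count; ++i) {
  if (samples[i].smUtil > 100 || samples[i].encUtil > 100 ||
      samples[i].decUtil > 100) {
    // The driver returned garbage for this sample; report "no activity"
    // instead of clamping the bogus rates into [0, 100].
    samples[i].smUtil = 0;
    samples[i].encUtil = 0;
    samples[i].decUtil = 0;
  }
}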

XuehaiPan commented 3 years ago

I created a gist here: https://gist.github.com/XuehaiPan/bc4834bf40723fe0c994b03d9c0473e4

git clone https://gist.github.com/bc4834bf40723fe0c994b03d9c0473e4.git nvml-example
cd nvml-example

# C
sed -i 's|/usr/local/cuda-10.1|/usr/local/cuda|g' Makefile  # change the CUDA_HOME
make && ./example

# Python
pip3 install nvitop
python3 example.py

The files output-c.txt and output-py.txt in the gist are the output on my machine.

On my machine, I'm sure this issue is not caused by nvtop.

Syllo commented 3 years ago

Thanks for all the info. So it seems that there is something wrong with either the driver or the function nvmlDeviceGetProcessUtilization. After looking at the results that you provided, my solution is to filter out the samples that cannot possibly be valid before using them:
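
Something along these lines (a sketch of the idea only, not the exact patch; it assumes the samples, samples_count and internal variables from gpuinfo_nvidia_get_process_utilization):

for (unsigned i = 0; i < samples_count; ++i) {
  // Ignore samples the driver should not have returned: stale timestamps
  // or utilization rates that cannot be valid percentages.
  if (samples[i].timeStamp <= internal->last_utilization_timestamp ||
      samples[i].smUtil > 100 || samples[i].encUtil > 100 ||
      samples[i].decUtil > 100)
    continue;
  // ... use samples[i] to update the corresponding process as before ...
}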

This should iron out the case where the driver misbehaves for whatever reason.

shmalex commented 3 years ago

I can confirm - the issue is not coming up again. On 1.2.0 the issue reproduced; on 1.2.1 there is no issue. @Syllo @XuehaiPan thank you very much.

Syllo commented 3 years ago

All right. Thanks for the feedback @shmalex. Take care.