XuehaiPan / nvitop

An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
https://nvitop.readthedocs.io
Apache License 2.0

[BUG] `nvitop.Device.from_cuda_visible_devices()` not detecting GPU #99

Closed juan-barajas-p closed 11 months ago

juan-barajas-p commented 11 months ago


What version of nvitop are you using?

1.3.0

Operating system and version

Pop!_OS 22.04 LTS

NVIDIA driver version

535.113.01

NVIDIA-SMI

Wed Oct  4 08:57:35 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3070 ...    Off | 00000000:01:00.0  On |                  N/A |
| N/A   51C    P8              15W / 125W |     59MiB /  8192MiB |     13%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      3081      G   /usr/lib/xorg/Xorg                           53MiB |
+---------------------------------------------------------------------------------------+

Python environment

Virtual environment created with micromamba v1.5.1 via `micromamba create --name testing python=3.11`, then nvitop installed with `pip install nvitop`.

Command output:

3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0] linux
nvidia-ml-py==12.535.108
nvitop==1.3.0

Problem description

Calling the following results in an empty list, regardless of whether `CUDA_VISIBLE_DEVICES` is set:

import nvitop; nvitop.Device.from_cuda_visible_devices()

Steps to Reproduce

Command lines:

python -c "import nvitop; print(nvitop.Device.from_cuda_visible_devices())"

Traceback

N/A

Logs

N/A

Expected behavior

I would expect `nvitop.Device.from_cuda_visible_devices()` to return the same number of devices as `nvitop.Device.all()` when `CUDA_VISIBLE_DEVICES` is unset, or when it is set to all GPUs in the system.

Additional context

This has never happened before on any previous machine using the same nvitop version and OS, which at first led me to believe it was a problem with this particular machine's setup, but after some more testing I'm not so sure. I'm providing the following information in case nvitop can be improved to handle this situation.

I looked into it in more detail, and it turns out that `visible_device_indices` is empty on this machine, whereas on other machines it finds the correct GPU UUID.

# file: api.device.py, method: from_cuda_visible_devices

visible_device_indices = Device.parse_cuda_visible_devices()  # value: []

Looking closer at `_parse_cuda_visible_devices`, the complete UUID is correctly detected by `_get_all_physical_device_attrs()`:

# file: api.device.py, function: _parse_cuda_visible_devices

physical_device_attrs = _get_all_physical_device_attrs()  # value: _PhysicalDeviceAttrs(index=0, name='NVIDIA GeForce RTX 3070 Ti Laptop GPU', uuid='GPU-13096139-7ada-8313-ee08-000dd8540fe1', support_mig_mode=False)

But the subprocess that resolves visible devices to UUIDs appears to be missing the last part of the UUID. This causes the subsequent logic to assume the UUID belongs to a MIG device (since it is not found in `physical_device_attrs`), and, among other things, the device ends up not being reported as a valid GPU by nvitop.

# file: api.device.py, function: _parse_cuda_visible_devices

raw_uuids = subprocess.check_output(...)  # value: ['13096139-7ada-8313-ee08-']
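For illustration only (the dictionary and matching below are a simplified sketch, not nvitop's actual code), this is roughly why the truncated UUID falls through the physical-device lookup:

```python
# Full UUIDs of the physical devices, keyed for lookup (illustrative).
physical_uuids = {'13096139-7ada-8313-ee08-000dd8540fe1': 0}

# The UUID truncated at the embedded NUL byte, as returned by the subprocess.
raw_uuid = '13096139-7ada-8313-ee08-'

# The truncated string matches no known physical device, so downstream logic
# falls back to treating it as a (nonexistent) MIG device and drops it.
print(raw_uuid in physical_uuids)  # False
```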

I kept tracing the incorrect UUID back to `cuDeviceGetUuid`, and it appears that this is the point where the UUID becomes incomplete.

# file: api.libcuda.py, function: cuDeviceGetUuid

uuid = ''.join(map('{:02x}'.format, uuid.value))  # value: "130961397ada8313ee08"
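The truncation can be reproduced without the CUDA bindings at all: reading a ctypes `c_char` buffer via `.value` treats it as a NUL-terminated C string. A minimal sketch using the UUID bytes from this machine:

```python
import ctypes

# The 16 raw UUID bytes for this GPU; note the 0x00 byte right after 0xee 0x08.
raw_bytes = bytes.fromhex('130961397ada8313ee08000dd8540fe1')

buf = (ctypes.c_char * 16)()
ctypes.memmove(buf, raw_bytes, 16)

# .value stops at the first embedded \x00, dropping everything after it --
# exactly the truncated value seen above.
print(''.join(map('{:02x}'.format, buf.value)))  # 130961397ada8313ee08
```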

As I understand it, this is just a thin wrapper around the CUDA driver API, so I tried NVIDIA's cuda-python to call `cuDeviceGetUuid_v2` directly and see if I could replicate the problem, but oddly enough it does return the full UUID of the GPU.

micromamba create --name testing_2 python=3.11
micromamba activate testing_2
pip install cuda-python  # v12.2.0
python -c "from cuda import cuda; cuda.cuInit(0); print(cuda.cuDeviceGetUuid_v2(0)[1])"
# prints: bytes : 130961397ada8313ee08000dd8540fe1

Since the official Python wrappers of the API return the expected value, I wonder whether there is something nvitop's implementation could do to mitigate this issue.

XuehaiPan commented 11 months ago

Hi, @juan-barajas-p thanks for raising this! Much appreciate the detailed context for the investigation.

The cause is that the UUID contains the null byte `\x00`, which terminates the string buffer early.

Your UUID:

uuid = '130961397ada8313ee08000dd8540fe1'

Stripped UUID:

uuid = '130961397ada8313ee08'

As we can see, there is a `00` right after `...ee08`, and the string buffer terminates early at that null byte.
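A minimal sketch of the kind of fix this calls for (illustrative only, not the actual patch): read the buffer through its fixed-size view (`.raw`) so embedded zero bytes survive:

```python
import ctypes

# Simulate the 16-byte UUID buffer filled in by the driver call.
uuid = (ctypes.c_char * 16)()
ctypes.memmove(uuid, bytes.fromhex('130961397ada8313ee08000dd8540fe1'), 16)

# .raw always returns all 16 bytes, embedded NULs included.
print(''.join(map('{:02x}'.format, uuid.raw)))  # 130961397ada8313ee08000dd8540fe1
```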

I will submit a quick fix for this.

XuehaiPan commented 11 months ago

Hi @juan-barajas-p, I have created a fix to resolve this issue.

You can try it via:

python3 -m pip install git+https://github.com/XuehaiPan/nvitop.git@fix-cuDeviceGetUuid

BTW, you can use Device.cuda.all() or CudaDevice.all() to get all CUDA visible devices.

from nvitop import Device, CudaDevice

all_cuda_devices = Device.from_cuda_visible_devices()             # uses the `CUDA_VISIBLE_DEVICES` environment variable
other_cuda_devices = Device.from_cuda_visible_devices('4,3,0,1')  # overrides the environment variable

# Alternatives if you only read `CUDA_VISIBLE_DEVICES` from the environment variable
all_cuda_devices = Device.cuda.all()  # needs only `from nvitop import Device`
all_cuda_devices = CudaDevice.all()
juan-barajas-p commented 11 months ago

Hi! Thank you for the very quick response. Great job with this library; it's the easiest way of interacting with GPU metrics that I've used.

Ohh of course that's the problem haha. Also, thank you for the tip! I didn't know you could do it that way.

It almost works. I think you meant to apply the fix to `api.libcuda.cuDeviceGetUuid` instead of `api.libcuda.cuDeviceGetUuid_v2`, as it's the entry point used in `api.device._cuda_visible_devices_parser`? But if I use `cuDeviceGetUuid_v2`, it does solve the issue!

XuehaiPan commented 11 months ago

> It almost works. I think you meant to apply the fix to `api.libcuda.cuDeviceGetUuid` instead of `api.libcuda.cuDeviceGetUuid_v2`, as it's the entry point used in `api.device._cuda_visible_devices_parser`?

Thanks for the notes. I have updated the fix accordingly.