Closed juan-barajas-p closed 1 year ago
Hi @juan-barajas-p, thanks for raising this! The detailed context for the investigation is much appreciated.
The cause is that the UUID contains the null character \x00, which terminates the string buffer.
Your UUID:
uuid = '130961397ada8313ee08000dd8540fe1'
Truncated UUID:
uuid = '130961397ada8313ee08'
As we can see, there is a 00 byte right after ...ee08, and the string buffer terminates early at that null character.
I will submit a quick fix for this.
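The truncation described above can be reproduced with plain ctypes. This is only a minimal sketch of the general failure mode, not nvitop's actual code: it assumes the 16 UUID bytes sit in a fixed-size char array and compares reading them as a NUL-terminated string with reading the whole array.

```python
import ctypes

# The 16 raw UUID bytes from this thread; note the 0x00 byte at index 10.
raw = bytes.fromhex('130961397ada8313ee08000dd8540fe1')

# Copy them into a fixed-size C char array, as a C API would fill it.
buf = (ctypes.c_char * 16).from_buffer_copy(raw)

# `.value` treats the array as a NUL-terminated C string and stops at 0x00.
truncated = buf.value
# `.raw` returns every byte of the array, embedded NULs included.
full = buf.raw

print(truncated.hex())  # 130961397ada8313ee08
print(full.hex())       # 130961397ada8313ee08000dd8540fe1
```

Only the first 10 bytes survive the string-style read, which matches the truncated UUID shown above.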
Hi @juan-barajas-p, I have created a fix to resolve this issue.
You can try it via:
python3 -m pip install git+https://github.com/XuehaiPan/nvitop.git@fix-cuDeviceGetUuid
BTW, you can use Device.cuda.all() or CudaDevice.all() to get all CUDA visible devices.
from nvitop import Device, CudaDevice
# Use this only when you don't want to use the `CUDA_VISIBLE_DEVICES` from the environment variable
all_cuda_devices = Device.from_cuda_visible_devices() # from the environment variable
other_cuda_devices = Device.from_cuda_visible_devices('4,3,0,1') # do not use the environment variable
# alternatives if you only read `CUDA_VISIBLE_DEVICES` from the environment variable
all_cuda_devices = Device.cuda.all() # you can have only `from nvitop import Device`
all_cuda_devices = CudaDevice.all()
Hi! Thank you for the very quick response. Good job with this library, as it's the easiest method of interacting with GPU metrics that I've used.
Ohh of course that's the problem haha. Also, thank you for the tip! I didn't know you could do it that way.
It almost works. I think you meant to apply the fix to api.libcuda.cuDeviceGetUuid instead of api.libcuda.cuDeviceGetUuid_v2, as it's the entrypoint used in api.device._cuda_visible_devices_parser? But if I use cuDeviceGetUuid_v2, it does solve the issue!
Thanks for the notes. I have updated the fix accordingly.
Required prerequisites
What version of nvitop are you using?
1.3.0
Operating system and version
Pop!_OS 22.04 LTS
NVIDIA driver version
535.113.01
NVIDIA-SMI
Python environment
Virtualenv created with micromamba v1.5.1 with mm create --name testing python=3.11, then installed nvitop with pip install nvitop.
Command output:
Problem description
Using the following code snippet results in an empty list:
This happens regardless of whether CUDA_VISIBLE_DEVICES is set or not.
Steps to Reproduce
Command lines:
Traceback
Logs
Expected behavior
I would expect to see the same number of devices given by nvitop.Device.all() when calling nvitop.Device.from_cuda_visible_devices() if CUDA_VISIBLE_DEVICES is not set or if CUDA_VISIBLE_DEVICES is set to all GPUs in the system.
Additional context
This has never happened before on any previous machines using the same nvitop version and OS, which at first led me to believe it was a problem with this particular machine's setup, but after some more testing I'm not so sure. I'm giving the following information to see if nvitop can be improved to deal with this situation accordingly.
I looked into it in more detail, and it turns out that visible_device_indices is empty in this machine whereas in other machines it does find the correct GPU uuid.
Looking closer at _parse_cuda_visible_devices, the complete uuid is correctly detected by _get_all_physical_device_attrs():
But the subprocess that parses visible devices to uuids appears to be missing the last part of the uuid. This causes further logic to assume this uuid is for a MIG device (as it doesn't find it in physical_device_attrs), and among other things, it ends up not showing up as a valid GPU detected by nvitop.
I kept on tracking the incorrect UUID to cuDeviceGetUuid and it appears that this is the point where the uuid is incomplete.
As I understand, this is just a wrapper for using the CUDA driver API, directly using the function cuDeviceGetUuid_v2, so I tried to use NVIDIA's cuda-python to see if I could replicate it, but oddly enough this does return the full uuid of the GPU.
As using the python wrappers of the API returns the expected value, I wonder if there's something nvitop's implementation could do to mitigate this issue.
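One possible mitigation, sketched here under assumptions and not taken from nvitop's source, is to always read the fixed 16 bytes of the UUID buffer instead of treating it as a C string. The helper name format_gpu_uuid and the constant UUID_SIZE are hypothetical; the 'GPU-' prefix followed by the 8-4-4-4-12 hex layout is the form in which NVIDIA tools usually print device UUIDs.

```python
import ctypes
import uuid

UUID_SIZE = 16  # a CUDA device UUID is 16 raw bytes


def format_gpu_uuid(buf):
    """Hypothetical helper: format a raw 16-byte UUID buffer as 'GPU-...'.

    The whole buffer is read by length, so an embedded 0x00 byte
    does not cut the result short.
    """
    raw = bytes(buf)[:UUID_SIZE]
    if len(raw) != UUID_SIZE:
        raise ValueError(f'expected {UUID_SIZE} bytes, got {len(raw)}')
    return 'GPU-' + str(uuid.UUID(bytes=raw))


# Same bytes as the UUID reported in this issue (0x00 at index 10).
buf = (ctypes.c_char * UUID_SIZE).from_buffer_copy(
    bytes.fromhex('130961397ada8313ee08000dd8540fe1')
)
formatted = format_gpu_uuid(buf)
print(formatted)  # GPU-13096139-7ada-8313-ee08-000dd8540fe1
```

Raising on a short buffer, rather than silently formatting it, would make a truncated UUID fail loudly instead of being mistaken for some other kind of device identifier.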