Syllo / nvtop

GPU & Accelerator process monitoring for AMD, Apple, Huawei, Intel, NVIDIA and Qualcomm

Ubuntu 22.04 / nvtop v3.0.0 - initDeviceSysfsPaths fails for AMDGPU on AMD Advantage Laptop #187

Closed Fludizz closed 7 months ago

Fludizz commented 1 year ago

On an Asus AMD Advantage laptop (ROG Strix G513QY) running Ubuntu 22.04, nvtop 3.0.0 from the PPA fails to start with the following error:

fludizz@pauwel:~$ nvtop -d 50
nvtop: ./src/extract_gpuinfo_amdgpu.c:324: initDeviceSysfsPaths: Assertion `gpu_info->hwmonDevice != NULL' failed.
Aborted

The system is fully up to date and has been rebooted after installing nvtop. I have confirmed that it is using the amdgpu driver, not radeon, and that the device is properly detected by the system:

fludizz@pauwel:~$ lspci | egrep -i "display|vga"
03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [Radeon RX 6700/6700 XT / 6800M] (rev c3)
07:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne (rev c4)
fludizz@pauwel:~$ lsmod | egrep "radeon|amdgpu"
amdgpu               9867264  17
iommu_v2               24576  1 amdgpu
gpu_sched              45056  1 amdgpu
i2c_algo_bit           16384  1 amdgpu
drm_ttm_helper         16384  1 amdgpu
ttm                    86016  2 amdgpu,drm_ttm_helper
drm_kms_helper        311296  1 amdgpu
drm                   622592  14 gpu_sched,drm_kms_helper,amdgpu,drm_ttm_helper,ttm
fludizz@pauwel:~$ glxinfo | grep "OpenGL renderer"
OpenGL renderer string: RENOIR (renoir, LLVM 15.0.6, DRM 3.42, 5.15.0-60-generic)
fludizz@pauwel:~$ DRI_PRIME=1 glxinfo | grep "OpenGL renderer"
OpenGL renderer string: AMD Radeon RX 6800M (navi22, LLVM 15.0.6, DRM 3.42, 5.15.0-60-generic)

There are hwmon devices for both the iGPU and the dGPU. I've used those to monitor power states and GPU usage before, but I wanted a more convenient way to monitor this than manually parsing sysfs information (readings below are from an idle system):

fludizz@pauwel:~$ cat /sys/module/amdgpu/drivers/pci\:amdgpu/0000\:03\:00.0/gpu_busy_percent 
0
fludizz@pauwel:~$ cat /sys/module/amdgpu/drivers/pci\:amdgpu/0000\:07\:00.0/gpu_busy_percent 
0

I installed both libdrm-dev and libsystemd-dev in an attempt to resolve this issue, without success. Is this a bug in nvtop, or is there something specific to Ubuntu 22.04 that I have missed?
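
For readers who want to script the manual sysfs check shown above, here is a minimal Python sketch. It assumes the same directory layout as in the commands (`/sys/module/amdgpu/drivers/pci:amdgpu/<PCI address>/gpu_busy_percent`). The function name `read_gpu_busy_percent` and the `driver_dir` parameter are made up for this illustration and are not part of nvtop.

```python
from pathlib import Path

# Location used in the commands above. It is a parameter so the function
# can also be pointed at a test directory (assumption: same layout).
AMDGPU_DRIVER_DIR = Path("/sys/module/amdgpu/drivers/pci:amdgpu")


def read_gpu_busy_percent(driver_dir=AMDGPU_DRIVER_DIR):
    """Return {pci_address: busy_percent or None} for each bound device."""
    usage = {}
    driver_dir = Path(driver_dir)
    if not driver_dir.is_dir():
        return usage  # amdgpu not loaded, or a different sysfs layout
    for entry in sorted(driver_dir.iterdir()):
        # Device entries are named like 0000:03:00.0; other entries in the
        # driver directory (bind, unbind, ...) contain no colons.
        if entry.name.count(":") != 2:
            continue
        try:
            text = (entry / "gpu_busy_percent").read_text()
            usage[entry.name] = int(text.strip())
        except (OSError, ValueError):
            usage[entry.name] = None  # file missing or not a number
    return usage


if __name__ == "__main__":
    for address, busy in read_gpu_busy_percent().items():
        print(f"{address}: {busy if busy is not None else 'n/a'}")
```

A device without a readable `gpu_busy_percent` file is reported as `None` rather than raising, so one odd device does not hide the others.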

Syllo commented 1 year ago

Hello @Fludizz,

Thanks for the bug report.

Is there an hwmon folder in /sys/module/amdgpu/drivers/pci\:amdgpu/0000\:07\:00.0 on your system? If not, what is your kernel version (uname -r)?

I use this location to find some GPU metrics, such as fan speed and temperature. I will work out a patch to ignore these metrics if the folder is missing.
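
The "ignore these metrics if the folder is missing" idea can be sketched as follows. This is a Python illustration only, not the nvtop patch (nvtop is written in C). The helper names `find_hwmon_dir` and `read_temperature_celsius` are hypothetical, and reading `temp1_input` is an assumption about which sensor file is of interest.

```python
from pathlib import Path


def find_hwmon_dir(device_dir):
    """Return the first hwmon* directory under a device, or None if absent."""
    hwmon_root = Path(device_dir) / "hwmon"
    if not hwmon_root.is_dir():
        return None
    candidates = sorted(
        p for p in hwmon_root.iterdir() if p.name.startswith("hwmon")
    )
    return candidates[0] if candidates else None


def read_temperature_celsius(device_dir):
    """Return the temperature in degrees Celsius, or None when unavailable."""
    hwmon = find_hwmon_dir(device_dir)
    if hwmon is None:
        return None  # skip the metric instead of aborting
    try:
        # The kernel hwmon interface reports temp*_input in millidegrees Celsius.
        raw = (hwmon / "temp1_input").read_text().strip()
        return int(raw) / 1000.0
    except (OSError, ValueError):
        return None
```

Returning `None` lets the caller display "n/a" for that metric, where an assertion on a missing hwmon directory would terminate the whole program.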

Fludizz commented 1 year ago

fludizz@pauwel:~$ ls /sys/module/amdgpu/drivers/pci\:amdgpu/0000\:07\:00.0/hwmon/
hwmon5
fludizz@pauwel:~$ ls /sys/module/amdgpu/drivers/pci\:amdgpu/0000\:03\:00.0/hwmon/
hwmon4

The folders are present and contain hwmon5 for the integrated GPU (PCI address 07:00.0) and hwmon4 for the dedicated GPU (PCI address 03:00.0).

Regarding kernel version, it just got updated to a newer version since I opened the ticket (but the issue is unchanged):

fludizz@pauwel:~$ uname -r
5.15.0-66-generic

Midnight145 commented 1 year ago

I'm having the same issue, with the same ROG Strix G513QY.

Midnight145 commented 1 year ago

$ ls /sys/module/amdgpu/drivers/pci\:amdgpu/0000\:07\:00.0
ls: cannot access '/sys/module/amdgpu/drivers/pci:amdgpu/0000:07:00.0': No such file or directory

$ ls /sys/module/amdgpu/drivers/pci:amdgpu/ | grep 0000
0000:03:00.0
0000:08:00.0

$ uname -r                                                
5.15.0-46-generic

$ lsmod | egrep "radeon|amdgpu"
amdgpu               9850880  100
iommu_v2               24576  1 amdgpu
gpu_sched              45056  1 amdgpu
i2c_algo_bit           16384  1 amdgpu
drm_ttm_helper         16384  1 amdgpu
ttm                    86016  3 vmwgfx,amdgpu,drm_ttm_helper
drm_kms_helper        311296  2 vmwgfx,amdgpu
drm                   622592  29 vmwgfx,gpu_sched,drm_kms_helper,amdgpu,drm_ttm_helper,ttm

$ lspci | egrep -i "display|vga"
03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [Radeon RX 6700/6700 XT / 6800M] (rev c3)
08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne (rev c7)

fletcher-blight commented 1 year ago

I am having the same issue on a fresh install of Ubuntu 23.04 on a desktop with a 7900XTX GPU. Is there any progress with this?

brimston3 commented 1 year ago

This appears to be fixed in 3.0.2 by commit 4fdf2f0. I can reproduce the failure by building the package with dpkg-buildpackage on Debian 12 using version 3.0.1 (Debian's 3.0.1-1 version is source-identical to the 3.0.1 tag). Before I got there, I stepped through the dbgsym build in gdb until I found that the second GPU was overwriting the previous entry with its PCI bus info and leaving the second entry all zeros.

I rebuilt the Debian package using 3.0.2 and the issue goes away. It would only show up if you had multiple AMD GPUs in a system.
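
To make the described failure mode concrete, here is a hypothetical Python sketch of that class of bug. It is not nvtop's code and makes no claim about what commit 4fdf2f0 actually changed. The field names and both functions are invented to show how a slot index that never advances produces "first entry overwritten, second entry left empty".

```python
def empty_entry():
    # Stand-in for a zero-initialised per-GPU record (hypothetical fields).
    return {"pci_address": "", "hwmon": None}


def fill_entries_buggy(pci_addresses):
    """Every device is written to slot 0; later slots stay empty."""
    entries = [empty_entry() for _ in pci_addresses]
    slot = 0
    for address in pci_addresses:
        entries[slot] = {"pci_address": address, "hwmon": "hwmon of " + address}
        # Bug: slot is never incremented, so the next device overwrites this one.
    return entries


def fill_entries_fixed(pci_addresses):
    """Each device is written to its own slot."""
    entries = [empty_entry() for _ in pci_addresses]
    for slot, address in enumerate(pci_addresses):
        entries[slot] = {"pci_address": address, "hwmon": "hwmon of " + address}
    return entries


if __name__ == "__main__":
    gpus = ["0000:03:00.0", "0000:07:00.0"]
    print(fill_entries_buggy(gpus))
    print(fill_entries_fixed(gpus))
```

With two devices, the buggy variant leaves the second record with `hwmon` unset, which is the kind of state a "hwmon is not NULL" assertion would trip over. With a single device both variants behave identically, consistent with the problem only appearing on multi-GPU systems.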

Craig-Macomber commented 1 year ago

I suspect the integrated GPU in the Ryzen 7000 series counts as a second GPU, which would make this issue affect far more people than it otherwise would. I hit it with only one discrete GPU.

Craig-Macomber commented 1 year ago

I have confirmed that the AppImage for 3.0.2, which is supposed to include this fix, works correctly. I'll use it until my deb gets an update :).

Fludizz commented 1 year ago

Excellent! I'll wait for version 3.0.2 to be pushed to the PPA to confirm. Once it's in that repo, I will retest and validate, then close the bug.

caleb87 commented 1 year ago

I had this problem as well, so I used git clone to make it, and I can confirm it works fine now. Awesome program! Best GPU monitor I could find.

Fludizz commented 1 year ago

@flexiondotorg - pinging you here as you manage the PPA for nvtop - would you be able to update the repo to include version 3.0.2?

I have installed the .deb package from the Ubuntu "Mantic Minotaur" repository, and that version (copied from Debian Sid) works without dependency issues on Ubuntu 22.04 as well.

qwertychouskie commented 9 months ago

I messaged the owner of the PPA, so hopefully it gets updated.

This issue should probably be closed, since the issue is fixed and released in nvtop itself, and none of the nvtop devs are responsible for the PPA. Any Ubuntu users that don't already have a newer version in their distro repos can use the Snap package.

Syllo commented 7 months ago

Cheers

cweiske commented 6 months ago

I had the same problem on Debian 12. Manually installing 3.0.2 from https://packages.debian.org/trixie/nvtop made the error go away.