ROCm / rocminfo

ROCm Application for Reporting System Info

rocminfo fails with Ubuntu 20.04 and Vega 10 graphics #40

Closed taichichuan closed 1 week ago

taichichuan commented 3 years ago

I'm seeing something unusual and I'm hoping that someone can provide a clue as to what to do next. I've successfully installed ROCm 4.x on another machine using Ubuntu 20.04 and the stock 5.6.0-1047 oem kernel. But, in trying the same installation on a new platform, it's failing. Here's some info:

dmesg:
[ 0.000000] Linux version 5.6.0-1047-oem (buildd@lgw01-amd64-055) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #51-Ubuntu SMP Fri Feb 5 11:32:46 UTC 2021 (Ubuntu 5.6.0-1047.51-oem 5.6.19)
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.6.0-1047-oem root=UUID=a2be4603-e2ce-44b2-9bd1-338faf709837 ro quiet splash vt.handoff=7

$ dmesg | grep kfd
[ 1.482926] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[ 1.483779] kfd kfd: amdgpu: added device 1002:15d8

$ groups
mike adm cdrom sudo dip video plugdev render lpadmin lxd sambashare

$ lsmod | grep amdgpu
amdgpu 5828608 5
amd_iommu_v2 20480 1 amdgpu
amd_sched 36864 1 amdgpu
amdttm 106496 1 amdgpu
amdkcl 24576 2 amdttm,amdgpu
i2c_algo_bit 16384 1 amdgpu
drm_kms_helper 208896 1 amdgpu
drm 540672 8 drm_kms_helper,amd_sched,amdttm,amdgpu,amdkclm

$ ls -la /dev/kfd
crw-rw-r-- 1 root render 235, 0 Feb 10 17:07 /dev/kfd

However, when I try to run rocminfo I get:

ROCk module is loaded
Unable to open /dev/kfd read-write: Bad address
Failed to get user name to check for render group membership
hsa api call failure at: /src/rocminfo/rocminfo.cc:1142
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

I have rebooted and I also have verified that OpenGL is running via glxinfo.

I thought that Vega 10 was a supported platform. Is there something I need to pass to the kernel in the command line?

TIA,

Mike

fxkamd commented 3 years ago

Looks like a problem in kfd_open. Are there any errors or warnings in "dmesg" after you run rocminfo? Might be best to attach a complete dmesg log.
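
For readers following along, the failing step can be tried in isolation. The following is a minimal Python sketch, not part of rocminfo, that attempts the same kind of read-write open and reports the errno name. The helper name try_open_rdwr is made up for illustration, and the sketch assumes the failure surfaces at open time, as the "Unable to open /dev/kfd read-write" message suggests.

```python
import errno
import os


def try_open_rdwr(path="/dev/kfd"):
    """Attempt a read-write open of path and return (ok, description)."""
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError as exc:
        # Map the numeric errno to its symbolic name, e.g. EFAULT or EACCES.
        name = errno.errorcode.get(exc.errno, "unknown")
        return False, f"{path}: {name} ({exc.strerror})"
    os.close(fd)
    return True, f"{path}: opened read-write"


if __name__ == "__main__":
    ok, detail = try_open_rdwr()
    print(detail)
```

An EFAULT result would correspond to the "Bad address" text in the rocminfo output above, while EACCES would point at group membership or device permissions instead.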

taichichuan commented 3 years ago

I rebooted and then issued the rocminfo command several times in succession. In looking at the dmesg output, I see two of these failures with each subsequent rocminfo call:

[ 126.322773] amdgpu: Failure to set tba address. error -1.
[ 126.323179] amdgpu: Failure to set tba address. error -1.

I've attached the dmesg in its entirety...

dmesg.out.gz

fxkamd commented 3 years ago

You don't have a Vega10, you have a Raven APU. BTW, I have one in my PC at home, and I see the same problem on 5.6.0-1042-oem. Have you tried installing the rock-dkms package to update the kernel module? If you have, what does "dkms status" say?
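
Since the marketing name on the box can differ from the engineering name, the PCI vendor:device pair that kfd logs is the more reliable identifier to quote in a report. Below is a small Python sketch that pulls those pairs out of dmesg text. The function name kfd_devices is illustrative only, and the regular expression assumes the log line format shown earlier in this thread ("kfd kfd: amdgpu: added device 1002:15d8"); other kernel versions may word the line differently.

```python
import re

# Matches lines such as: "[ 1.483779] kfd kfd: amdgpu: added device 1002:15d8"
KFD_DEVICE_RE = re.compile(
    r"kfd kfd: amdgpu: added device ([0-9a-f]{4}):([0-9a-f]{4})",
    re.IGNORECASE,
)


def kfd_devices(dmesg_text):
    """Return (vendor_id, device_id) pairs for every kfd 'added device' line."""
    return [(v.lower(), d.lower()) for v, d in KFD_DEVICE_RE.findall(dmesg_text)]


sample = "[ 1.483779] kfd kfd: amdgpu: added device 1002:15d8"
print(kfd_devices(sample))  # [('1002', '15d8')]
```

The pair can then be looked up in a PCI ID database or compared against the device ID reported by other tools.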

taichichuan commented 3 years ago

Hmm... All of the descriptions for this unit say:

Product number: 7UP36UA
Product name: HP ENVY x360 - 15m-ds0023dx
Microprocessor: AMD Ryzen™ 7 3700U with Radeon™ Vega 10 Graphics (2.3 GHz base clock, up to 4 GHz max boost clock, 6 MB cache, 4 cores)
Chipset: AMD Integrated SoC
Memory, standard: 8 GB DDR4-2400 SDRAM (2 x 4 GB)
Video graphics: AMD Radeon™ RX Vega 10 Graphics

It figures that there would be an error in HP's description.

In any case, the dkms status shows:

$ dkms status
amdgpu, 4.0-23, 5.6.0-1047-oem, x86_64: installed

FWIW, I did install using the "apt install rocm-dkms" command. And, I actually saw it rebuild the dkms package for the OS release.

taichichuan commented 3 years ago

And, Windows also identifies the GPU as Vega 10:

GPU - AMD Radeon(TM) RX Vega 10 Graphics - Primary/Integrated
VRAM - 2048 MB - DDR4 1200 MHz
Graphics Card Manufacturer - Powered by AMD
Graphics Chipset - AMD Radeon(TM) RX Vega 10 Graphics
Device ID - 15D8
Vendor ID - 1002
SubSystem ID - 85DD
SubSystem Vendor ID - 103C
Revision ID - C1
Bus Type - PCI
Current Bus Settings - PCI
BIOS Version - 016.002.000.011
BIOS Part Number - 113-PICASSO-117
BIOS Date - 2019/09/02 02:55
Usable Memory Size - 2048 MB
Memory Type - DDR4
Memory Clock - 1200 MHz
Core Clock - 1400 MHz
Total Memory Bandwidth - 38 GByte/s
Memory Bit Rate - 2.40 Gbps
2D Driver File Path - /REGISTRY/MACHINE/SYSTEM/CurrentControlSet/Control/Class/{4d36e968-e325-11ce-bfc1-08002be10318}/0000

OpenGL® API Version - 4.6
OpenCL™ API Version - 2.0

taichichuan commented 3 years ago

Whoops! Pressed the wrong button

fxkamd commented 3 years ago

OK, I've done some digging and found what's going on. Ubuntu started mounting /dev with the option noexec. That prevents us from mapping the TBA buffer into the user address space with executable permissions. This problem only affects APUs. Discrete GPUs map the TBA through a different mechanism. As a workaround you can use this command:

sudo mount -o remount,exec /dev

After that rocminfo should succeed. A more permanent solution is under investigation.
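
To see whether a given machine is affected before applying the remount, the mount options for /dev can be read from /proc/mounts. The following Python sketch is one way to do that; the helper names mount_options and is_noexec are made up for illustration, and it assumes the standard /proc/mounts field order (device, mount point, filesystem type, options).

```python
def mount_options(mount_point, mounts_text):
    """Return the option list of the last entry for mount_point, or None."""
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, comma-separated options, ...
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")  # a later mount overrides an earlier one
    return options


def is_noexec(mount_point, mounts_text):
    """True if mount_point is listed and carries the noexec option."""
    opts = mount_options(mount_point, mounts_text)
    return opts is not None and "noexec" in opts


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as fh:
            text = fh.read()
    except OSError:
        text = ""  # not Linux, or /proc is unavailable
    opts = mount_options("/dev", text)
    if opts is None:
        print("/dev not found in /proc/mounts")
    elif "noexec" in opts:
        print("/dev is mounted noexec; the remount workaround may apply")
    else:
        print("/dev is not mounted noexec")
```

Running it again after the remount command should show that noexec is no longer among the options for /dev.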

fxkamd commented 3 years ago

I think it's a misunderstanding between marketing and engineering names. What Marketing calls Vega10 is not what the engineers call Vega10.

taichichuan commented 3 years ago

OK, executing the mount command does allow me to run rocminfo and clinfo! That's great. What's not great is that I now have yet another damn device based on the GFX902 that doesn't work in ROCm. That makes 3 of them at this point, and all of them are different parts (Ryzen 7 4700U, V1605B and the Ryzen 7 3700U). Heavy sigh. I was hoping to get at least one device that would actually work with ROCm, so I could do some comparison testing. But it doesn't appear that I can find a combination that works. This sucks. But, thanks for your help. At least I've got all three platforms to the same dead end. ;-)

BTW, would you happen to know the correct line for /etc/fstab that would allow /dev to be mounted with the correct options at boot?

commandline-be commented 3 years ago

Same observation here: rocminfo fails because /dev is not mounted with exec privileges.

rocminfo does show a secondary device, but hashcat fails to see it.

Ubuntu 20.04 with rocm 4.1.1

After 'mount -o remount,exec /dev' I now see good output for rocminfo, clinfo and hashcat -I. HOWEVER, hashcat segfaults due to the following (anonymised with xxx):

/sys/bus/pci/devices/xxxx:xx:xx.0/hwmon/hwmon3/pwm1: No such file or directory

commandline-be commented 3 years ago

STATUS for Ubuntu 20.04 with rocm 4.1.1

Same observation here: rocminfo fails because /dev is not mounted with exec privileges. rocminfo does show a secondary device, but hashcat fails to see it.

PARTIAL FIX: After 'mount -o remount,exec /dev' I now see good output for rocminfo, clinfo and hashcat -I. HOWEVER, hashcat segfaults due to the following (anonymised with xxx):

/sys/bus/pci/devices/xxxx:xx:xx.0/hwmon/hwmon3/pwm1: No such file or directory

FIX BUT HANG: after a complete cleanup everything now works, but the machine HANGS!

Re-mounting /dev with exec privileges as mentioned before also fixes hashcat -I output such as the following:

clGetDeviceIDs(): CL_DEVICE_NOT_FOUND

ppanchad-amd commented 1 month ago

@taichichuan Apologies for the lack of response. Do you still need assistance with this ticket? If not, please close the ticket. Thanks!