Open danielzgtg opened 1 year ago
Rembrandt(VCN 3.0) here, tell me if I can help you
Hi @GreyXor, more testers are always more help!
What you can do immediately is delete the && fam < VEGAM from the following code and report back whether it works on your card: https://github.com/clbr/radeontop/blob/e3bbf06eaed49746f2838a60eb01e7edfc185da5/detect.c#L374
What I am blocked on is more validation and assurance that this or something else is the correct register and is safe for the GPU. Each documentation PDF is ~300 pages long, and someone else needs to spend time reading them if you want this done faster. I am looking for a citation to evidence documenting the old UVD/VCE registers, and a citation to evidence documenting the new VCN registers. This could be either from the documentation or from a mailing list post by someone reliable.
Well, for now when I execute radeontop from the latest master commit, all my Arch Linux is freezing
Oh no! Userspace shouldn't be able to freeze the system.
Run dmesg > dmesg.txt and post that file.
Then try sudo systemctl restart name-of-your-display-manager. Does that log you out and fix the graphics?
I just tested on Navi23. My system didn't crash at all, but this might be because I'm on discrete while you're on integrated.
UVD and VCE were both stuck at 0. I tested in Firefox and they were stuck there even on AV1. I still have no clue what the register is supposed to be at.
all my Arch Linux is freezing
You can help by filing a kernel bug report as suggested by https://github.com/clbr/radeontop/issues/87#issuecomment-705477114 . I think your issue is covered by #87 not the new video code.
Running Linux 6.1.0-rc4 on gentoo here, radeontop causes amdgpu to crash as well, rendering the system unusable.
However, I did the same test on 6.0.0 daily builds from ubuntu and 5.15 binary kernels from gentoo, both with success, but quite surprisingly not on 6.0.0-rc4 (gentoo).
Edit: Specs: 6850u(rembrandt), t14s gen 3 amd
I found a hardware error after installing https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.1-rc4/ . I ran sudo ./radeontop --mem on Git master and was at KDE Wayland desktop with only wobbly windows Konsole.
Here are the relevant lines:
[ 1.013284] mce: [Hardware Error]: Machine check events logged
[ 1.013285] mce: [Hardware Error]: CPU 6: Machine Check: 0 Bank 5: bea0000000000108
[...]
[ 1.013290] mce: [Hardware Error]: TSC 0 ADDR ffffff894503f0 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
[ 1.013293] mce: [Hardware Error]: PROCESSOR 2:a20f12 TIME 1667925280 SOCKET 0 APIC 10 microcode a20120a
My dual monitors and my backlit keyboard turned black. There is still no indication this is an amdgpu or a radeontop problem. I could do serial debugging but I doubt it's worth the time because it's a machine check error which might not even leave a dmesg or kernel log.
It turns out this also affects Ubuntu kernel 5.19.0-21.21. On both this and the other kernel, it never happens immediately. It's probably a race condition with memory mapping. I'm on Navi23 with X570 socket CPU. I have PCI atomics and AMD IOMMU active.
Normal radeontop without --mem survived a full glmark2, PyTorch CUDA, Firefox, AV1, and Minecraft with heavy shaders.
I ran it a third time with --mem. Like before, all the numbers were at or around "100%" when idle. Oddly, it kept working throughout glmark2, and I sometimes moved my mouse. When the benchmark ended and I only moved my mouse a little bit, only now did my whole computer go black. Without --mem I did not observe this. In TTY1 with the display manager stopped, I also did not observe this even after several minutes.
The other issue mentions kernel 5.2. The main thing I see there is BACO, which is power saving and has a reputation for causing problems even with other things like USB. I added amdgpu.runpm=0 to disable power saving, and sudo radeontop --mem did not freeze even after several minutes. https://gitlab.freedesktop.org/drm/amd/-/issues/1820 mentions that parameter and so does https://wiki.archlinux.org/title/AMDGPU .
Built from git with detect.c patched
lspci -nn | grep VGA; uname -r
04:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] VanGogh [AMD Custom GPU 0405] [1002:163f] (rev ae)
6.1.12-valve2-1-neptune-61
UVD and VCE always 100%
After digging for some time, I found something by looking at https://github.com/torvalds/linux/blob/fff5a5e7f528b2ed2c335991399a766c2cf01103/drivers/gpu/drm/amd/include/asic_reg/vcn/vcn_3_0_0_offset.h#LL796C16-L796C23
UVD_STATUS= 0x1fa00
/* UVD_STATUS is a byte offset; amdgpu_read_mm_registers takes a dword offset. */
static int getuvd_amdgpu(uint32_t *out) {
	return amdgpu_read_mm_registers(amdgpu_dev, (UVD_STATUS) / 4, 1,
					0xffffffff, 0, out);
}
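For anyone reproducing this: libdrm's amdgpu_read_mm_registers takes its register offset in 32-bit dwords, which is why the byte offset is divided by 4. Here is a minimal sketch of that conversion with an alignment check. byte_to_dword_offset and UVD_STATUS_BYTE_OFFSET are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Byte offset quoted in this comment. */
#define UVD_STATUS_BYTE_OFFSET 0x1fa00u

/* Hypothetical helper: converts a byte offset into a dword offset.
 * Returns 0 on success, -1 if the offset is not 4-byte aligned. */
static int byte_to_dword_offset(uint32_t byte_offset, uint32_t *dword_offset)
{
	if (byte_offset % 4 != 0)
		return -1;
	*dword_offset = byte_offset / 4;
	return 0;
}
```

For 0x1fa00 this yields a dword offset of 0x7e80.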
I got 0xDEADBEEF while idle, 0x00000026 while decoding (tested with accelerated Chromium), and 0x00000046 while encoding (tested with ffmpeg VAAPI).
While running an ffmpeg transcode:
uvd:0x00000026
uvd:0x00000026
uvd:0x00000046
uvd:0x00000046
uvd:0x00000046
uvd:0x00000026
uvd:0x00000026
uvd:0x00000046
uvd:0x00000046
uvd:0x00000046
uvd:0x00000026
uvd:0xDEADBEEF
uvd:0x00000046
uvd:0x00000046
uvd:0x00000046
uvd:0x00000026
uvd:0x00000026
uvd:0x00000006
uvd:0x00000046
uvd:0x00000046
uvd:0x00000046
uvd:0xDEADBEEF
uvd:0x00000026
And while playing video in Chromium:
uvd:0xDEADBEEF
uvd:0xDEADBEEF
uvd:0xDEADBEEF
uvd:0xDEADBEEF
uvd:0xDEADBEEF
uvd:0xDEADBEEF
uvd:0xDEADBEEF
uvd:0xDEADBEEF
uvd:0xDEADBEEF
uvd:0xDEADBEEF
uvd:0xDEADBEEF
uvd:0xDEADBEEF
uvd:0xDEADBEEF
uvd:0xDEADBEEF
uvd:0xDEADBEEF
uvd:0x00000026
uvd:0xDEADBEEF
uvd:0xDEADBEEF
uvd:0xDEADBEEF
uvd:0xDEADBEEF
uvd:0xDEADBEEF
uvd:0xDEADBEEF
uvd:0xDEADBEEF
uvd:0xDEADBEEF
Device: steam deck/vangogh/rdna2
UVD_STATUS= 0x1fa00
amdgpu_read_mm_registers returns -14 on RX 6650 XT and the value is not written. Does it work on any other GPU than the one in the Steam Deck, or is it Valve-specific?
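On Linux, -14 corresponds to -EFAULT. Because the output is left untouched on failure, a caller should check the return code before using the value. A minimal sketch, assuming a reader callback that wraps amdgpu_read_mm_registers; the reg_reader_fn typedef, try_read_reg, and both stub readers are hypothetical and exist only to make the example self-contained.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical reader signature; a thin wrapper around
 * amdgpu_read_mm_registers would fit here. */
typedef int (*reg_reader_fn)(uint32_t dword_offset, uint32_t *out);

/* Returns 1 and stores the register value on success.
 * Returns 0 and leaves *value untouched if the read failed. */
static int try_read_reg(reg_reader_fn reader, uint32_t dword_offset,
			uint32_t *value)
{
	uint32_t tmp = 0;
	int ret = reader(dword_offset, &tmp);

	if (ret != 0)
		return 0; /* e.g. -EFAULT (-14): tmp was never written */
	*value = tmp;
	return 1;
}

/* Stub readers for illustration only. */
static int reader_fails(uint32_t off, uint32_t *out)
{
	(void)off;
	(void)out;
	return -EFAULT;
}

static int reader_ok(uint32_t off, uint32_t *out)
{
	(void)off;
	*out = 0x26;
	return 0;
}
```

With this pattern a failed read can be shown as "unsupported" rather than as a stale or uninitialized number.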
uvd:0xDEADBEEF
This looks like a value PCI is returning when it can't connect to the GPU. Things like this might crash people's computers again.
It should work for all VCN 3 based cores. Maybe 0xDEADBEEF is the value the driver sets by default when the core is not in use. I also found this:
enum engine_status_constants {
UVD_PGFSM_STATUS__UVDM_UVDU_PWR_ON = 0x2AAAA0,
UVD_PGFSM_STATUS__UVDM_UVDU_PWR_ON_2_0 = 0xAAAA0,
UVD_PGFSM_STATUS__UVDM_UVDU_UVDLM_PWR_ON_3_0 = 0x2A2A8AA0,
UVD_PGFSM_CONFIG__UVDM_UVDU_PWR_ON = 0x00000002,
UVD_STATUS__UVD_BUSY = 0x00000004,
GB_ADDR_CONFIG_DEFAULT = 0x26010011,
UVD_STATUS__IDLE = 0x2,
UVD_STATUS__BUSY = 0x5,
UVD_POWER_STATUS__UVD_POWER_STATUS_TILES_OFF = 0x1,
UVD_STATUS__RBC_BUSY = 0x1,
UVD_PGFSM_STATUS_UVDJ_PWR_ON = 0,
};
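To check these constants against the logged samples, here is a rough sketch that skips 0xDEADBEEF readings and tests the UVD_STATUS__UVD_BUSY bit on the rest. Treating that constant as a bit mask and 0xDEADBEEF as an "unreadable" marker are both assumptions, and busy_percent is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

#define UVD_STATUS__UVD_BUSY 0x00000004u /* from the enum above */
#define READ_SENTINEL        0xDEADBEEFu /* assumed "not readable" marker */

/* Returns the percentage of readable samples with the busy bit set,
 * or -1 if every sample was the sentinel. The number of sentinel
 * samples is stored in *unreadable. */
static int busy_percent(const uint32_t *samples, size_t n, size_t *unreadable)
{
	size_t readable = 0, busy = 0, i;

	*unreadable = 0;
	for (i = 0; i < n; i++) {
		if (samples[i] == READ_SENTINEL) {
			(*unreadable)++;
			continue;
		}
		readable++;
		if (samples[i] & UVD_STATUS__UVD_BUSY)
			busy++;
	}
	if (readable == 0)
		return -1;
	return (int)(busy * 100 / readable);
}
```

Every readable value in the logs above (0x26, 0x46, 0x06) has the 0x4 bit set, so by this test the block would look busy whenever it can be read at all. Whether that bit really tracks load is still an open question.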
UVD_STATUS= 0x1fa00
amdgpu_read_mm_registers returns -14 on RX 6650 XT and the value is not written. Does it work on any other GPU than the one in the Steam Deck, or is it Valve-specific?
I think it depends on the Linux kernel version.
The AMDGPU driver had a bug that allowed the amdgpu_read_mm_registers function to succeed on registers that were not originally allowed.
To check VCN status, fdinfo or gpu_metrics must be supported.
I don't think it is readable from the register.
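As a rough illustration of the fdinfo route: the DRM client usage stats format reports per-engine busy time as lines of the form "drm-engine-<name>: <value> ns", and utilization is the change in that counter over the sampling interval. The sketch below parses one such line and computes a percentage. The engine names used for video ("dec", "enc") are assumptions that may differ by kernel version, and both helpers are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Parses a line like "drm-engine-dec:\t1500000 ns".
 * Returns 1 and stores the counter if the line is for `engine`, else 0. */
static int parse_engine_ns(const char *line, const char *engine, uint64_t *ns)
{
	char key[64];
	unsigned long long v;
	int n = snprintf(key, sizeof key, "drm-engine-%s:", engine);

	if (n < 0 || (size_t)n >= sizeof key)
		return 0;
	if (strncmp(line, key, (size_t)n) != 0)
		return 0;
	if (sscanf(line + n, " %llu ns", &v) != 1)
		return 0;
	*ns = v;
	return 1;
}

/* Busy percentage between two counter samples taken interval_ns apart. */
static double engine_busy_percent(uint64_t prev_ns, uint64_t cur_ns,
				  uint64_t interval_ns)
{
	if (interval_ns == 0 || cur_ns < prev_ns)
		return 0.0;
	return 100.0 * (double)(cur_ns - prev_ns) / (double)interval_ns;
}
```

A monitoring tool would read /proc/<pid>/fdinfo/<fd> for each DRM client twice, parse the counters, and sum the deltas per engine.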
The UVD/VCE support in #140 did not enable VCN support. I explained why not in https://github.com/clbr/radeontop/issues/29#issuecomment-1272401377 . VCN support was requested by @userofryzen in https://github.com/clbr/radeontop/issues/29#issuecomment-1272303448 .
This is because I don't know if UVD and VCN use the same registers. If they are the same, then the PR will be easy and simple. If they are different, then someone needs to go and find the documentation that defines where the VCN usage register is located. I have a Navi2 card arriving in about 1 week, so I might look into this if I decide to use that card for video.