geerlingguy / raspberry-pi-pcie-devices

Raspberry Pi PCI Express device compatibility database
http://pipci.jeffgeerling.com
GNU General Public License v3.0

Test GPU (AMD Radeon RX 6700 XT) #222

Open geerlingguy opened 2 years ago

geerlingguy commented 2 years ago

Working branch: https://github.com/geerlingguy/linux/pull/1

Just received an OEM AMD Radeon RX 6700 XT in the mail. I was able to get it at MSRP+Shipping, which is something of a miracle these days:

[Image: DSC02333]

[Image: DSC02363]

I will be interested in seeing what, if anything, the card does when powered up and plugged into the Compute Module 4 IO Board!

The following issues are closely related:

Latest recap: https://github.com/geerlingguy/raspberry-pi-pcie-devices/issues/222#issuecomment-919530424

geerlingguy commented 2 years ago

Now running into (with memory training disabled by just adding a return in that function):

[   86.231170] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
[   86.231211] DEBUG: Passed amdgpu_device_ip_init 2394 
[   86.749608] DEBUG: On IP block <smu>
[   87.269896] DEBUG: Passed amdgpu_device_ip_init 2394 
[   87.805599] DEBUG: On IP block <gfx_v10_0>

[   88.338659] Unable to handle kernel paging request at virtual address ffffffc011d47000
[   88.338800] Mem abort info:
[   88.338842]   ESR = 0x96000061
[   88.338888]   EC = 0x25: DABT (current EL), IL = 32 bits
[   88.338956]   SET = 0, FnV = 0
[   88.339002]   EA = 0, S1PTW = 0
[   88.339048]   FSC = 0x21: alignment fault
[   88.339101] Data abort info:
[   88.339143]   ISV = 0, ISS = 0x00000061
[   88.339195]   CM = 0, WnR = 1
[   88.339239] swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000001135000
[   88.339321] [ffffffc011d47000] pgd=10000000fbfff003, p4d=10000000fbfff003, pud=10000000fbfff003, pmd=1000000040a97003, pte=006800004ec9470f
[   88.339504] Internal error: Oops: 96000061 [#1] PREEMPT SMP
[   88.339571] Modules linked in: amdgpu(+) drm_ttm_helper ttm i2c_algo_bit bnep hci_uart btbcm bluetooth ecdh_generic ecc 8021q garp stp llc snd_soc_hdmi_codec brcmfmac brcmutil vc4 v3d cec gpu_sched drm_kms_helper cfg80211 bcm2835_codec(C) bcm2835_v4l2(C) bcm2835_isp(C) rfkill raspberrypi_hwmon bcm2835_mmal_vchiq(C) v4l2_mem2mem drm videobuf2_vmalloc videobuf2_dma_contig videobuf2_memops videobuf2_v4l2 i2c_brcmstb drm_panel_orientation_quirks videobuf2_common dwc2 roles snd_soc_core videodev snd_compress snd_bcm2835(C) mc vc_sm_cma(C) snd_pcm_dmaengine snd_pcm snd_timer snd syscopyarea sysfillrect sysimgblt fb_sys_fops backlight rpivid_mem nvmem_rmem uio_pdrv_genirq uio i2c_dev aes_neon_bs sha256_generic aes_neon_blk crypto_simd cryptd ip_tables x_tables ipv6
[   88.340511] CPU: 0 PID: 689 Comm: modprobe Tainted: G         C        5.14.2-v8+ #1
[   88.340600] Hardware name: Raspberry Pi Compute Module 4 Rev 1.0 (DT)
[   88.340670] pstate: 40000005 (nZcv daif -PAN -UAO -TCO BTYPE=--)
[   88.340741] pc : __memset+0x16c/0x188
[   88.340802] lr : gfx_v10_0_sw_init+0xa58/0x1380 [amdgpu]
[   88.341683] sp : ffffffc01201b720
[   88.341726] x29: ffffffc01201b720 x28: ffffff804def4000 x27: ffffff804dee0000
[   88.341819] x26: ffffff804dee8000 x25: 0000000000009430 x24: 00000000000093b8
[   88.341910] x23: 00000000000093c0 x22: 0000000000004000 x21: ffffff804deeb5d0
[   88.341999] x20: ffffff804dee9750 x19: ffffffc0112f8948 x18: fffffffe013f7f82
[   88.342087] x17: 0000000000010000 x16: ffffff804178a000 x15: c2fb1b652c1ffd7e
[   88.342176] x14: 7d00526c8d87f496 x13: 8a9ab05cbcd43841 x12: ffffffc01149a640
[   88.342264] x11: ffffff8040aa3440 x10: fffffffe00000000 x9 : 0000000000000000
[   88.342354] x8 : ffffffc011d47000 x7 : 0000000000000000 x6 : 000000000000003f
[   88.342443] x5 : 0000000000000040 x4 : 0000000000000000 x3 : 0000000000000004
[   88.342530] x2 : 0000000000003fc0 x1 : 0000000000000000 x0 : ffffffc011d47000
[   88.342620] Call trace:
[   88.342654]  __memset+0x16c/0x188
[   88.342703]  amdgpu_device_init+0x1458/0x1f98 [amdgpu]
[   88.343462]  amdgpu_driver_load_kms+0x30/0x2b8 [amdgpu]
[   88.344213]  amdgpu_pci_probe+0xe4/0x1b0 [amdgpu]
[   88.344957]  pci_device_probe+0xc0/0x190
[   88.345019]  really_probe+0xb8/0x318
[   88.345073]  __driver_probe_device+0x80/0xe8
[   88.345128]  driver_probe_device+0x88/0x118
[   88.345183]  __driver_attach+0x78/0x110
[   88.345236]  bus_for_each_dev+0x7c/0xd0
[   88.345286]  driver_attach+0x2c/0x38
[   88.345334]  bus_add_driver+0x194/0x1f8
[   88.345385]  driver_register+0x6c/0x128
[   88.345437]  __pci_register_driver+0x4c/0x58
[   88.348361]  amdgpu_init+0x64/0x1000 [amdgpu]
[   88.351968]  do_one_initcall+0x54/0x2c0
[   88.354870]  do_init_module+0x60/0x248
[   88.357747]  load_module+0x2208/0x2758
[   88.360597]  __do_sys_finit_module+0xbc/0xf8
[   88.363453]  __arm64_sys_finit_module+0x28/0x38
[   88.366304]  invoke_syscall+0x4c/0x110
[   88.369151]  el0_svc_common+0x100/0x128
[   88.371988]  do_el0_svc+0x30/0x98
[   88.374835]  el0_svc+0x24/0x38
[   88.377700]  el0t_64_sync_handler+0x90/0xb8
[   88.380582]  el0t_64_sync+0x178/0x17c
[   88.383451] Code: 91010108 54ffff4a 8b040108 cb050042 (d50b7428) 
[   88.386344] ---[ end trace 3903d5f2f6d73ff0 ]---
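
For reference, the syndrome fields in the oops above can be unpacked by hand. This is a minimal userspace sketch, assuming the standard AArch64 ESR_ELx layout for data aborts; the struct and function names are made up for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an AArch64 ESR_ELx value for a data abort (EC 0x24 or 0x25). */
struct esr_data_abort {
    uint32_t ec;   /* exception class, bits [31:26] */
    uint32_t il;   /* instruction length, bit [25]: 1 = 32-bit instruction */
    uint32_t iss;  /* instruction specific syndrome, bits [24:0] */
    uint32_t cm;   /* ISS[8]: fault came from a cache maintenance instruction */
    uint32_t wnr;  /* ISS[6]: 1 = write, 0 = read */
    uint32_t fsc;  /* ISS[5:0]: data fault status code */
};

/* Hypothetical helper: split a raw ESR value into the fields dmesg prints. */
static struct esr_data_abort decode_esr_data_abort(uint32_t esr)
{
    struct esr_data_abort d;

    d.ec = (esr >> 26) & 0x3f;
    assert(d.ec == 0x24 || d.ec == 0x25); /* only data aborts are handled */
    d.il = (esr >> 25) & 0x1;
    d.iss = esr & 0x1ffffff;
    d.cm = (d.iss >> 8) & 0x1;
    d.wnr = (d.iss >> 6) & 0x1;
    d.fsc = d.iss & 0x3f;
    return d;
}

/* True if insn encodes "dc zva, xN"; the register number is the low 5 bits. */
static int is_dc_zva(uint32_t insn)
{
    return (insn & ~UINT32_C(0x1f)) == UINT32_C(0xd50b7420);
}
```

Feeding in 0x96000061 gives EC 0x25, WnR 1 and FSC 0x21, matching the log. If I'm reading the encoding right, the faulting word d50b7428 on the Code: line is dc zva, x8, the cache-line zeroing instruction that the arm64 memset uses for large zero fills, and x8 holds the faulting address. That instruction can raise an alignment fault when the target is Device memory, which would fit the "alignment fault" reported here.
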

Coreforge commented 2 years ago

https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c#L4441 This one looks like another culprit. memset on a bo, which should be memset_io, which in this case should be memset_io_pcie
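
To make the idea concrete: the point of a memset_io_pcie-style helper is to avoid wide or special stores and touch the buffer only with naturally aligned accesses of 32 bits or less. The patch itself is not shown in this thread, so the following is only a userspace sketch of that technique with a made-up name (memset_max32), not the actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a memset_io_pcie()-style helper: fill n bytes
 * using only byte stores and naturally aligned 32-bit stores.
 */
static void memset_max32(volatile void *dst, int c, size_t n)
{
    volatile uint8_t *p = dst;
    volatile uint32_t *w;
    uint8_t byte = (uint8_t)c;
    uint32_t word = byte * UINT32_C(0x01010101); /* byte repeated 4 times */

    assert(dst != NULL || n == 0);

    /* Head: single bytes until the pointer is 4-byte aligned. */
    while (n > 0 && ((uintptr_t)p & 3u) != 0) {
        *p++ = byte;
        n--;
    }

    /* Body: one aligned 32-bit store per word. */
    w = (volatile uint32_t *)p;
    while (n >= 4) {
        *w++ = word;
        n -= 4;
    }

    /* Tail: remaining 0 to 3 bytes. */
    p = (volatile uint8_t *)w;
    while (n > 0) {
        *p++ = byte;
        n--;
    }
}
```

In the driver the equivalent loop would go through the kernel's MMIO accessors rather than plain volatile stores. It is slower than the optimized memset, but it never issues anything wider than 32 bits.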

geerlingguy commented 2 years ago

Heh... I'm playing whack-a-mole with memset here—current debug cycle has me locking up on amdgpu_gfx_kiq_init. I'm just going to keep doing search-and-replace on memsets...

Coreforge commented 2 years ago

https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c#L348 another one

geerlingguy commented 2 years ago

The good news is, each time we encounter a memset that kills everything, if I replace it, we get further. So... that's not bad! I'll keep it up and keep documenting any more segfaults.

Coreforge commented 2 years ago

If it gets too tedious replacing them manually and recompiling every time, doing it file by file should be fine too. The memset_io_pcie would just be slower, but I don't think there is anything it wouldn't work with.

geerlingguy commented 2 years ago

Now we're dying somewhere inside amdgpu_vcn_resume(). Trying to find exactly where (I already updated those instances to memcpy_toio_pcie(), but maybe this is also dying like the memory training substitute memcpy_fromio_pcie() was?).

geerlingguy commented 2 years ago

Dying on this line: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c#L380

(note that I have it substituted for memset_io_pcie(ptr, 0, size); and it's still dying).

geerlingguy commented 2 years ago

@Coreforge - btw, let me just say that over the past 6 months or so I have learned a ton about the kernel, memory mapping, etc. from you, @dmarti-amd, @elFarto, @daniel-thompson, and a few others who've been so helpful in these issues, so thanks for that :)

I don't know if I'll be able to convey all that thanks in my next video on my GPU shenanigans, but I wanted to make sure I publicly said at least something here. Anything is worth doing if you at least learn something along the way, and I've learned (and am learning) a lot.

But back to the issue at hand...

geerlingguy commented 2 years ago

Okay, so on the memset_io_pcie() call, I put:

            printk(KERN_ALERT "DEBUG: Passed %s %d \n",__FUNCTION__,__LINE__);
            msleep(500);

            printk(KERN_ALERT "DEBUG: ptr: %p, size: %u \n",ptr,size);
            msleep(500);

            memset_io_pcie(ptr, 0, size);

            printk(KERN_ALERT "DEBUG: Passed %s %d \n",__FUNCTION__,__LINE__);
            msleep(500);

And in the logs, I see:

[  212.908681] [drm] PSP loading VCN firmware
[  212.908692] DEBUG: Passed vcn_v3_0_sw_init 166 
[  213.438161] DEBUG: Passed amdgpu_vcn_resume 381 
[  213.950227] DEBUG: ptr: 00000000f786a209, size: 659456
(hard lockup at this point)

Coreforge commented 2 years ago

I haven't touched the kernel before starting this either, so I've been learning a lot too.

Since that line just zeros the remaining space after the firmware, you could also just try commenting out that line and see what happens, or disable VCN altogether, as it's just needed for video encoding and decoding, and we don't even have a terminal yet. I haven't been able to get the UVD working on radeon either.
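
As a rough picture of what "zeros the remaining space after the firmware" means, here is a userspace sketch. The function name, the buffer layout, and the zero_tail switch are all made up for illustration; in the driver the buffer lives in VRAM and is reached over PCIe, which is where the trouble starts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch: copy a firmware image to the start of a buffer and
 * optionally zero whatever is left after it. Returns the size of the tail.
 */
static size_t load_fw_and_zero_tail(uint8_t *bo, size_t bo_size,
                                    const uint8_t *fw, size_t fw_size,
                                    int zero_tail)
{
    size_t tail;

    assert(fw_size <= bo_size);

    memcpy(bo, fw, fw_size);           /* firmware goes first */
    tail = bo_size - fw_size;          /* bytes left after the image */
    if (zero_tail)
        memset(bo + fw_size, 0, tail); /* the step that is being skipped */
    return tail;
}
```

Passing zero_tail = 0 corresponds to commenting the line out: the firmware is still copied, but the bytes after it keep whatever they held before.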

geerlingguy commented 2 years ago

Quick recap at this point in the thread:

  1. I'm using @Coreforge's patch to add in _pcie() memcpy/memset functions that are safe writing <= 32 bits.
  2. I disabled write combining / cache in amdgpu_ttm_tt_create() by setting caching = ttm_uncached;.
  3. I disabled memory training inside psp_v11_0_memory_training().
  4. I've played whack-a-mole with memset() calls causing segfaults.
  5. I commented out the memset line that was killing amdgpu_vcn_resume() (see a few comments above).
  6. (Added) I am loading the module with sudo modprobe amdgpu fw_load_type=0 so it will not load firmware with PSP.

Right now I just have all the changes stashed locally. I should really clean it up a tiny bit and put it in a branch somewhere in case someone else wants to try this stuff out :)

geerlingguy commented 2 years ago

Okay, with that commented, it's dying inside this block (https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c#L251):

        fw_shared = adev->vcn.inst[i].fw_shared_cpu_addr;
        fw_shared->present_flag_0 |= cpu_to_le32(AMDGPU_VCN_SW_RING_FLAG) |
                         cpu_to_le32(AMDGPU_VCN_MULTI_QUEUE_FLAG) |
                         cpu_to_le32(AMDGPU_VCN_FW_SHARED_FLAG_0_RB);
        fw_shared->sw_ring.is_enabled = cpu_to_le32(DEC_SW_RING_ENABLED);

V10lator commented 2 years ago

disable VCN altogether, as it's just needed for video encoding and decoding

@geerlingguy If I were you I would listen to that, as there's really no need for VCN ( https://en.wikipedia.org/wiki/Video_Core_Next ) before getting basic rendering done. IIRC early Linux drivers from AMD had it disabled, too.

geerlingguy commented 2 years ago

What's the easiest way to disable VCN in the driver? I'd be happy to do that since it seems to keep dying in there.

Coreforge commented 2 years ago

There might be a module parameter (not sure, radeon has one), otherwise, just immediately returning from vcn_v3_0_sw_init and vcn_v3_0_hw_init should hopefully do it.

geerlingguy commented 2 years ago

It looks like I could comment out the amdgpu_device_ip_block_add(adev, &vcn_v3_0_ip_block); line under CHIP_NAVY_FLOUNDER inside nv_set_ip_blocks() in nv.c?

Coreforge commented 2 years ago

That seems like a better way

geerlingguy commented 2 years ago

Well, good news, I found more memsets :D

[   54.845787] DEBUG: On IP block <jpeg_v3_0>
[   55.358252] DEBUG: Passed amdgpu_device_ip_init 2394 
[   55.869810] DEBUG: Passed amdgpu_device_ip_init 2440 
[   56.381849] DEBUG: Passed amdgpu_device_ip_init 2446 
[   56.894257] Unable to handle kernel paging request at virtual address ffffffc012435000
[   56.894369] Mem abort info:
[   56.894412]   ESR = 0x96000061
[   56.894459]   EC = 0x25: DABT (current EL), IL = 32 bits
[   56.894527]   SET = 0, FnV = 0
[   56.894572]   EA = 0, S1PTW = 0
[   56.894618]   FSC = 0x21: alignment fault
[   56.894673] Data abort info:
[   56.894715]   ISV = 0, ISS = 0x00000061
[   56.894767]   CM = 0, WnR = 1
[   56.894813] swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000001135000
[   56.894895] [ffffffc012435000] pgd=10000000fbfff003, p4d=10000000fbfff003, pud=10000000fbfff003, pmd=1000000040b25003, pte=006800004f00070f
[   56.895075] Internal error: Oops: 96000061 [#1] PREEMPT SMP
[   56.895143] Modules linked in: amdgpu(+) drm_ttm_helper ttm i2c_algo_bit bnep hci_uart btbcm bluetooth ecdh_generic ecc 8021q garp stp llc snd_soc_hdmi_codec brcmfmac brcmutil vc4 cec v3d gpu_sched drm_kms_helper cfg80211 rfkill drm drm_panel_orientation_quirks bcm2835_v4l2(C) bcm2835_codec(C) bcm2835_isp(C) snd_soc_core bcm2835_mmal_vchiq(C) v4l2_mem2mem videobuf2_vmalloc videobuf2_dma_contig dwc2 snd_bcm2835(C) snd_compress videobuf2_memops videobuf2_v4l2 snd_pcm_dmaengine videobuf2_common roles raspberrypi_hwmon i2c_brcmstb snd_pcm videodev snd_timer snd mc syscopyarea vc_sm_cma(C) rpivid_mem sysfillrect sysimgblt fb_sys_fops backlight uio_pdrv_genirq uio nvmem_rmem i2c_dev aes_neon_bs sha256_generic aes_neon_blk crypto_simd cryptd ip_tables x_tables ipv6
[   56.896082] CPU: 3 PID: 697 Comm: modprobe Tainted: G         C        5.14.2-v8+ #1
[   56.896170] Hardware name: Raspberry Pi Compute Module 4 Rev 1.0 (DT)
[   56.896242] pstate: 40000005 (nZcv daif -PAN -UAO -TCO BTYPE=--)
[   56.896313] pc : __memset+0x16c/0x188
[   56.896375] lr : amdgpu_sa_bo_manager_init+0xbc/0xf0 [amdgpu]
[   56.897212] sp : ffffffc0120cb760
[   56.897254] x29: ffffffc0120cb760 x28: ffffff804dbe0000 x27: 0000000000000008
[   56.897348] x26: ffffffc009626220 x25: 0000000000100000 x24: 0000000000006000
[   56.897440] x23: ffffff804dbe0000 x22: 0000000000000002 x21: 0000000000001000
[   56.897530] x20: 0000000000000000 x19: ffffff804dbe6478 x18: 0000000000000000
[   56.897620] x17: 0000000000000000 x16: 0000000000000000 x15: 000000558b9a5d60
[   56.897708] x14: 0000000000000000 x13: 0000000000000000 x12: ffffffc01149a880
[   56.897797] x11: ffffff8041f10840 x10: fffffffe00000000 x9 : 0000000000000000
[   56.897887] x8 : ffffffc012435000 x7 : 0000000000000000 x6 : 000000000000003f
[   56.897975] x5 : 0000000000000040 x4 : 0000000000000000 x3 : 0000000000000004
[   56.898063] x2 : 00000000000fffc0 x1 : 0000000000000000 x0 : ffffffc012435000
[   56.898152] Call trace:
[   56.898188]  __memset+0x16c/0x188
[   56.898239]  amdgpu_ib_pool_init+0x68/0x118 [amdgpu]
[   56.899001]  amdgpu_device_init+0x1324/0x1ff0 [amdgpu]
[   56.899750]  amdgpu_driver_load_kms+0x30/0x2b8 [amdgpu]
[   56.900498]  amdgpu_pci_probe+0xe4/0x1b0 [amdgpu]
[   56.901243]  pci_device_probe+0xc0/0x190
[   56.901306]  really_probe+0xb8/0x318
[   56.904160]  __driver_probe_device+0x80/0xe8
[   56.907001]  driver_probe_device+0x88/0x118
[   56.909835]  __driver_attach+0x78/0x110
[   56.912652]  bus_for_each_dev+0x7c/0xd0
[   56.915462]  driver_attach+0x2c/0x38
[   56.918251]  bus_add_driver+0x194/0x1f8
[   56.921036]  driver_register+0x6c/0x128
[   56.923810]  __pci_register_driver+0x4c/0x58
[   56.926581]  amdgpu_init+0x64/0x1000 [amdgpu]
[   56.930075]  do_one_initcall+0x54/0x2c0
[   56.932885]  do_init_module+0x60/0x248
[   56.935722]  load_module+0x2208/0x2758
[   56.938538]  __do_sys_finit_module+0xbc/0xf8
[   56.941353]  __arm64_sys_finit_module+0x28/0x38
[   56.944168]  invoke_syscall+0x4c/0x110
[   56.946980]  el0_svc_common+0x100/0x128
[   56.949777]  do_el0_svc+0x30/0x98
[   56.952569]  el0_svc+0x24/0x38
[   56.955352]  el0t_64_sync_handler+0x90/0xb8
[   56.958128]  el0t_64_sync+0x178/0x17c
[   56.960892] Code: 91010108 54ffff4a 8b040108 cb050042 (d50b7428) 
[   56.963680] ---[ end trace e008c3746cd3d83e ]---

I'll keep up the debugging later. Have to grab dinner for the kid's birthday now. Thanks again for the help today! Maybe I'll be able to plug my way through the entire driver by tomorrow 🤪 — is AMD hiring any Linux driver engineers lately? heh

geerlingguy commented 2 years ago

Now getting stuck on line:

static int amdgpu_device_ip_init(struct amdgpu_device *adev)
...
    r = amdgpu_device_fw_loading(adev);

Checking inside that function where it's dying, probably in hw_init.

Edit: It is in hw_init. I first thought it was the jpeg_v3_0.c block's hw_init, but maybe not. It's in the 3rd IP block, which seems to be navi10_ih_ip_block?

Coreforge commented 2 years ago

https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c#L73 should be here. I'll look through some more code once I get back from school.

V10lator commented 2 years ago

is AMD hiring any Linux driver engineers lately?

I know this was a joke, but I'm pinging @johnbridgman anyway. Not only because of this question, but also for upstreaming this work once it's finished, and maybe even for helpful tips about some blockers. (I told @geerlingguy that it might be a good idea to start a thread on the Phoronix forums to get attention from AMD and Intel devs. I guess he overlooked that comment, which I made on an older YouTube video.)

geerlingguy commented 2 years ago

@Coreforge - I already have that line switched to memset_io_pcie().

geerlingguy commented 2 years ago

lol I was off-by-one. I was reading the ip block index as 1, 2, 3, but it's 0, 1, 2, 3. Added some more debugging to identify it is actually the psp block's hw_init that's failing.

                printk(KERN_ALERT "DEBUG: Passed %s %d \n",__FUNCTION__,__LINE__);
                printk(KERN_ALERT "DEBUG: On IP block <%s>\n", adev->ip_blocks[i].version->funcs->name);
                printk(KERN_ALERT "DEBUG: On IP block %d \n",i);
                msleep(500);

                r = adev->ip_blocks[i].version->funcs->hw_init(adev);

[   52.988929] DEBUG: Passed amdgpu_device_fw_loading 2367
[   52.989008] DEBUG: On IP block <psp>
[   52.989060] DEBUG: On IP block 3 
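
The off-by-one is easy to reproduce outside the kernel. In this sketch only two entries come from the thread: index 3 printing as psp, and the "3rd" block being navi10_ih. The other names and their order are assumptions, and the helper names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative block list; the array index is zero-based, as in the driver loop. */
static const char *const ip_blocks[] = {
    "nv_common", /* index 0 (assumed) */
    "gmc_v10_0", /* index 1 (assumed) */
    "navi10_ih", /* index 2: the "3rd" block when counting from 1 */
    "psp",       /* index 3: what the debug output reports */
    "smu",       /* index 4 (assumed) */
};

#define NUM_IP_BLOCKS (sizeof(ip_blocks) / sizeof(ip_blocks[0]))

/* Zero-based index of a block by name, or -1 if it is not in the list. */
static int ip_block_index(const char *name)
{
    for (size_t i = 0; i < NUM_IP_BLOCKS; i++) {
        if (strcmp(ip_blocks[i], name) == 0)
            return (int)i;
    }
    return -1;
}

/* Name of the block a human would call the "ordinal-th" (counting from 1). */
static const char *ip_block_by_ordinal(size_t ordinal)
{
    assert(ordinal >= 1 && ordinal <= NUM_IP_BLOCKS);
    return ip_blocks[ordinal - 1];
}
```

So "block 3" in the log is ip_blocks[3] (psp), while the "3rd block" read off a list by eye is ip_blocks[2] (navi10_ih). Printing the name next to the index, as the debug lines above now do, removes the ambiguity.
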

Coreforge commented 2 years ago

The memset in psp_load_fw seems suspicious

geerlingguy commented 2 years ago

Indeed, we're now blocking on https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c#L2598

    ret = psp_load_fw(adev);
    if (ret) {
        DRM_ERROR("PSP firmware loading failed\n");
        goto failed;
    }

Testing a find-and-replace of memset()'s in that file now. Would the couple instances of memcpy() be suspect too?

Coreforge commented 2 years ago

https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c#L360 this function looks suspicious to me, but I'm not sure. Shouldn't hurt to replace everything though, it's just a bit slower and not as clean, but we can worry about that later.

https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c#L3300 these ones are very likely an issue.

psp_rl_load too.

geerlingguy commented 2 years ago

It's dying on this line: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c#L2517

    ret = amdgpu_bo_create_kernel(adev, PSP_CMD_BUFFER_SIZE, PAGE_SIZE,
                      AMDGPU_GEM_DOMAIN_VRAM,
                      &psp->cmd_buf_bo, &psp->cmd_buf_mc_addr,
                      (void **)&psp->cmd_buf_mem);
    if (ret)
        goto failed;

    memset(psp->fence_buf, 0, PSP_FENCE_BUFFER_SIZE);

    ret = psp_ring_init(psp, PSP_RING_TYPE__KM);
    if (ret) {
        DRM_ERROR("PSP ring init failed!\n");
        goto failed;
    }

Specifically, the memset—note that I changed it to the following and it still hard locks up the system:

    memset_io_pcie(psp->fence_buf, 0, PSP_FENCE_BUFFER_SIZE);

So next step I'm going to disable PSP and see where we get to next.

Coreforge commented 2 years ago

Try setting the module parameter fw_load_type to 0. If the driver is currently trying to load firmware through the PSP, that should stop it from doing so. There is no default listed though, so I don't know if it would do anything. (This should prevent issues later that might come from disabling PSP.)

geerlingguy commented 2 years ago

Disabling PSP got to:

[  113.850288] DEBUG: Passed amdgpu_device_ip_init 2511 
[  118.168910] amdgpu 0000:03:00.0: amdgpu: Msg issuing pre-check failed(0xffffffc2) and SMU may be not in the right state!
[  118.168925] amdgpu 0000:03:00.0: amdgpu: SMC engine is not correctly up!
[  118.168931] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <smu> failed -62
[  118.169336] DEBUG: Passed amdgpu_device_init 3785 
[  118.682207] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed
[  118.682221] DEBUG: Passed amdgpu_device_init 3943 
[  119.194202] amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init
[  119.194214] amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.
[  119.196227] amdgpu: probe of 0000:03:00.0 failed with error -62
[  119.197508] [drm] amdgpu: ttm finalized

It seems like SMU is related to PSP and that causes some issues. But it's nice the thing doesn't entirely lock up :D

For the module parameter, do you mean just add sudo modprobe amdgpu fw_load_type=0 at runtime? (Note: it also looks like there's an ip_block_mask parameter I could use instead of modifying the code to exclude IP blocks like VCN...)
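
On ip_block_mask: as far as I understand it, the parameter is a bitmask indexed by IP block number, with all bits set by default and a cleared bit disabling that block. That reading is an assumption worth checking against the driver source, and the helper names below are made up. Under that assumption, a mask value can be worked out like this:

```c
#include <assert.h>
#include <stdint.h>

/* Clear the bit for one IP block index, leaving all other blocks enabled. */
static uint32_t ip_block_mask_disable(uint32_t mask, unsigned int index)
{
    assert(index < 32);
    return mask & ~(UINT32_C(1) << index);
}

/* Non-zero if the block at this index is still enabled in the mask. */
static int ip_block_mask_enabled(uint32_t mask, unsigned int index)
{
    assert(index < 32);
    return (int)((mask >> index) & 1u);
}
```

Starting from 0xffffffff and disabling index 3 gives 0xfffffff7. Which index a given block (VCN, for example) has depends on the order the blocks are added for the ASIC, so the index needs to be read from the init log first.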

geerlingguy commented 2 years ago

Using fw_load_type=0 got a kernel panic (it prints to the screen but not to my remote session, so here's a screenshot):

[Screenshot: IMG_5167]

Looks like it's now dying inside amdgpu_device_indirect_wreg().

geerlingguy commented 2 years ago

It looks like after running through the function a few times successfully as part of nv_common_hw_init, it dies at this line (definitely) during navi10_ih_hw_init (maybe): https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L707

    readl(pcie_data_offset);

And dies with that stack trace.

Coreforge commented 2 years ago

Try adding some memory barriers (__iomb()) in that function, at least after the last readl. Sometimes too many accesses in quick succession can cause crashes (at least it looked like that with fb access on radeon, where the Pi would crash with too many back-to-back writes).

geerlingguy commented 2 years ago

@Coreforge - Adding in:

    writel(reg_addr, pcie_index_offset);
    readl(pcie_index_offset);
    __iomb();
    writel(reg_data, pcie_data_offset);
    __iomb();
    readl(pcie_data_offset);
    __iomb();

Doesn't seem to have made a difference :(
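
For anyone following along, the access pattern in that function is an index/data register pair: write the target register number to the index port, read it back, write the value to the data port, read it back. Below is a userspace simulation of that sequence. The fake device, the port constants, and the helper names are invented for illustration, and a C11 fence stands in for the kernel barrier.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define FAKE_NUM_REGS 16u

enum { PCIE_INDEX = 0, PCIE_DATA = 1 }; /* the two ports of the pair */

/* Invented model of a device exposing registers through an index/data pair. */
struct fake_gpu {
    uint32_t index;               /* register selected via the index port */
    uint32_t regs[FAKE_NUM_REGS]; /* the registers behind the data port */
    unsigned int accesses;        /* number of port accesses so far */
};

static void fake_writel(struct fake_gpu *gpu, int port, uint32_t val)
{
    gpu->accesses++;
    if (port == PCIE_INDEX)
        gpu->index = val;
    else
        gpu->regs[gpu->index % FAKE_NUM_REGS] = val;
}

static uint32_t fake_readl(struct fake_gpu *gpu, int port)
{
    gpu->accesses++;
    if (port == PCIE_INDEX)
        return gpu->index;
    return gpu->regs[gpu->index % FAKE_NUM_REGS];
}

/* Userspace stand-in for a barrier such as __iomb(). */
static void io_barrier(void)
{
    atomic_thread_fence(memory_order_seq_cst);
}

/* The write sequence from the snippet above, against the fake device. */
static void indirect_wreg(struct fake_gpu *gpu, uint32_t reg_addr,
                          uint32_t reg_data)
{
    assert(reg_addr < FAKE_NUM_REGS);

    fake_writel(gpu, PCIE_INDEX, reg_addr);
    (void)fake_readl(gpu, PCIE_INDEX); /* read back: index write has landed */
    io_barrier();
    fake_writel(gpu, PCIE_DATA, reg_data);
    io_barrier();
    (void)fake_readl(gpu, PCIE_DATA);  /* read back: data write has landed */
    io_barrier();
}
```

Each register write costs four bus accesses, and the barriers only constrain ordering on the CPU side. If the fault comes from how the device or the PCIe bridge answers the final read, extra barriers would not be expected to help, which is consistent with the result above.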

Still getting that Kernel panic on amdgpu_device_indirect_wreg+0xb4/0x108 [amdgpu]

Coreforge commented 2 years ago

One thing that kinda worked with the fb issue was instead of writing directly, using memcpy_toio and memcpy_fromio to do it. It's a lot slower, but prevented the SError.

geerlingguy commented 2 years ago

Attempting conversion of those readl/writel lines to __raw_readl() and __raw_writel() results in a kernel panic on _raw_spin_unlock_irqrestore() (the line after those statements).

geerlingguy commented 2 years ago

I was just going to YOLO, but attempting to just not load psp_v11_0_ip_block and smu_v11_0_ip_block, I get the same thing :(

Chlorophytus commented 2 years ago

What phase of GPU init are we in? Are we getting VBIOS working? I do know some x86 and Arm assembly so I could try to get started on it.

I'll look into how the VBIOS initializes. The SMU and PSP handle memory/security/etc. SEE: AMD GPUopen

EDIT I think we're in VBIOS stage or enumerating PCI Express. I have some things to do so I might help in a bit.

geerlingguy commented 2 years ago

@Chlorophytus - Thanks!

I'm also pushing up my latest patches to a branch so I can better track changes and current debug process there: https://github.com/geerlingguy/linux/pull/1

geerlingguy commented 2 years ago

And just to leave one more update, it seems like the break happens inside https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L2264

    if (!amdgpu_sriov_vf(adev) || adev->asic_type == CHIP_TONGA)
        r = amdgpu_pm_load_smu_firmware(adev, &smu_version);

And inside that function, it dies on the line https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/pm/amdgpu_dpm.c#L1595:

        r = adev->powerplay.pp_funcs->load_firmware(adev->powerplay.pp_handle);

And the kernel panic points to:

pc : amdgpu_device_indirect_wreg+0x7c/0xb0 [amdgpu]
lr: amdgpu_device_indirect_wreg+0x40/0xb0 [amdgpu]

Coreforge commented 2 years ago

Looking into SError a bit (and assuming the code in the kernel panic is the ESR value), it's hard to say what the exact issue is, as the IDS bit is set, which means the ISS decoding is implementation specific, and I can only find the ARM peripherals datasheet for the bcm2711. So unless a Pi engineer can look up the decoding, that's not a great way to find out what the issue is exactly (and it probably wouldn't help directly, as we'd only know what the chip doesn't like, but not what exactly triggers it or how to prevent it).
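
The IDS check itself is simple to sketch. This assumes the standard AArch64 layout, where an SError has exception class 0x2f and IDS is bit 24 of the syndrome; the helper names are made up, and no value from the screenshot is used here.

```c
#include <assert.h>
#include <stdint.h>

#define ESR_EC_SERROR 0x2fu

/* Exception class, bits [31:26] of the ESR value. */
static uint32_t esr_exception_class(uint32_t esr)
{
    return (esr >> 26) & 0x3f;
}

/*
 * For an SError, returns 1 if IDS (bit 24) is set, meaning the rest of the
 * syndrome is implementation defined and needs vendor documentation.
 */
static int serror_iss_is_impdef(uint32_t esr)
{
    assert(esr_exception_class(esr) == ESR_EC_SERROR);
    return (int)((esr >> 24) & 1u);
}
```

With IDS set, the low bits cannot be decoded from the architecture manual alone, which is the dead end described above.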

Coreforge commented 2 years ago

You could try changing the amdgpu_device_indirect_wreg to

    __raw_writel(reg_addr, pcie_index_offset);
    dsb();
    __raw_readl(pcie_index_offset);
    dsb();
    __raw_writel(reg_data, pcie_data_offset);
    dsb();
    __raw_readl(pcie_data_offset);
    dsb();

If the compiler complains, use dsb(sy);, though the Arm reference says the option can be omitted if you want a full system memory barrier. Another (but only very temporary) option might be to add a printk into that function. That might give it enough time so it doesn't crash, but it would spam dmesg. Adding udelay(1); might also work, but I don't think it did with the fbdev code.

I'm not sure if we're through the bios at this stage, but since the driver is loading various firmware and setting up rings, I think we might be past the bios.

richardkonsky commented 2 years ago

Try flashing the graphics card on a PC with custom firmware. That's what some people do to make a PC graphics card work with Macs, but make it work with Pi OS instead.

geerlingguy commented 2 years ago

@Coreforge - Still hitting the same kernel panic with that code (compiler didn't complain).

@richardkonsky - I think in that case special firmware exists to get a card working on certain Mac models, because a card can be re-flashed for Mac vs. PC. In our case we're dealing with PCIe memory access issues on the Pi's system on a chip, which don't really seem related to architecture differences.

geerlingguy commented 2 years ago

At the suggestion of someone else who's dealt with some annoying 64-bit bugs on Pi OS in a driver, I'm going to try compiling the graphics drivers on 32-bit Pi OS just to see what happens.

32-bit cross-compile on my M1 Mac:

make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- bcm2711_defconfig
make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- menuconfig  # Add AMDGPU here.
make -j8 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- zImage modules dtbs

Install AMD firmware on the Pi and upgrade the system:

sudo apt-get update
sudo apt-get install -y firmware-amd-graphics
sudo apt-get dist-upgrade -y
sudo reboot

Then, copy the compiled kernel to the Pi, blacklist amdgpu in /etc/modprobe.d/blacklist-amdgpu.conf, reboot and run sudo modprobe amdgpu, and dmesg shows:

TODO

Edit: Hmm... with my custom kernel, the Pi seems to boot partially, then the screen goes black and I get nothing. And I just realized I didn't blacklist the driver. Oops!

Edit 2: And... apparently the Pi won't boot all the way on the 5.14.y branch, regardless. It just hangs after the initial boot period on a black screen. This 32-bit compile wasn't as straightforward as I thought!

Coreforge commented 2 years ago

I've had more luck with 64bit than 32bit, though I only tried 32bit once and switched back to the 64bit version. One thing you could try is to enable AER in drivers -> pci -> aer support. Might show something if things go wrong with pcie (but it may also just spam dmesg with corrected errors).

geerlingguy commented 2 years ago

Well hey... speaking of memory issues: Jon Nettleton may be on to something, with an option to put all buffers into main memory / GTT (graphics translation table) on AArch64.

And a thread on lore.kernel.org about framebuffer corruption due to overlapping stp instructions on arm64.

geerlingguy commented 2 years ago

Also on 32-bit 5.10.y, hitting that bug again:

drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c: In function 'amdgpu_dm_atomic_commit_tail':
drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:7757:4: error: implicit declaration of function 'is_hdr_metadata_different'; did you mean 'is_scaling_state_different'? [-Werror=implicit-function-declaration]
    is_hdr_metadata_different(old_con_state, new_con_state);
    ^~~~~~~~~~~~~~~~~~~~~~~~~
    is_scaling_state_different

I know someone reached out to me about that on Instagram DMs, and I think that's this issue again: https://github.com/raspberrypi/linux/issues/4534

geerlingguy commented 2 years ago

Weird, on 32-bit, when I run sudo modprobe amdgpu, I just see:

[  222.033951] [drm] amdgpu kernel modesetting enabled.

And don't see any of the initialization routine at all. lspci is showing the card just the same as booting in 64-bit, and the dmesg output for the PCIe initialization also looks great, with all the non-IO BARs assigned.

Testing with the RX 550, though, on 32-bit Pi OS, I still get the hard lockup after [ 53.767838] [drm] Chained IB support enabled!. So maybe newer generation cards just don't play nice on 32-bit OSes since nobody really uses those these days for the platforms where you'd use an RX 6700 XT.

genbtc commented 2 years ago

The fact that it's an "implicit declaration" when it's trying to call is_hdr_metadata_different() actually just seems like a blatant bug in the Pi tree: they erased the function and missed a spot where it was called. Sloppy.

Since you just want to get it booted at all, and since it's only first booting up, the obvious choice is to just neglect HDR and continue on. drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:7757

        hdr_changed =
            is_hdr_metadata_different(old_con_state, new_con_state);

would become (note the negation, since the generic helper returns true when the metadata is equal)

        hdr_changed =
            !drm_connector_atomic_hdr_metadata_equal(old_con_state, new_con_state);
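
One thing to watch with this swap: the removed function answered "is it different?" while the generic helper answers "is it equal?", so the result has to be inverted to keep hdr_changed meaning the same thing. A userspace sketch of that relationship, using a made-up blob type rather than the real DRM connector state:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an HDR metadata blob attached to a state. */
struct hdr_blob {
    const void *data;
    size_t len;
};

/* "Equal" semantics: true when both are absent or hold identical bytes. */
static bool hdr_metadata_equal(const struct hdr_blob *a,
                               const struct hdr_blob *b)
{
    if (a == NULL || b == NULL)
        return a == b;
    if (a->len != b->len)
        return false;
    if (a->len == 0)
        return true;
    return memcmp(a->data, b->data, a->len) == 0;
}

/* "Different" semantics are simply the negation of "equal". */
static bool hdr_metadata_changed(const struct hdr_blob *a,
                                 const struct hdr_blob *b)
{
    return !hdr_metadata_equal(a, b);
}
```

Without the negation, hdr_changed would be true exactly when nothing changed, which is the opposite of what the commit code expects.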

https://github.com/geerlingguy/raspberry-pi-pcie-devices/issues/222#issuecomment-915422493

Looks like it was missed in raspberrypi/linux@6bd4634 which removed is_hdr_metadata_different for the generic helper function drm_connector_atomic_hdr_metadata_equal.

It has to be addressed upstream.

Nevermind, it has been dealt with. https://github.com/raspberrypi/linux/commit/4117cba235d24a7c4630dc38cb55cc80a04f5cf3