bootlin / mali-driver

GNU General Public License v2.0

Linux 6.2 issue #7

Open giuliobenetti opened 1 year ago

giuliobenetti commented 1 year ago

Refer to #6 for further information.

cbalint13 commented 1 year ago

@giuliobenetti ,

I've tested it on an rk3399 NanoPC-T2 with 6.2.9-300.fc38.aarch64. It loads, but probing it with OpenCL fails with a kernel crash; see the output below. With the 6.1.8 kernel everything works fine.

Let me know if this 6.x tree can be fixed; I am interested in following along.



# dmesg | grep mali
[   21.078657] mali_kbase: loading out-of-tree module taints kernel.
[   21.082991] mali_kbase: module verification failed: signature and/or required key missing - tainting kernel
[   21.167863] mali ff9a0000.gpu: GPU identified as 0x0860 r2p0 status 0
[   21.168701] mali ff9a0000.gpu: Protected mode not available
[   21.178130] mali ff9a0000.gpu: Probed as mali0

giuliobenetti commented 1 year ago

@cbalint13 is the board really the NanoPC-T2? Because that has Samsung S5P4418 SoC: https://www.friendlyelec.com/index.php?route=product/product&path=69&product_id=103

I think you're running on NanoPC-T4: https://www.friendlyelec.com/index.php?route=product/product&path=69&product_id=225 Right?

Anyway the patch you've proposed looks correct to me. I've committed a patch only for consistency but it doesn't fix anything.

giuliobenetti commented 1 year ago

I've just ordered a RK3399 board that is already supported by Buildroot so I can check and debug it.

cbalint13 commented 1 year ago

> I think you're running on NanoPC-T4: https://www.friendlyelec.com/index.php?route=product/product&path=69&product_id=225 Right?
>
> Anyway the patch you've proposed looks correct to me. I've committed a patch only for consistency but it doesn't fix anything.

cbalint13 commented 1 year ago

@giuliobenetti ,

> I've just ordered a RK3399 board that is already supported by Buildroot so I can check and debug it.

giuliobenetti commented 1 year ago

> @giuliobenetti ,
>
> > I've just ordered a RK3399 board that is already supported by Buildroot so I can check and debug it.
>
>   • Let me know if you need a "quick" basic/minimal sdcard image: efi+grub+kernel-6.x (it is not that easy to get efi+grub+6.x)
>   • The rootfs (btrfs) can be anything you would like; I can provide a fedora one by default, and you can replace it afterwards.

Thanks a lot. For the moment I've built Buildroot, but I need to wait for the board to arrive; I ordered it 2 hours ago :-)

giuliobenetti commented 1 year ago

@cbalint13 can you please point me to the URL of the Mali blob you're using, so I can set up my Buildroot system correctly? At the moment I'm using the G31 blob without version checking, but the repository I pick the blob from (link) only has a blob for the r18p0 version, and that would fail against the r8p0 driver because of version checking. You should have a blob without version checking.

Thanks in advance for helping!

cbalint13 commented 1 year ago

@giuliobenetti ,

> @cbalint13 can you please point me to the URL of the Mali blob you're using, so I can set up my Buildroot system correctly? At the moment I'm using the G31 blob without version checking, but the repository I pick the blob from (link) only has a blob for the r18p0 version, and that would fail against the r8p0 driver because of version checking. You should have a blob without version checking.


In short this is exposed to system:

# rpm -ql libmali
/etc/OpenCL/vendors/mali.icd
/usr/lib/.build-id
/usr/lib/.build-id/67
/usr/lib/.build-id/67/22e723d65ca9ddbf0d0e14af3ce769718f9f6c
/usr/lib64/libGLES_mali.so
/usr/lib64/libMaliOpenCL.so
/usr/lib64/libmali.so
/usr/share/licenses/libmali
/usr/share/licenses/libmali/END_USER_LICENCE_AGREEMENT.txt

# cat /etc/OpenCL/vendors/mali.icd
libMaliOpenCL.so

Excerpt from the build recipe:

cp -Pf lib/aarch64-linux-gnu/libmali-midgard-t86x-r18p0-x11.so %{buildroot}/%{_libdir}/libMaliOpenCL.so
ln -s libMaliOpenCL.so libmali.so
ln -s libMaliOpenCL.so libGLES_mali.so

> Thanks in advance for helping!

Let me know if you need more details for reproducibility.

cbalint13 commented 1 year ago

@giuliobenetti ,

Double-checked: it seems to be library independent; the r8p kernel driver works with v1.r18p0-01rel0.5cb5681058e8e076ff89747c20c32578:

# ./clDeviceQuery 
clDeviceQuery Starting...

arm_release_ver of this libmali is 'r18p0-01rel0', rk_so_ver is '4'.
1 OpenCL Platforms found

 CL_PLATFORM_NAME:  ARM Platform
 CL_PLATFORM_VERSION:   OpenCL 1.2 v1.r18p0-01rel0.5cb5681058e8e076ff89747c20c32578
OpenCL Device Info:

 1 devices found supporting OpenCL on: ARM Platform

 ----------------------------------
 Device Mali-T860
 ---------------------------------
  CL_DEVICE_NAME:           Mali-T860
  CL_DEVICE_VENDOR:             ARM
  CL_DRIVER_VERSION:            1.2

cbalint13 commented 1 year ago

@giuliobenetti ,

Triple checked,


Failed test with r18 library:

# ./clDeviceQuery 
clDeviceQuery Starting...

arm_release_ver of this libmali is 'r18p0-01rel0', rk_so_ver is '4'.
1 OpenCL Platforms found

 CL_PLATFORM_VERSION:           OpenCL 1.2 v1.r18p0-01rel0.5cb5681058e8e076ff89747c20c32578

# ./ocl-test 
[   52.944133] mali ff9a0000.gpu: Stride passed to job_submit doesn't match kernel

Passed test with r14 library:

# ./clDeviceQuery
clDeviceQuery Starting...

1 OpenCL Platforms found

 CL_PLATFORM_NAME:              ARM Platform
 CL_PLATFORM_VERSION:           OpenCL 1.2 v1.r14p0-01rel0-git(a79caef).8ddfd7584149d9238dced4e406610de7
OpenCL Device Info:

# ./ocl-test 
2.000000 * 0.000000 + 1024.000000 = 1024.000000
2.000000 * 1.000000 + 1023.000000 = 1025.000000
2.000000 * 2.000000 + 1022.000000 = 1026.000000
2.000000 * 3.000000 + 1021.000000 = 1027.000000
2.000000 * 4.000000 + 1020.000000 = 1028.000000
2.000000 * 5.000000 + 1019.000000 = 1029.000000
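
When comparing blobs like this, it can help to pull the release tag straight out of the CL_PLATFORM_VERSION string rather than reading it by eye. The following is only a minimal sketch: the helper name `mali_release` and the regular expression are assumptions made for illustration, based on the two strings printed above, not on any documented format.

```python
import re

# Hypothetical helper: extract the "rNpM" release tag from a
# CL_PLATFORM_VERSION string such as the ones printed by clDeviceQuery above.
# The pattern is an assumption derived from those two sample strings only.
_RELEASE_RE = re.compile(r"\bv\d+\.(r\d+p\d+)")


def mali_release(platform_version):
    """Return the release tag (for example 'r18p0'), or None if not found."""
    match = _RELEASE_RE.search(platform_version)
    return match.group(1) if match else None


R18 = "OpenCL 1.2 v1.r18p0-01rel0.5cb5681058e8e076ff89747c20c32578"
R14 = "OpenCL 1.2 v1.r14p0-01rel0-git(a79caef).8ddfd7584149d9238dced4e406610de7"

for version_string in (R18, R14):
    print(mali_release(version_string))
```

The tag only identifies the userspace blob; the kernel driver release (r8p0 here) has to be checked separately.
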
giuliobenetti commented 1 year ago

@cbalint13 thanks a lot for all the tests, but I'm a bit confused. Can you please summarize which version works against this driver, and also point me to the URL of the blob and all the logs? In the beginning you pointed me to a segfault, but now I don't see it anymore, so can you explain how this relates to the segfault?

Thanks a lot!

cbalint13 commented 1 year ago

@giuliobenetti ,

> @cbalint13 thanks a lot for all the tests. But I’m a bit confused. Can you please summarize which version works against this driver pointing also the url of the blob and all the logs?

  1. So, r14 userland library works (detection + any-ocl-kernels) with your 6.1 vanilla branch on a 6.1.8 kernel.
  2. Now, for the 6.2 kernel the load of gpu driver crashes, nothing works, see the first comment for 6.2 case.
  3. The URL for r14-midgard lib: https://github.com/ariaboard-com/rockchip_libmali/tree/master/lib/aarch64-linux-gnu

> In the beginning you’ve pointed me a segfault but now I don’t see it anymore, so can you explain the relationship with the segfault?
>
> Thanks a lot!

mrfixit2001 commented 1 year ago

@giuliobenetti I can confirm this still exists on 6.5, been working to resolve and have tried a few different variations of the driver and patches

[  617.870232] WARNING: CPU: 4 PID: 2952 at mm/memory.c:5185 handle_mm_fault+0x1f0/0x210
[  617.870924] Modules linked in: 8021q btsdio hci_uart btqca btusb btrtl btbcm btintel bluetooth ecdh_generic ecc ir_rcmm_decoder ir_imon_decoder ir_xmp_decoder ir_mce_kbd_decoder ir_sharp_decoder ir_sanyo_decoder ir_sony_decoder ir_jvc_decoder ir_rc6_decoder ir_nec_decoder ir_rc5_decoder fusb302 tcpm rk_crypto spi_rockchip pwm_fan rk3399_dmc crypto_engine
[  617.873694] CPU: 4 PID: 2952 Comm: emulationstatio Not tainted 6.5.0 #37
[  617.874280] Hardware name: Pine64 RockPro64 v2.1 (DT)
[  617.874721] pstate: 00000005 (nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  617.875329] pc : handle_mm_fault+0x1f0/0x210
[  617.875704] lr : do_page_fault+0x1b0/0x444
[  617.876064] sp : ffff800085a6bdb0
[  617.876355] x29: ffff800085a6bdb0 x28: ffff0000f367a880 x27: 0000000000000000
[  617.876981] x26: ffff0000f367a880 x25: 0000000000000002 x24: 00000000f7e2f000
[  617.877607] x23: 000000009200004f x22: ffff800085a6beb0 x21: ffff000006ea63c0
[  617.878231] x20: 00000000f7e2f000 x19: 0000000000000255 x18: ffff800085a6bda8
[  617.878857] x17: 0000000000000000 x16: ffff8000811ed598 x15: 00000000f7e6efff
[  617.879481] x14: 00000000f7d69000 x13: 1fffe00001ea0261 x12: ffff800085a6bd48
[  617.880107] x11: ffff00000f501300 x10: ffff00000f50130c x9 : ffff00000f501308
[  617.880732] x8 : 00000000f7e2f000 x7 : 00000000f7e2f000 x6 : ffff00000f501380
[  617.881356] x5 : 0000000000000006 x4 : ffff0000f367a880 x3 : ffff800085a6beb0
[  617.881981] x2 : 0000000000000255 x1 : 00000000040644cb x0 : ffff000006c43960
[  617.882607] Call trace:
[  617.882822]  handle_mm_fault+0x1f0/0x210
[  617.883168]  do_page_fault+0x1b0/0x444
[  617.883497]  do_mem_abort+0x40/0x8c
[  617.883804]  el0_da+0x20/0x54
[  617.884070]  el0t_32_sync_handler+0xf4/0x114
[  617.884445]  el0t_32_sync+0x150/0x154
[  617.884768] ---[ end trace 0000000000000000 ]---

I'm in a 32-bit userland with a 64-bit kernel, which is different than the issue author, but have the same error. Seems an upstream change in 6.2 has triggered this memory incompatibility.

I'm happy to test anything you send!

giuliobenetti commented 1 year ago

@mrfixit2001 @cbalint13 I'm very sorry I still haven't found time to fix this issue.

@mrfixit2001 I agree with you, it seems like a memory incompatibility and it looks the same as @cbalint13 has pointed above.

@mrfixit2001 Are you using OpenCL or OpenGL Userspace Blobs? This can help me to address the problem.

mrfixit2001 commented 1 year ago

@giuliobenetti appreciate the quick reply!

I'm testing GLES / GBM using a RK3399 Midgard

giuliobenetti commented 1 year ago

> @giuliobenetti appreciate the quick reply!
>
> I'm testing GLES / GBM using a RK3399 Midgard

Ok, so this is a common problem between both OpenGL and OpenCL. I'd need a longer backtrace. Would it be possible for you to use Ftrace?

giuliobenetti commented 1 year ago

@mrfixit2001 @cbalint13 could you please give a try to branch https://github.com/giuliobenetti/mali-driver/tree/test/fix-6.2%2B and see if that fixes the runtime failure?

Thanks a lot!

mrfixit2001 commented 1 year ago

@giuliobenetti unfortunately that patch does not resolve. Same failure output.

giuliobenetti commented 1 year ago

> @giuliobenetti unfortunately that patch does not resolve. Same failure output.

Ok, thanks for testing. That patch is needed for consistency in any case so I will commit it later.

@mrfixit2001 would it be possible for you to run ftrace on modprobe?

I will do my best to bring up a board to debug this bug.
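
For anyone who wants to capture such a trace, here is a minimal sketch of the tracefs writes involved. The tracefs file names (`tracing_on`, `current_tracer`, `set_ftrace_filter`) are the standard ones; the `kbase_*` function glob, the helper names and the mount point are assumptions for illustration. Writing the filter only succeeds once the driver's symbols are known to the kernel, and the real files need root.

```python
from pathlib import Path

# Usual tracefs mount point; older setups expose it under
# /sys/kernel/debug/tracing instead (assumption: adjust to your system).
TRACEFS = Path("/sys/kernel/tracing")


def ftrace_plan(func_glob="kbase_*", tracer="function_graph"):
    """Return the (file, value) writes in the order they should be applied.

    func_glob is a hypothetical filter matching the Mali kbase functions.
    """
    return [
        ("tracing_on", "0"),              # stop tracing while configuring
        ("current_tracer", tracer),       # select the tracer
        ("set_ftrace_filter", func_glob), # limit tracing to matching functions
        ("tracing_on", "1"),              # start tracing
    ]


def apply_plan(plan, root=TRACEFS):
    """Write each value to its tracefs file (needs root on a real system)."""
    for name, value in plan:
        (root / name).write_text(value + "\n")


# Printing the plan is harmless; apply_plan() is what actually touches tracefs.
for name, value in ftrace_plan():
    print(f"echo {value} > {TRACEFS / name}")
```

After reproducing the failure, write 0 to `tracing_on` again and save the contents of the `trace` file from the same directory.
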

mrfixit2001 commented 1 year ago

@giuliobenetti I am compiling midgard as built-in rather than as a module, but I will see about adding ftrace.

I've been staring at this code a few days now... Could this possibly be due to the reimplementation of kbase_unmapped_area_topdown? It coincidentally changed right around that same time to use a maple tree instead of rbtree. 3499a13168da6a0c122c70f24e653b650d18c882

mrfixit2001 commented 1 year ago

@giuliobenetti

Attached is a function-graph trace of attempting to start my application. Please let me know what other debug detail you require. And thanks again for your time and involvement!

trace.txt

mrfixit2001 commented 1 year ago

@giuliobenetti I enabled some additional mali tracing in the kernel, not sure if this helps more or not but here is another trace. To be clear, the driver probes fine, it fails when being used.

trace.txt

[    2.271101] mali ff9a0000.gpu: GPU identified as 0x0860 r2p0 status 0
[    2.271805] mali ff9a0000.gpu: Protected mode not available
[    2.272679] mali ff9a0000.gpu: Continuing without devfreq
[    2.273615] mali ff9a0000.gpu: Probed as mali0

mrfixit2001 commented 1 year ago

@giuliobenetti

In case the function-graph isn't what you wanted, here is a new function trace instead

function-trace.zip

giuliobenetti commented 1 year ago

> @giuliobenetti I enabled some additional mali tracing in the kernel, not sure if this helps more or not but here is another trace. To be clear, the driver probes fine, it fails when being used.
>
> trace.txt
>
> [    2.271101] mali ff9a0000.gpu: GPU identified as 0x0860 r2p0 status 0
> [    2.271805] mali ff9a0000.gpu: Protected mode not available
> [    2.272679] mali ff9a0000.gpu: Continuing without devfreq
> [    2.273615] mali ff9a0000.gpu: Probed as mali0

Thank you, but this is normal behavior; the driver works even without devfreq.

giuliobenetti commented 1 year ago

> @giuliobenetti
>
> In case the function-graph isn't what you wanted, here is a new function trace instead
>
> function-trace.zip

Yes, this is close to what I need, but I'd need the stack frame at the segfault, including the last mali driver calls. Anyway, even if that could help, it's not that easy. I'm setting up the BSP and debug environment with TRACE32, which takes some time. Once I have news I will post here, hopefully with a fix. If you could produce the mali driver's stack frame up to the segfault, it would help.

Thank you!

mrfixit2001 commented 1 year ago

FYI - I just tested with the bleeding edge commit from torvalds, same error.

Here's the full GDB backtrace:

Thread 1 "emulationstatio" received signal SIGSEGV, Segmentation fault.
0xf45616d0 in memset () from /lib/libc.so.6
(gdb) thread apply all bt

Thread 9 (Thread 0xef1c7f80 (LWP 3069) "mali-cmar-backe"):
#0  0xf45c3558 in poll () from /lib/libc.so.6
#1  0x00000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 8 (Thread 0xef9c8f80 (LWP 3068) "mali-utility-wo"):
#0  0xf4543d68 in ?? () from /lib/libc.so.6
#1  0xf45520d0 in ?? () from /lib/libc.so.6
#2  0xf45521f8 in ?? () from /lib/libc.so.6
#3  0xf518a00a in gles_vertexp_bb_neon_transform_and_produce_clip_bits () from /usr/lib/libmali.so.1
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 7 (Thread 0xf01c9f80 (LWP 3067) "mali-utility-wo"):
#0  0xf4543d68 in ?? () from /lib/libc.so.6
#1  0xf45520d0 in ?? () from /lib/libc.so.6
#2  0xf45521f8 in ?? () from /lib/libc.so.6
#3  0xf518a00a in gles_vertexp_bb_neon_transform_and_produce_clip_bits () from /usr/lib/libmali.so.1
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 6 (Thread 0xf09caf80 (LWP 3066) "mali-utility-wo"):
#0  0xf4543d68 in ?? () from /lib/libc.so.6
#1  0xf45520d0 in ?? () from /lib/libc.so.6
#2  0xf45521f8 in ?? () from /lib/libc.so.6
#3  0xf518a00a in gles_vertexp_bb_neon_transform_and_produce_clip_bits () from /usr/lib/libmali.so.1
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 5 (Thread 0xf11cbf80 (LWP 3065) "mali-utility-wo"):
#0  0xf4543d68 in ?? () from /lib/libc.so.6
#1  0xf45520d0 in ?? () from /lib/libc.so.6
#2  0xf45521f8 in ?? () from /lib/libc.so.6
#3  0xf518a00a in gles_vertexp_bb_neon_transform_and_produce_clip_bits () from /usr/lib/libmali.so.1
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 4 (Thread 0xf19ccf80 (LWP 3064) "mali-utility-wo"):
#0  0xf4543d68 in ?? () from /lib/libc.so.6
#1  0xf45520d0 in ?? () from /lib/libc.so.6
#2  0xf45521f8 in ?? () from /lib/libc.so.6
#3  0xf518a00a in gles_vertexp_bb_neon_transform_and_produce_clip_bits () from /usr/lib/libmali.so.1
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 3 (Thread 0xf21cdf80 (LWP 3063) "mali-utility-wo"):
#0  0xf4543d68 in ?? () from /lib/libc.so.6
#1  0xf45520d0 in ?? () from /lib/libc.so.6
#2  0xf45521f8 in ?? () from /lib/libc.so.6
#3  0xf518a00a in gles_vertexp_bb_neon_transform_and_produce_clip_bits () from /usr/lib/libmali.so.1
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 2 (Thread 0xf29cef80 (LWP 3062) "mali-mem-purge"):
#0  0xf45867ac in __clock_nanosleep_time64 () from /lib/libc.so.6
#1  0x00000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 1 (Thread 0xf7fb0280 (LWP 3018) "emulationstatio"):
#0  0xf45616d0 in memset () from /lib/libc.so.6
#1  0xf50d6a14 in ?? () from /usr/lib/libmali.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

giuliobenetti commented 1 year ago

> FYI - I just tested with the bleeding edge commit from torvalds, same error.

@mrfixit2001 Thank you for the effort! This is the userspace backtrace, so the functions I see are the blob's. I'm a bit confused now: is it the driver that panics the kernel, or is it the application that panics the kernel by using the driver? I mean, if you modprobe this driver, does it spit out that segfault? Or does the segfault show up while executing an application linked with libmali.so?

giuliobenetti commented 1 year ago

@mrfixit2001 Ok, I finally have a Rockpro64-V2 up and running, with the RK3399 and Mali-T860. modprobe mali_kbase works correctly, so this is something triggered from the blob. I can catch it with a debugger, but that is not an easy or fast task. I will let you know once done.

giuliobenetti commented 1 year ago

@mrfixit2001 I've reproduced the error with the same board you have. I've straced glmark2-es2-drm and it dies here:

ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x82, 0, 0x38), 0xffffe2233b20) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0x41000) = 0xffff95b5c000
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xffff95b5c000} ---
+++ killed by SIGSEGV +++

Need to dig. I will find a simpler test program so I have fewer function calls. Unfortunately, on the RK3399 the SWD lines are shared with the SD card, so I have to set up an nfsroot to connect the debugger... I will keep you updated.
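
To make the strace excerpt easier to read, the sketch below re-encodes the ioctl request from the fields strace printed and works out the page offset the driver's mmap handler receives. The bit layout is the generic one from asm-generic/ioctl.h as used on arm64; the 4 KiB page size is an assumption, and nothing here says what the 0x82 type or the 0x41000 offset mean to the driver.

```python
# Generic ioctl request layout (asm-generic/ioctl.h, used on arm64):
# bits 0-7 nr, bits 8-15 type, bits 16-29 size, bits 30-31 direction.
_IOC_WRITE = 1
_IOC_READ = 2


def ioc(direction, type_, nr, size):
    """Pack the four fields into a 32-bit ioctl request number."""
    return (direction << 30) | (size << 16) | (type_ << 8) | nr


def decode(request):
    """Unpack a 32-bit ioctl request number into its four fields."""
    return {
        "dir": (request >> 30) & 0x3,
        "size": (request >> 16) & 0x3FFF,
        "type": (request >> 8) & 0xFF,
        "nr": request & 0xFF,
    }


# Fields exactly as printed by strace above.
request = ioc(_IOC_READ | _IOC_WRITE, 0x82, 0, 0x38)
print(hex(request))
print(decode(request))

# mmap() takes a byte offset; the kernel hands the driver a page offset
# (vm_pgoff). Assumption: 4 KiB pages, i.e. PAGE_SHIFT = 12.
PAGE_SHIFT = 12
print(hex(0x41000 >> PAGE_SHIFT))
```

Note that si_addr in the SIGSEGV line equals the address mmap just returned, so the process faults on its first access to the new mapping.
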

mrfixit2001 commented 1 year ago

Thank you for keeping us updated!! I’m excited you’re able to reproduce and am hopeful you’ll find a fix soon. I don’t mind patching DRM instead of mali if that’s needed. Looking forward to your reply.

mrfixit2001 commented 1 year ago

@giuliobenetti any progress? Would you mind sharing the full trace/dump so any of us interested can try and help? Side note - is the KDS DMA patch included in this repo required for proper midgard functionality in the modern kernel?

giuliobenetti commented 1 year ago

> @giuliobenetti any progress? Would you mind sharing the full trace/dump so any of us interested can try and help? Side note - is the KDS DMA patch included in this repo required for proper midgard functionality in the modern kernel?

I still haven't had time to start debugging, so I don't have the full trace either. I have to get my hands on this soon. What do you mean by the KDS DMA patch? Can you elaborate?

mrfixit2001 commented 1 year ago

> > @giuliobenetti any progress? Would you mind sharing the full trace/dump so any of us interested can try and help? Side note - is the KDS DMA patch included in this repo required for proper midgard functionality in the modern kernel?
>
> I still haven't had time to start debugging, so I don't have the full trace either. I have to get my hands on this soon. What do you mean by the KDS DMA patch? Can you elaborate?

Thanks for the update, looking forward to hearing back. Regarding the KDS DMA patch, I'm referring to this: https://github.com/bootlin/mali-driver/blob/master/r8p0/patches/integrate_kds_with_dma_buf.patch

mrfixit2001 commented 1 year ago

I was able to get around the shown error by fixing the way the vma flags are cleared in kbase_mmap. The << 4 is no longer correct. Now there's a DMA fence issue:


virtual address 0000000000000010
[   60.771002] Mem abort info:
[   60.771248]   ESR = 0x0000000096000007
[   60.771578]   EC = 0x25: DABT (current EL), IL = 32 bits
[   60.772044]   SET = 0, FnV = 0
[   60.772335]   EA = 0, S1PTW = 0
[   60.772612]   FSC = 0x07: level 3 translation fault
[   60.773040] Data abort info:
[   60.773293]   ISV = 0, ISS = 0x00000007, ISS2 = 0x00000000
[   60.773773]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[   60.774216]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[   60.774682] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000f33be000
[   60.775244] [0000000000000010] pgd=0800000006c09003, p4d=0800000006c09003, pud=080000001160a003, pmd=0800000012943003, pte=0000000000000000
[   60.776356] Internal error: Oops: 0000000096000007 [#1] SMP
[   60.776847] Modules linked in: 8021q btsdio hci_uart btqca btusb btrtl btbcm btintel bluetooth ecdh_generic ecc ir_rcmm_decoder ir_imon_decoder ir_xmp_decoder ir_mce_kbd_decoder ir_sharp_decoder ir_sanyo_decoder ir_sony_decoder ir_jvc_decoder ir_rc6_decoder ir_nec_decoder ir_rc5_decoder fusb302 tcpm rk_crypto pwm_fan spi_rockchip rk3399_dmc crypto_engine
[   60.779635] CPU: 0 PID: 2691 Comm: mali-cmar-backe Not tainted 6.5.0 #14
[   60.780223] Hardware name: Pine64 RockPro64 v2.1 (DT)
[   60.780666] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   60.781277] pc : dma_resv_add_fence+0x7c/0x21c
[   60.781679] lr : kbase_dma_fence_wait+0x170/0x3d4
[   60.782097] sp : ffff800084af3ab0
[   60.782388] x29: ffff800084af3ab0 x28: ffff800084651000 x27: ffff000007e33600
[   60.783018] x26: 0000000103115001 x25: 0000000000000000 x24: 0000000000000000
[   60.783647] x23: 0000000000000001 x22: ffff0000032c4300 x21: 0000000000000000
[   60.784276] x20: 0000000000000000 x19: ffff0000061b3100 x18: 0000000000000000
[   60.784904] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000002
[   60.785533] x14: 0000000000000001 x13: 00000000000da51e x12: 0000000000000048
[   60.786161] x11: 00000000000007e8 x10: ffff800084bedaa8 x9 : ffff0000025bbf00
[   60.786789] x8 : ffff0000032c4340 x7 : 0000000000000000 x6 : 0000000000000000
[   60.787418] x5 : ffff0000032c4310 x4 : 0000000000000001 x3 : 0000000000000000
[   60.788047] x2 : ffff80008105f5e0 x1 : ffff80008106ce30 x0 : ffff80008106ce80
[   60.788677] Call trace:
[   60.788894]  dma_resv_add_fence+0x7c/0x21c
[   60.789256]  kbase_dma_fence_wait+0x170/0x3d4
[   60.789640]  jd_submit_atom+0x888/0x9a4
[   60.789981]  kbase_jd_submit+0x214/0x348
[   60.790328]  kbase_ioctl+0xb6c/0x157c
[   60.790655]  __arm64_compat_sys_ioctl+0x140/0x160
[   60.791074]  invoke_syscall+0x44/0x108
[   60.791411]  el0_svc_common.constprop.0+0x40/0xd8
[   60.791827]  do_el0_svc_compat+0x18/0x38
[   60.792175]  el0_svc_compat+0x14/0x48
[   60.792505]  el0t_32_sync_handler+0x88/0x114
[   60.792881]  el0t_32_sync+0x150/0x154
[   60.793210] Code: fa401044 54000041 d4210000 f9401678 (b9401319)
[   60.793744] ---[ end trace 0000000000000000 ]---
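
For readers less used to arm64 oopses, the ESR value in the log above can be decoded by hand. The sketch below follows the architectural ESR_EL1 field layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0, and for data aborts WnR in ISS bit 6 and the fault status code in ISS bits 5:0); the function name is an assumption for illustration.

```python
def decode_esr(esr):
    """Split an arm64 ESR_EL1 value into the fields the kernel prints."""
    iss = esr & 0x1FFFFFF              # instruction specific syndrome, bits 24:0
    return {
        "EC": (esr >> 26) & 0x3F,      # exception class, bits 31:26
        "IL": (esr >> 25) & 0x1,       # 1 means a 32-bit instruction
        "ISS": iss,
        "WnR": (iss >> 6) & 0x1,       # data aborts: 1 = write, 0 = read
        "FSC": iss & 0x3F,             # data aborts: fault status code
    }


fields = decode_esr(0x96000007)
for name, value in fields.items():
    print(f"{name} = {value:#x}")
```

The decoded values match the kernel's own summary: EC 0x25 (data abort taken at the current exception level), a read access (WnR = 0) and FSC 0x07 (level 3 translation fault). With a faulting address of 0x10, this looks like a read through a NULL pointer at a small field offset.
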

giuliobenetti commented 1 year ago

Hi @mrfixit2001,

> The << 4 is no longer correct.

can you elaborate more? Is there a Linux commit that requires that shift to be changed? If yes please open a PR documenting the reason. Thank you!

mrfixit2001 commented 1 year ago

@giuliobenetti I've created the PR for you. Feel free to edit if you'd like ofc. The change I've made should also be more future-proof without the shift.

https://github.com/bootlin/mali-driver/pull/8
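
As background on the shift, here is a minimal sketch of the arithmetic, assuming the old kbase_mmap line has the shape `flags & ~((flags & (VM_READ | VM_WRITE)) << 4)`. That shape is an assumption, and this does not reproduce the PR. The flag values are the classic ones from include/linux/mm.h, where each VM_MAY* bit sits four bits above its VM_* counterpart.

```python
# Flag values from include/linux/mm.h.
VM_READ, VM_WRITE, VM_EXEC, VM_SHARED = 0x1, 0x2, 0x4, 0x8
VM_MAYREAD, VM_MAYWRITE, VM_MAYEXEC, VM_MAYSHARE = 0x10, 0x20, 0x40, 0x80


def strip_may_by_shift(flags):
    """Assumed shape of the old code: derive the VM_MAY* mask with '<< 4'."""
    return flags & ~((flags & (VM_READ | VM_WRITE)) << 4)


def strip_may_by_name(flags):
    """Same effect, but naming the VM_MAY* bits instead of shifting."""
    mask = 0
    if flags & VM_READ:
        mask |= VM_MAYREAD
    if flags & VM_WRITE:
        mask |= VM_MAYWRITE
    return flags & ~mask


# A shared read/write mapping, as in the strace output earlier in the thread.
flags = (VM_READ | VM_WRITE | VM_SHARED
         | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC | VM_MAYSHARE)

print(hex(flags), hex(strip_may_by_shift(flags)), hex(strip_may_by_name(flags)))
print(bool(strip_may_by_shift(flags) & VM_MAYWRITE))
```

Either way, a read/write mapping ends up without VM_MAYWRITE. One plausible reading, not confirmed in this thread, is that newer kernels reject a write fault on such a VMA, which would fit the handle_mm_fault warning and the SIGSEGV in memset.
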

I would appreciate any insight you may have into the new error I'm getting, the NULL pointer dereference.

mrfixit2001 commented 1 year ago

@giuliobenetti FYI - the new null reference error is being thrown because dma_resv_fences_list is returning NULL... which ultimately means __rcu_dereference_check is returning null. When I add a check to dma-resv that checks for this NULL I can bypass the error but then either the card locks up without throwing a panic or the mali driver throws errors about job hard stops, failures, and faults.

giuliobenetti commented 1 year ago

> @giuliobenetti FYI - the new null reference error is being thrown because dma_resv_fences_list is returning NULL... which ultimately means __rcu_dereference_check is returning null. When I add a check to dma-resv that checks for this NULL I can bypass the error but then either the card locks up without throwing a panic or the mali driver throws errors about job hard stops, failures, and faults.

@mrfixit2001 Can you please give the 2 changes below a quick try and let me know the result? Enable CONFIG_DRM_FBDEV_LEAK_PHYS_SMEM, and pass 'drm_kms_helper.drm_leak_fbdev_smem=1' in the Linux bootargs. I still have to start debugging, so I'm pretty blind. Feel free to check my previous commits too; I may have introduced a regression that causes that NULL return.

mrfixit2001 commented 1 year ago

@giuliobenetti Thanks for the idea, but unfortunately that does not change the results; the same error is observed. Please let me know if you have any other ideas. For now I am testing with CONFIG_MALI_DMA_FENCE disabled, but I am then getting IOMMU errors in the VOP. So I expect the DMA fence is going to be required and must be fixed.

On a completely different topic, this will also need to be adjusted for newer kernels: https://github.com/bootlin/mali-driver/blob/master/r8p0/drivers/gpu/arm/midgard/ipa/mali_kbase_ipa.c#L577

After this commit: https://github.com/torvalds/linux/commit/615510fe13bd2434610193f1acab53027d5146d6

giuliobenetti commented 1 year ago

> @giuliobenetti Thanks for the idea, but unfortunately that does not change the results; the same error is observed. Please let me know if you have any other ideas. For now I am testing with CONFIG_MALI_DMA_FENCE disabled, but I am then getting IOMMU errors in the VOP. So I expect the DMA fence is going to be required and must be fixed.

Ok, it was only a fast try.

> On a completely different topic, this will also need to be adjusted for newer kernels: https://github.com/bootlin/mali-driver/blob/master/r8p0/drivers/gpu/arm/midgard/ipa/mali_kbase_ipa.c#L577
>
> After this commit: torvalds/linux@615510f

Can you open a PR for that? Can you also write a commit log like I did for the previous commits? Check here: https://github.com/bootlin/mali-driver/commit/c90627f78d58567a2acb7cbf77d565e03a131294

Thank you!

mrfixit2001 commented 1 year ago

@giuliobenetti

I have refactored and adjusted the DMA fence code to always call dma_resv_reserve_fences for both read and write DMA reservations, and I've updated kbase_dma_fence_lock_reservations to use dma-resv locks instead of ww mutexes (this mirrors changes made to the DRM drivers). This resolves the NULL error, but I'm still stuck with IOMMU errors. And if the board doesn't lock up afterwards, I get a bunch of mali errors.
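
The reserve-before-add rule behind that change can be pictured with a toy model. This is plain Python, not kernel code: `ToyResv` and its methods are hypothetical stand-ins that only mimic the contract that a slot must be reserved (the role of dma_resv_reserve_fences) before a fence is added (the role of dma_resv_add_fence), and that no fence list exists until the first reservation.

```python
class ToyResv:
    """Toy stand-in for a reservation object; NOT the kernel's struct dma_resv."""

    def __init__(self):
        self.fences = None      # no fence list allocated yet
        self.max_fences = 0     # number of slots reserved so far

    def reserve_fences(self, num_fences):
        """Allocate the list if needed and make room for num_fences more."""
        if self.fences is None:
            self.fences = []
        self.max_fences = max(self.max_fences, len(self.fences) + num_fences)

    def add_fence(self, fence, usage):
        """Add a fence; only valid when a slot was reserved beforehand."""
        if self.fences is None or len(self.fences) >= self.max_fences:
            raise RuntimeError("add_fence called without a reserved slot")
        self.fences.append((fence, usage))


resv = ToyResv()
try:
    resv.add_fence("fence0", "read")       # no reservation yet: rejected
except RuntimeError as err:
    print("rejected:", err)

resv.reserve_fences(1)                      # reserve first ...
resv.add_fence("fence0", "read")            # ... then adding succeeds
print(resv.fences)
```

In the toy model the missing reservation raises an exception; in the oops earlier in the thread, the missing fence list appears to have been dereferenced directly instead.
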

Here's the next error to be investigated...

[   20.339631] rk_iommu ff903f00.iommu: Enable stall request timed out, status: 0x000001
[   20.341404] rk_iommu ff903f00.iommu: Disable paging request timed out, status: 0x000001
[   20.348044] ------------[ cut here ]------------
[   20.348888] WARNING: CPU: 1 PID: 672 at drivers/iommu/iommu.c:122 iommu_detach_device+0xb4/0xbc
[   20.349662] Modules linked in: btsdio hci_uart btqca btusb btrtl btbcm btintel bluetooth ecdh_generic ecc ir_rcmm_decoder ir_imon_decoder ir_xmp_decoder ir_mce_kbd_decoder ir_sharp_decoder ir_sanyo_decoder ir_sony_decoder ir_jvc_decoder ir_rc6_decoder ir_nec_decoder ir_rc5_decoder fusb302 rk_crypto tcpm pwm_fan spi_rockchip rk3399_dmc crypto_engine
[   20.352406] CPU: 1 PID: 672 Comm: ffplay Not tainted 6.5.0 #26
[   20.352918] Hardware name: Pine64 RockPro64 v2.1 (DT)
[   20.353361] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   20.353972] pc : iommu_detach_device+0xb4/0xbc
[   20.354366] lr : iommu_detach_device+0x80/0xbc
[   20.354758] sp : ffff80008517b900
[   20.355050] x29: ffff80008517b900 x28: 0000000000000000 x27: ffff800081038c58
[   20.355679] x26: ffff00000ba6b400 x25: ffff800081a5c25f x24: ffff800081478e50
[   20.356308] x23: 0000000000000038 x22: ffff0000003dd800 x21: ffff000002f51468
[   20.356937] x20: ffff00000327b6a8 x19: ffff000002f51400 x18: 0000000000000030
[   20.357565] x17: 7574617473202c74 x16: 756f2064656d6974 x15: 2074736575716572
[   20.358194] x14: ffff800081940dd8 x13: 0000000000000684 x12: 000000000000022c
[   20.358821] x11: 202c74756f206465 x10: ffff800081998dd8 x9 : 00000000fffff000
[   20.359450] x8 : ffff800081940dd8 x7 : ffff800081998dd8 x6 : 0000000000000001
[   20.360078] x5 : ffff0000025b99c0 x4 : 0000000000000000 x3 : 0000000000000001
[   20.360706] x2 : 0000000000000001 x1 : 0000000000000005 x0 : 00000000ffffff92
[   20.361336] Call trace:
[   20.361552]  iommu_detach_device+0xb4/0xbc
[   20.361916]  rockchip_drm_dma_detach_device+0x18/0x24
[   20.362367]  vop_crtc_atomic_disable+0x264/0x388
[   20.362774]  disable_outputs+0x22c/0x338
[   20.363122]  drm_atomic_helper_commit_tail_rpm+0x20/0x98
[   20.363590]  commit_tail+0x9c/0x164
[   20.363900]  drm_atomic_helper_commit+0x144/0x170
[   20.364315]  drm_atomic_commit+0xa4/0x100
[   20.364673]  drm_atomic_helper_set_config+0x9c/0xec
[   20.365102]  drm_mode_setcrtc+0x1a8/0x6c4
[   20.365457]  drm_ioctl_kernel+0xbc/0x164
[   20.365805]  drm_ioctl+0x214/0x4bc
[   20.366107]  drm_compat_ioctl+0x10c/0x120
[   20.366463]  __arm64_compat_sys_ioctl+0x140/0x160
[   20.366881]  invoke_syscall+0x44/0x108
[   20.367217]  el0_svc_common.constprop.0+0x40/0xd8
[   20.367633]  do_el0_svc_compat+0x18/0x38
[   20.367982]  el0_svc_compat+0x14/0x48
[   20.368311]  el0t_32_sync_handler+0x88/0x114
[   20.368690]  el0t_32_sync+0x150/0x154
[   20.369016] ---[ end trace 0000000000000000 ]---

mrfixit2001 commented 1 year ago

@giuliobenetti

Additional detail: the iommu error above is thrown when I try to exit ffplay, which uses SDL2. The audio plays but the video is blank, and it errors out afterwards.

I get a completely different error when I try to start KODI, which does NOT use SDL2 at all; it sends all its display output directly to drm / gbm. The output below basically repeats over and over:

[  775.614898] mali ff9a0000.gpu: AS_ACTIVE bit stuck
[  775.615339] mali ff9a0000.gpu: Flush for GPU page table update did not complete. Issueing GPU soft-reset to recover
[  775.616264] mali ff9a0000.gpu: Preparing to soft-reset GPU: Waiting (upto 3000 ms) for all jobs to complete soft-stop
[  775.737509] mali ff9a0000.gpu: AS_ACTIVE bit stuck
[  775.757568] mali ff9a0000.gpu: AS_ACTIVE bit stuck
[  775.777625] mali ff9a0000.gpu: AS_ACTIVE bit stuck
[  775.778047] mali ff9a0000.gpu: Flush for GPU page table update did not complete. Issueing GPU soft-reset to recover
[  775.898797] mali ff9a0000.gpu: AS_ACTIVE bit stuck
[  775.918893] mali ff9a0000.gpu: AS_ACTIVE bit stuck
[  775.938984] mali ff9a0000.gpu: AS_ACTIVE bit stuck
[  775.939405] mali ff9a0000.gpu: Flush for GPU page table update did not complete. Issueing GPU soft-reset to recover
[  778.617254] mali ff9a0000.gpu: Resetting GPU (allowing up to 500 ms)
[  778.617821] mali ff9a0000.gpu: Register state:
[  778.618212] mali ff9a0000.gpu:   GPU_IRQ_RAWSTAT=0x00000200 GPU_STATUS=0x00000009
[  778.618867] mali ff9a0000.gpu:   JOB_IRQ_RAWSTAT=0x00000000 JOB_IRQ_JS_STATE=0x00000002
[  778.619580] mali ff9a0000.gpu:   JS0_STATUS=0x00000000      JS0_HEAD_LO=0x00000000
[  778.620244] mali ff9a0000.gpu:   JS1_STATUS=0x00000008      JS1_HEAD_LO=0xf63c8500
[  778.620906] mali ff9a0000.gpu:   JS2_STATUS=0x00000000      JS2_HEAD_LO=0x00000000
[  778.621568] mali ff9a0000.gpu:   MMU_IRQ_RAWSTAT=0x00000000 GPU_FAULTSTATUS=0x00000000
[  778.622260] mali ff9a0000.gpu:   GPU_IRQ_MASK=0x00000000    JOB_IRQ_MASK=0x00000000     MMU_IRQ_MASK=0x00000000
[  778.623150] mali ff9a0000.gpu:   PWR_OVERRIDE0=0x00000000   PWR_OVERRIDE1=0x00000000
[  778.623827] mali ff9a0000.gpu:   SHADER_CONFIG=0x00010000   L2_MMU_CONFIG=0x00000000
[  778.624504] mali ff9a0000.gpu:   TILER_CONFIG=0x00000001    JM_CONFIG=0x00000038
[  778.625175] mali ff9a0000.gpu: t6xx: GPU fault 0x4002 from job slot 1
[  779.125182] mali ff9a0000.gpu: Failed to soft-reset GPU (timed out after 500 ms), now attempting a hard reset
[  779.126104] mali ff9a0000.gpu: Reset complete
[  779.126550] mali ff9a0000.gpu: t6xx: GPU fault 0x4002 from job slot 0
[  784.265095] mali ff9a0000.gpu: JS: Job Hard-Stopped (took more than 50 ticks at 100 ms/tick)
[  784.765857] mali ff9a0000.gpu: JS: Job has been on the GPU for too long (JS_RESET_TICKS_SS/DUMPING timeout hit). Issueing GPU soft-reset to resolve.

Any insights or ideas are welcome.

mrfixit2001 commented 1 year ago

I'll give you one more error to reference as well... Let me know if you have any ideas based on any of this...

I tried increasing KBASE_AS_INACTIVE_MAX_LOOPS so it wouldn't think the gpu was stuck, but that either had no effect or caused the board to simply lock up without throwing an error...

But if I first trigger the iommu error with ffplay, and THEN try to start kodi, I get yet another completely different error :)
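
A note on why raising that limit may not help: the "AS_ACTIVE bit stuck" message appears to come from a bounded polling loop on an MMU status bit. The userspace sketch below models that idea under stated assumptions. The names wait_as_inactive, fake_status and stuck_status and the AS_STATUS_ACTIVE mask are hypothetical and are not the driver's own code; the "register" is simulated.

```c
#include <stddef.h>
#include <stdint.h>

#define AS_STATUS_ACTIVE 0x1u /* hypothetical mask, for illustration only */

typedef uint32_t (*read_status_fn)(void *ctx);

/* Poll until the active bit clears; give up after max_loops reads.
 * Returns 0 when the bit cleared, -1 when it was still set at the end. */
static int wait_as_inactive(read_status_fn read_status, void *ctx,
			    unsigned max_loops)
{
	while (max_loops--) {
		if (!(read_status(ctx) & AS_STATUS_ACTIVE))
			return 0;
	}
	return -1;
}

/* Simulated register that reports "active" for *remaining more reads. */
static uint32_t fake_status(void *ctx)
{
	unsigned *remaining = ctx;

	if (*remaining == 0)
		return 0;
	(*remaining)--;
	return AS_STATUS_ACTIVE;
}

/* Simulated register whose active bit never clears. */
static uint32_t stuck_status(void *ctx)
{
	(void)ctx;
	return AS_STATUS_ACTIVE;
}
```

With fake_status, a larger max_loops turns a timeout into a success. With stuck_status, no limit is large enough, and a bigger one only makes the caller spin longer. That would be consistent with the behaviour reported here if the hardware never clears the bit.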

[  106.562177] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[  106.562968] Mem abort info:
[  106.563214]   ESR = 0x0000000096000007
[  106.563544]   EC = 0x25: DABT (current EL), IL = 32 bits
[  106.564008]   SET = 0, FnV = 0
[  106.564277]   EA = 0, S1PTW = 0
[  106.564554]   FSC = 0x07: level 3 translation fault
[  106.564982] Data abort info:
[  106.565235]   ISV = 0, ISS = 0x00000007, ISS2 = 0x00000000
[  106.565715]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[  106.566158]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  106.566623] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000168b0000
[  106.567186] [0000000000000000] pgd=08000000168a8003, p4d=08000000168a8003, pud=0800000007197003, pmd=08000000168bd003, pte=0000000000000000
[  106.568288] Internal error: Oops: 0000000096000007 [#1] SMP
[  106.568776] Modules linked in: 8021q btsdio hci_uart btqca btusb btrtl btbcm btintel bluetooth ecdh_generic ecc fusb302 tcpm rk_crypto spi_rockchip pwm_fan rk3399_dmc crypto_engine
[  106.570218] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G     U  W          6.5.0 #39
[  106.570867] Hardware name: Pine64 RockPro64 v2.1 (DT)
[  106.571310] pstate: 800000c5 (Nzcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  106.571921] pc : timekeeping_advance+0x7c/0x558
[  106.572332] lr : update_wall_time+0x10/0x2c
[  106.572704] sp : ffff800081923b10
[  106.572996] x29: ffff800081923b10 x28: ffff800081931600 x27: 0000000000000400
[  106.573625] x26: ffff800081aeb000 x25: ffff0000f77bff80 x24: 0000000000000000
[  106.574255] x23: ffff0000f7750848 x22: 00000018cf79f264 x21: ffff800081aebc00
[  106.574883] x20: ffff800081926000 x19: 0000000000000001 x18: 0000000000000000
[  106.575511] x17: 0000000000000000 x16: 0000000000000000 x15: 00000000000003f0
[  106.576139] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000001
[  106.576767] x11: 0000000000000002 x10: 0000000000000960 x9 : ffff800081926000
[  106.577395] x8 : 00000000000000c0 x7 : 00000000ffff216e x6 : ffff800081ac9a30
[  106.578025] x5 : 00ffffffffffffff x4 : ffff800081aebd28 x3 : 0000000000000000
[  106.578653] x2 : 0000000000000001 x1 : 0000000000000000 x0 : 0000000000000000
[  106.579283] Call trace:
[  106.579499]  timekeeping_advance+0x7c/0x558
[  106.579871]  update_wall_time+0x10/0x2c
[  106.580211]  tick_do_update_jiffies64+0xe4/0x150
[  106.580623]  tick_irq_enter+0x7c/0xac
[  106.580950]  irq_enter_rcu+0x60/0x74
[  106.581268]  el1_interrupt+0x24/0x4c
[  106.581587]  el1h_64_irq_handler+0x14/0x1c
[  106.581948]  el1h_64_irq+0x64/0x68
[  106.582252]  default_idle_call+0x24/0x34
[  106.582601]  do_idle+0xa4/0xf4
[  106.582873]  cpu_startup_entry+0x24/0x28
[  106.583220]  kernel_init+0x0/0x1cc
[  106.583521]  arch_post_acpi_subsys_init+0x0/0x8
[  106.583925]  start_kernel+0x4a4/0x57c
[  106.584251]  __primary_switched+0xb4/0xbc
[  106.584612] Code: 350020e1 f9409ea0 a90363f7 52000273 (f9400001) 
[  106.585146] ---[ end trace 0000000000000000 ]---
[  106.585551] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[  106.586151] SMP: stopping secondary CPUs
[  107.753162] SMP: failed to stop secondary CPUs 0,2-5
[  107.753598] Kernel Offset: disabled
[  107.753905] CPU features: 0x40000104,1a000000,0800400b
[  107.754356] Memory Limit: none
[  107.754628] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---
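
For anyone reading the "Mem abort info" lines in the oops above, the decoded fields can be recomputed from the raw ESR value. The sketch below assumes the AArch64 ESR_ELx layout for data aborts (EC in bits 31:26, IL in bit 25, ISS in bits 24:0, WnR in ISS bit 6, fault status code in ISS bits 5:0). The struct and function names are hypothetical.

```c
#include <stdint.h>

/* Selected ESR_ELx fields for a data abort. */
struct esr_fields {
	unsigned ec;   /* exception class, bits [31:26] */
	unsigned il;   /* instruction length bit, bit 25 */
	unsigned iss;  /* instruction specific syndrome, bits [24:0] */
	unsigned wnr;  /* write-not-read, ISS bit 6 */
	unsigned dfsc; /* data fault status code, ISS bits [5:0] */
};

static struct esr_fields esr_decode(uint64_t esr)
{
	struct esr_fields f;

	f.ec = (unsigned)((esr >> 26) & 0x3f);
	f.il = (unsigned)((esr >> 25) & 0x1);
	f.iss = (unsigned)(esr & 0x1ffffff);
	f.wnr = (f.iss >> 6) & 0x1;
	f.dfsc = f.iss & 0x3f;
	return f;
}
```

Calling esr_decode(0x96000007) gives ec = 0x25, il = 1, iss = 0x7, wnr = 0 and dfsc = 0x07, matching the EC, IL, ISS, WnR and FSC lines in the log: a read at address 0 that hit a level 3 translation fault.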

mrfixit2001 commented 1 year ago

I am also wondering whether we need to change anything to integrate with this upstream kernel commit: https://github.com/torvalds/linux/commit/d08d42de6432d5064045159aed060e3db9fa7807

mrfixit2001 commented 12 months ago

@giuliobenetti You can close this issue - I was able to make everything work after my new patches on kernel 6.6 :)

The errors mentioned above were somewhat misleading: ultimately the issue was related to a change in the kernel power management driver and had nothing to do with mali at all.

In any event, the error originally reported in this issue was resolved by my PR. I'll leave the rest of the error output here for future reference in case anyone is googling and this comes up.

When you have time, I encourage you to refactor the dma fence code to always call dma_resv_reserve_fences for both read and write dma reservations, and to update kbase_dma_fence_lock_reservations to use dma resv locks instead of ww mutexes, because this is what other upstream DRM drivers are doing.

Thanks again for all your help and time!

giuliobenetti commented 11 months ago

> @giuliobenetti You can close this issue - I was able to make everything work after my new patches on kernel 6.6 :)

Very good! Can you please point me to the Linux patches you're referring to? Once I have those patches, I'll test them on my board.

> The errors mentioned above were somewhat misleading: ultimately the issue was related to a change in the kernel power management driver and had nothing to do with mali at all.

Ok, good to know

> In any event, the error originally reported in this issue was resolved by my PR. I'll leave the rest of the error output here for future reference in case anyone is googling and this comes up.

Thanks a lot for providing this fix.

> When you have time, I encourage you to refactor the dma fence code to always call dma_resv_reserve_fences for both read and write dma reservations, and to update kbase_dma_fence_lock_reservations to use dma resv locks instead of ww mutexes, because this is what other upstream DRM drivers are doing.

Can you please point me to an example? Or, if you already have patches, it would be great if you could contribute them, since you've already dealt with the problem.

> Thanks again for all your help and time!

Thank you too :-)