Linux 6.2 issue

giuliobenetti commented 1 year ago

Refer to #6 for further information.

cbalint13 commented 1 year ago

@giuliobenetti ,

I've tested it on an RK3399 NanoPC-T2 with kernel 6.2.9-300.fc38.aarch64. The module loads, but probing it with OpenCL fails with a kernel crash; see the outputs below. With a 6.1.8 kernel everything works fine.

Let me know if this 6.x tree can be fixed, I am interested to follow.

To get past the compilation failure on the 6.2 kernel:


index 710ebac..7cdf1a9 100644
--- a/r8p0/drivers/gpu/arm/midgard/mali_kbase_mem_linux.c
+++ b/r8p0/drivers/gpu/arm/midgard/mali_kbase_mem_linux.c
@@ -2514,7 +2514,7 @@ KBASE_EXPORT_TEST_API(kbase_vunmap);
 #if (LINUX_VERSION_CODE >= KERNEL_VERSION(5, 5, 0))
 static void mali_add_mm_counter(struct mm_struct *mm, int member, long value)
 {
-	atomic_long_add(value, &mm->rss_stat.count[MM_FILEPAGES]);
+	percpu_counter_add(&mm->rss_stat[MM_FILEPAGES], value);
 }
 #else
 static void mali_add_mm_counter(struct mm_struct *mm, int member, long value)

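If the same source has to build on both sides of the 6.2 change, one option is a second version guard around the counter update. The sketch below is a user-space mock, not driver code: `struct mm_mock`, `MOCK_KERNEL_CODE`, `KVER` and the `_mock` helpers are hypothetical stand-ins so the guard logic can be compiled outside the kernel. The comments name the real calls from the patch above.

```c
/* Same arithmetic as the kernel's KERNEL_VERSION() for sublevels <= 255. */
#define KVER(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Hypothetical stand-in for LINUX_VERSION_CODE; override with -DMOCK_KERNEL_CODE=... */
#ifndef MOCK_KERNEL_CODE
#define MOCK_KERNEL_CODE KVER(6, 2, 9)
#endif

enum { MM_FILEPAGES, MM_ANONPAGES, NR_MM_COUNTERS };

/* Mock of the two rss_stat layouts the patch has to cope with. */
struct mm_mock {
#if MOCK_KERNEL_CODE >= KVER(6, 2, 0)
	long rss_stat[NR_MM_COUNTERS];                    /* models an array of per-CPU counters */
#else
	struct { long count[NR_MM_COUNTERS]; } rss_stat;  /* models a struct holding atomic counters */
#endif
};

static void mali_add_mm_counter_mock(struct mm_mock *mm, int member, long value)
{
#if MOCK_KERNEL_CODE >= KVER(6, 2, 0)
	/* real code in the patch: percpu_counter_add(&mm->rss_stat[...], value) */
	mm->rss_stat[member] += value;
#else
	/* real code before the patch: atomic_long_add(value, &mm->rss_stat.count[...]) */
	mm->rss_stat.count[member] += value;
#endif
}

static long mm_counter_mock(const struct mm_mock *mm, int member)
{
#if MOCK_KERNEL_CODE >= KVER(6, 2, 0)
	return mm->rss_stat[member];
#else
	return mm->rss_stat.count[member];
#endif
}
```

Building it once with the default and once with `-DMOCK_KERNEL_CODE=0x060108` exercises both branches. In the driver, the inner guard would sit inside the existing `KERNEL_VERSION(5, 5, 0)` block.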
Kernel loading:


# dmesg | grep mali
[   21.078657] mali_kbase: loading out-of-tree module taints kernel.
[   21.082991] mali_kbase: module verification failed: signature and/or required key missing - tainting kernel
[   21.167863] mali ff9a0000.gpu: GPU identified as 0x0860 r2p0 status 0
[   21.168701] mali ff9a0000.gpu: Protected mode not available
[   21.178130] mali ff9a0000.gpu: Probed as mali0

OpenCL device probing using clDeviceQuery.cpp:

[  412.756241] WARNING: CPU: 4 PID: 1038 at mm/memory.c:5175 handle_mm_fault+0x274/0x290
[  412.756938] Modules linked in: bnep vfat fat brcmfmac_wcc btsdio snd_soc_simple_card snd_soc_hdmi_codec brcmfmac snd_soc_simple_card_utils snd_soc_rockchip_i2s snd_soc_core hantro_vpu rockchip_vdec(C) snd_compress ac97_bus snd_pcm_dmaengine snd_seq snd_seq_device gpio_ir_recv snd_pcm leds_gpio mali_kbase(OE) snd_timer brcmutil hci_uart v4l2_vp9 snd soundcore cfg80211 btqca btrtl v4l2_h264 btbcm rockchip_rga videobuf2_dma_contig btintel videobuf2_dma_sg videobuf2_memops v4l2_mem2mem videobuf2_v4l2 bluetooth videobuf2_common videodev dwmac_rk stmmac_platform rfkill mc stmmac pcs_xpcs phylink rockchip_saradc industrialio_triggered_buffer rockchip_thermal kfifo_buf coresight_cpu_debug coresight cpufreq_dt lz4 lz4_compress zram fuse loop xhci_plat_hcd mmc_block dwc3 dw_hdmi_cec dw_hdmi_i2s_audio ulpi udc_core panfrost rockchipdrm nvme fusb302 crct10dif_ce polyval_ce polyval_generic ghash_ce dwc3_of_simple pwm_fan gpio_keys drm_dma_helper gpu_sched des_generic phy_rockchip_emmc analogix_dp
[  412.757109]  rk_crypto tcpm nvme_core dw_mipi_dsi libdes dw_hdmi phy_rockchip_inno_usb2 rtc_rk808 sdhci_of_arasan dw_wdt adc_keys ohci_platform drm_display_helper sdhci_pltfm typec pl330 industrialio ohci_hcd io_domain phy_rockchip_typec nvme_common sdhci nvmem_rockchip_efuse dw_mmc_rockchip pwm_rockchip cec dw_mmc_pltfm ehci_platform dw_mmc cqhci scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_multipath
[  412.767776] CPU: 4 PID: 1038 Comm: clDeviceQuery Tainted: G         C OE      6.2.9-300.fc38.aarch64 #1
[  412.768600] Hardware name: Unknown Unknown Product/Unknown Product, BIOS 2023.01 01/01/2023
[  412.769330] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  412.769941] pc : handle_mm_fault+0x274/0x290
[  412.770320] lr : handle_mm_fault+0xbc/0x290
[  412.770691] sp : ffff80000d5bbda0
[  412.770983] x29: ffff80000d5bbda0 x28: ffff000007c8a1c0 x27: 0000000000000002
[  412.771614] x26: ffff0000078ac460 x25: 0000000000000255 x24: 0000000000000000
[  412.772243] x23: ffff000006681000 x22: ffff80000d5bbeb0 x21: 0000ffff930f1000
[  412.772873] x20: ffff0000049df2f8 x19: 0000000000000255 x18: 0000000000000000
[  412.773502] x17: 0000000000000000 x16: 0000000000000000 x15: ffff8000095f4f10
[  412.774131] x14: ffff0000078ac400 x13: 1fffe0000084dba1 x12: ffffffffffffffff
[  412.774760] x11: 0000ffff930e8000 x10: 0000ffff930f1000 x9 : ffff8000084d7c4c
[  412.775388] x8 : 0000000000000000 x7 : ffff00000426dd00 x6 : 000000000000000a
[  412.776015] x5 : ffff000007c8a1c0 x4 : 0000000000000000 x3 : 00000000000000c0
[  412.776644] x2 : ffff000007c8a1c0 x1 : 0000000000000000 x0 : 00000000040644cb
[  412.777273] Call trace:
[  412.777491]  handle_mm_fault+0x274/0x290
[  412.777841]  do_page_fault+0x1f4/0x53c
[  412.778176]  do_mem_abort+0x4c/0xa0
[  412.778488]  el0_da+0x48/0x120
[  412.778762]  el0t_64_sync_handler+0xe4/0x120
[  412.779142]  el0t_64_sync+0x194/0x198
[  412.779467] ---[ end trace 0000000000000000 ]---

giuliobenetti commented 1 year ago

@cbalint13 is the board really the NanoPC-T2? Because that has Samsung S5P4418 SoC: https://www.friendlyelec.com/index.php?route=product/product&path=69&product_id=103

I think you're running on NanoPC-T4: https://www.friendlyelec.com/index.php?route=product/product&path=69&product_id=225 Right?

Anyway the patch you've proposed looks correct to me. I've committed a patch only for consistency but it doesn't fix anything.

giuliobenetti commented 1 year ago

I've just ordered a RK3399 board that is already supported by Buildroot so I can check and debug it.

cbalint13 commented 1 year ago

> I think you're running on NanoPC-T4: https://www.friendlyelec.com/index.php?route=product/product&path=69&product_id=225 Right?

Yes, I think I mentioned in previous comments that it is a "FriendlyElec NanoPC-T4".

> Anyway the patch you've proposed looks correct to me. I've committed a patch only for consistency but it doesn't fix anything.

Yes, it only addresses an API change; that was the only obvious one, since it caused a compile error.

cbalint13 commented 1 year ago

@giuliobenetti ,

> I've just ordered a RK3399 board that is already supported by Buildroot so I can check and debug it.

Let me know if you need a "quick" basic/minimal SD card image: efi+grub+kernel-6.x (it is not that easy to get efi+grub+6.x)
The rootfs (btrfs) can be anything you like; I can provide a Fedora one by default, and you can replace it afterwards.

giuliobenetti commented 1 year ago

> @giuliobenetti ,
>
> > I've just ordered a RK3399 board that is already supported by Buildroot so I can check and debug it.
>
> Let me know if you need a "quick" basic/minimal SD card image: efi+grub+kernel-6.x (it is not that easy to get efi+grub+6.x)
>
> The rootfs (btrfs) can be anything you like; I can provide a Fedora one by default, and you can replace it afterwards.

Thanks a lot. For the moment I’ve built Buildroot, but I need to wait for the board to arrive; I ordered it 2 hours ago :-)

giuliobenetti commented 1 year ago

@cbalint13 can you please point me to the URL of the Mali blob you're using, so I can set up my Buildroot system correctly? At the moment I'm using the G31 blob without version checking, but the repository I pick the blob from only has a blob for the r18p0 version, and it would fail against the r8p0 driver because of version checking. You should have a blob without version checking.

Thanks in advance for helping!

cbalint13 commented 1 year ago

@giuliobenetti ,

> @cbalint13 can you please point me to the URL of the Mali blob you're using, so I can set up my Buildroot system correctly? At the moment I'm using the G31 blob without version checking, but the repository I pick the blob from only has a blob for the r18p0 version, and it would fail against the r8p0 driver because of version checking. You should have a blob without version checking.

I am using libmali-midgard-t86x-r18p0-x11.so
If you have a newer one, I would be glad to test it!

The original (friendlyelec) repo is gone: https://github.com/rockchip-linux/libmali
I still have a copy of the repo if you need it, but only r18p0 and r14p0 are available for Midgard.
The package (+repo +recipe) is here: https://copr.fedorainfracloud.org/coprs/rezso/ML/build/5745258

In short, this is what is exposed to the system:

# rpm -ql libmali
/etc/OpenCL/vendors/mali.icd
/usr/lib/.build-id
/usr/lib/.build-id/67
/usr/lib/.build-id/67/22e723d65ca9ddbf0d0e14af3ce769718f9f6c
/usr/lib64/libGLES_mali.so
/usr/lib64/libMaliOpenCL.so
/usr/lib64/libmali.so
/usr/share/licenses/libmali
/usr/share/licenses/libmali/END_USER_LICENCE_AGREEMENT.txt

# cat /etc/OpenCL/vendors/mali.icd
libMaliOpenCL.so

Excerpt from the build recipe:

cp -Pf lib/aarch64-linux-gnu/libmali-midgard-t86x-r18p0-x11.so %{buildroot}/%{_libdir}/libMaliOpenCL.so
ln -s libMaliOpenCL.so libmali.so
ln -s libMaliOpenCL.so libGLES_mali.so

> Thanks in advance for helping!

Let me know if you need more details to reproduce this.

cbalint13 commented 1 year ago

@giuliobenetti ,

Double-checked: it seems to be library independent; the r8p0 kernel driver works with v1.r18p0-01rel0.5cb5681058e8e076ff89747c20c32578:

# ./clDeviceQuery 
clDeviceQuery Starting...

arm_release_ver of this libmali is 'r18p0-01rel0', rk_so_ver is '4'.1 OpenCL Platforms found

 CL_PLATFORM_NAME:  ARM Platform
 CL_PLATFORM_VERSION:   OpenCL 1.2 v1.r18p0-01rel0.5cb5681058e8e076ff89747c20c32578
OpenCL Device Info:

 1 devices found supporting OpenCL on: ARM Platform

 ----------------------------------
 Device Mali-T860
 ---------------------------------
  CL_DEVICE_NAME:           Mali-T860
  CL_DEVICE_VENDOR:             ARM
  CL_DRIVER_VERSION:            1.2

cbalint13 commented 1 year ago

@giuliobenetti ,

Triple-checked:

It seems r18 passes detection but fails to work with the kernel driver.
So only r14 works fine (tested with many kernels) with this driver.

Failed test with r18 library:

# ./clDeviceQuery 
clDeviceQuery Starting...

arm_release_ver of this libmali is 'r18p0-01rel0', rk_so_ver is '4'.1 OpenCL Platforms found

 CL_PLATFORM_VERSION:           OpenCL 1.2 v1.r18p0-01rel0.5cb5681058e8e076ff89747c20c32578

# ./ocl-test 
[   52.944133] mali ff9a0000.gpu: Stride passed to job_submit doesn't match kernel

Passed test with r14 library:

# ./clDeviceQuery
clDeviceQuery Starting...

1 OpenCL Platforms found

 CL_PLATFORM_NAME:              ARM Platform
 CL_PLATFORM_VERSION:           OpenCL 1.2 v1.r14p0-01rel0-git(a79caef).8ddfd7584149d9238dced4e406610de7
OpenCL Device Info:

# ./ocl-test 
2.000000 * 0.000000 + 1024.000000 = 1024.000000
2.000000 * 1.000000 + 1023.000000 = 1025.000000
2.000000 * 2.000000 + 1022.000000 = 1026.000000
2.000000 * 3.000000 + 1021.000000 = 1027.000000
2.000000 * 4.000000 + 1020.000000 = 1028.000000
2.000000 * 5.000000 + 1019.000000 = 1029.000000

giuliobenetti commented 1 year ago

@cbalint13 thanks a lot for all the tests, but I’m a bit confused. Can you please summarize which version works against this driver, including the URL of the blob and all the logs? In the beginning you pointed me to a segfault, but now I don’t see it anymore, so can you explain how the segfault relates to these results?

Thanks a lot!

cbalint13 commented 1 year ago

@giuliobenetti ,

> @cbalint13 thanks a lot for all the tests, but I’m a bit confused. Can you please summarize which version works against this driver, including the URL of the blob and all the logs?

So, the r14 userland library works (detection + any OpenCL kernels) with your 6.1 vanilla branch on a 6.1.8 kernel.
Now, with the 6.2 kernel the GPU driver crashes and nothing works; see the first comment for the 6.2 case.
The URL for the r14 Midgard library: https://github.com/ariaboard-com/rockchip_libmali/tree/master/lib/aarch64-linux-gnu

> In the beginning you pointed me to a segfault, but now I don’t see it anymore, so can you explain how the segfault relates to these results?

The first comment of this issue contains the 6.2 results, where the driver crashes (unlike on 6.1.8).
I then tested which r1x Midgard library is suitable for testing, using a working 6.1.8 kernel setup.

Thanks a lot!

mrfixit2001 commented 1 year ago

@giuliobenetti I can confirm this still exists on 6.5. I've been working to resolve it and have tried a few different variations of the driver and patches.

[  617.870232] WARNING: CPU: 4 PID: 2952 at mm/memory.c:5185 handle_mm_fault+0x1f0/0x210
[  617.870924] Modules linked in: 8021q btsdio hci_uart btqca btusb btrtl btbcm btintel bluetooth ecdh_generic ecc ir_rcmm_decoder ir_imon_decoder ir_xmp_decoder ir_mce_kbd_decoder ir_sharp_decoder ir_sanyo_decoder ir_sony_decoder ir_jvc_decoder ir_rc6_decoder ir_nec_decoder ir_rc5_decoder fusb302 tcpm rk_crypto spi_rockchip pwm_fan rk3399_dmc crypto_engine
[  617.873694] CPU: 4 PID: 2952 Comm: emulationstatio Not tainted 6.5.0 #37
[  617.874280] Hardware name: Pine64 RockPro64 v2.1 (DT)
[  617.874721] pstate: 00000005 (nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  617.875329] pc : handle_mm_fault+0x1f0/0x210
[  617.875704] lr : do_page_fault+0x1b0/0x444
[  617.876064] sp : ffff800085a6bdb0
[  617.876355] x29: ffff800085a6bdb0 x28: ffff0000f367a880 x27: 0000000000000000
[  617.876981] x26: ffff0000f367a880 x25: 0000000000000002 x24: 00000000f7e2f000
[  617.877607] x23: 000000009200004f x22: ffff800085a6beb0 x21: ffff000006ea63c0
[  617.878231] x20: 00000000f7e2f000 x19: 0000000000000255 x18: ffff800085a6bda8
[  617.878857] x17: 0000000000000000 x16: ffff8000811ed598 x15: 00000000f7e6efff
[  617.879481] x14: 00000000f7d69000 x13: 1fffe00001ea0261 x12: ffff800085a6bd48
[  617.880107] x11: ffff00000f501300 x10: ffff00000f50130c x9 : ffff00000f501308
[  617.880732] x8 : 00000000f7e2f000 x7 : 00000000f7e2f000 x6 : ffff00000f501380
[  617.881356] x5 : 0000000000000006 x4 : ffff0000f367a880 x3 : ffff800085a6beb0
[  617.881981] x2 : 0000000000000255 x1 : 00000000040644cb x0 : ffff000006c43960
[  617.882607] Call trace:
[  617.882822]  handle_mm_fault+0x1f0/0x210
[  617.883168]  do_page_fault+0x1b0/0x444
[  617.883497]  do_mem_abort+0x40/0x8c
[  617.883804]  el0_da+0x20/0x54
[  617.884070]  el0t_32_sync_handler+0xf4/0x114
[  617.884445]  el0t_32_sync+0x150/0x154
[  617.884768] ---[ end trace 0000000000000000 ]---

I'm in a 32-bit userland with a 64-bit kernel, which is different than the issue author, but have the same error. Seems an upstream change in 6.2 has triggered this memory incompatibility.

I'm happy to test anything you send!

giuliobenetti commented 1 year ago

@mrfixit2001 @cbalint13 I'm very sorry I still haven't found time to fix this issue.

@mrfixit2001 I agree with you, it seems like a memory incompatibility and it looks the same as @cbalint13 has pointed above.

@mrfixit2001 Are you using OpenCL or OpenGL Userspace Blobs? This can help me to address the problem.

mrfixit2001 commented 1 year ago

@giuliobenetti appreciate the quick reply!

I'm testing GLES / GBM using a RK3399 Midgard

giuliobenetti commented 1 year ago

> @giuliobenetti appreciate the quick reply!
>
> I'm testing GLES / GBM using a RK3399 Midgard

Ok, so this is a common problem between both OpenGL and OpenCL. I'd need a longer backtrace. Would it be possible for you to use Ftrace?

giuliobenetti commented 1 year ago

@mrfixit2001 @cbalint13 could you please give a try to branch https://github.com/giuliobenetti/mali-driver/tree/test/fix-6.2%2B and see if that fixes the runtime failure?

Thanks a lot!

mrfixit2001 commented 1 year ago

@giuliobenetti unfortunately that patch does not resolve it. Same failure output.

giuliobenetti commented 1 year ago

> @giuliobenetti unfortunately that patch does not resolve it. Same failure output.

Ok, thanks for testing. That patch is needed for consistency in any case so I will commit it later.

@mrfixit2001 would it be possible for you to issue a ftrace on modprobe?

I will do my best to bring up a board to debug such bug.

mrfixit2001 commented 1 year ago

@giuliobenetti I am compiling midgard as built-in rather than as a module, but I will see about adding ftrace.

I've been staring at this code a few days now... Could this possibly be due to the reimplementation of kbase_unmapped_area_topdown? It coincidentally changed right around that same time to use a maple tree instead of rbtree. 3499a13168da6a0c122c70f24e653b650d18c882

mrfixit2001 commented 1 year ago

@giuliobenetti

Attached is a function-graph trace of attempting to start my application. Please let me know what other debug detail you require. And thanks again for your time and involvement!

trace.txt

mrfixit2001 commented 1 year ago

@giuliobenetti I enabled some additional mali tracing in the kernel, not sure if this helps more or not but here is another trace. To be clear, the driver probes fine, it fails when being used.

trace.txt

[    2.271101] mali ff9a0000.gpu: GPU identified as 0x0860 r2p0 status 0
[    2.271805] mali ff9a0000.gpu: Protected mode not available
[    2.272679] mali ff9a0000.gpu: Continuing without devfreq
[    2.273615] mali ff9a0000.gpu: Probed as mali0

mrfixit2001 commented 1 year ago

@giuliobenetti

In case the function-graph isn't what you wanted, here is a new function trace instead

function-trace.zip

giuliobenetti commented 1 year ago

> @giuliobenetti I enabled some additional mali tracing in the kernel, not sure if this helps more or not but here is another trace. To be clear, the driver probes fine, it fails when being used.
>
> trace.txt
> [    2.271101] mali ff9a0000.gpu: GPU identified as 0x0860 r2p0 status 0
> [    2.271805] mali ff9a0000.gpu: Protected mode not available
> [    2.272679] mali ff9a0000.gpu: Continuing without devfreq
> [    2.273615] mali ff9a0000.gpu: Probed as mali0

Thank you but this is a normal behavior, the driver works even without devfreq.

giuliobenetti commented 1 year ago

> @giuliobenetti
>
> In case the function-graph isn't what you wanted, here is a new function trace instead
>
> function-trace.zip

Yes, this is close to what I need, but I'd need the stack frame at the segfault, including the last mali driver calls. Anyway, even if that could help, it's not that easy. I'm setting up the BSP and debug environment with TRACE32, which takes some time. Once I have news I will post here, hopefully with a fix. If you could produce the stack frame of the mali driver up to the segfault, it would help.

Thank you!

mrfixit2001 commented 1 year ago

FYI - I just tested with the bleeding edge commit from torvalds, same error.

Here's the full GDB backtrace:

Thread 1 "emulationstatio" received signal SIGSEGV, Segmentation fault.
0xf45616d0 in memset () from /lib/libc.so.6
(gdb) thread apply all bt

Thread 9 (Thread 0xef1c7f80 (LWP 3069) "mali-cmar-backe"):
#0  0xf45c3558 in poll () from /lib/libc.so.6
#1  0x00000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 8 (Thread 0xef9c8f80 (LWP 3068) "mali-utility-wo"):
#0  0xf4543d68 in ?? () from /lib/libc.so.6
#1  0xf45520d0 in ?? () from /lib/libc.so.6
#2  0xf45521f8 in ?? () from /lib/libc.so.6
#3  0xf518a00a in gles_vertexp_bb_neon_transform_and_produce_clip_bits () from /usr/lib/libmali.so.1
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 7 (Thread 0xf01c9f80 (LWP 3067) "mali-utility-wo"):
#0  0xf4543d68 in ?? () from /lib/libc.so.6
#1  0xf45520d0 in ?? () from /lib/libc.so.6
#2  0xf45521f8 in ?? () from /lib/libc.so.6
#3  0xf518a00a in gles_vertexp_bb_neon_transform_and_produce_clip_bits () from /usr/lib/libmali.so.1
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 6 (Thread 0xf09caf80 (LWP 3066) "mali-utility-wo"):
#0  0xf4543d68 in ?? () from /lib/libc.so.6
#1  0xf45520d0 in ?? () from /lib/libc.so.6
#2  0xf45521f8 in ?? () from /lib/libc.so.6
#3  0xf518a00a in gles_vertexp_bb_neon_transform_and_produce_clip_bits () from /usr/lib/libmali.so.1
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 5 (Thread 0xf11cbf80 (LWP 3065) "mali-utility-wo"):
#0  0xf4543d68 in ?? () from /lib/libc.so.6
#1  0xf45520d0 in ?? () from /lib/libc.so.6
#2  0xf45521f8 in ?? () from /lib/libc.so.6
#3  0xf518a00a in gles_vertexp_bb_neon_transform_and_produce_clip_bits () from /usr/lib/libmali.so.1
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 4 (Thread 0xf19ccf80 (LWP 3064) "mali-utility-wo"):
#0  0xf4543d68 in ?? () from /lib/libc.so.6
#1  0xf45520d0 in ?? () from /lib/libc.so.6
#2  0xf45521f8 in ?? () from /lib/libc.so.6
#3  0xf518a00a in gles_vertexp_bb_neon_transform_and_produce_clip_bits () from /usr/lib/libmali.so.1
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 3 (Thread 0xf21cdf80 (LWP 3063) "mali-utility-wo"):
#0  0xf4543d68 in ?? () from /lib/libc.so.6
#1  0xf45520d0 in ?? () from /lib/libc.so.6
#2  0xf45521f8 in ?? () from /lib/libc.so.6
#3  0xf518a00a in gles_vertexp_bb_neon_transform_and_produce_clip_bits () from /usr/lib/libmali.so.1
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 2 (Thread 0xf29cef80 (LWP 3062) "mali-mem-purge"):
#0  0xf45867ac in __clock_nanosleep_time64 () from /lib/libc.so.6
#1  0x00000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 1 (Thread 0xf7fb0280 (LWP 3018) "emulationstatio"):
#0  0xf45616d0 in memset () from /lib/libc.so.6
#1  0xf50d6a14 in ?? () from /usr/lib/libmali.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

giuliobenetti commented 1 year ago

@mrfixit2001 Thank you for the effort! This is a userspace backtrace, so the functions I see are the blob's. I'm a bit confused now: is it the driver that panics the kernel, or is it the application that panics the kernel by using the driver? I mean, if you modprobe this driver, does it produce that segfault, or does the segfault show up while executing an application linked with libmali.so?

giuliobenetti commented 1 year ago

@mrfixit2001 Ok, I finally have a RockPro64-V2 up and running, with an RK3399 and Mali-T860. modprobe mali_kbase works correctly, so this is something triggered from the blob. I can catch it with a debugger, but that is not an easy or fast task. I will let you know once done.

giuliobenetti commented 1 year ago

@mrfixit2001 I've reproduced the error with the same board you have. I've straced glmark2-es2-drm and it dies here:

ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x82, 0, 0x38), 0xffffe2233b20) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0x41000) = 0xffff95b5c000
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xffff95b5c000} ---
+++ killed by SIGSEGV +++

Need to dig. I will find a simpler test program so I have fewer function calls. Unfortunately on the RK3399 the SWD lines are shared with the SD card, so I have to set up an nfsroot to connect with the debugger... I will keep you updated.
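A smaller test program for this kind of failure could map a file descriptor and touch the first byte under a SIGSEGV handler, reporting the `si_code` instead of dying. The sketch below is only that generic probe; `probe_mapping` is a hypothetical helper. It does not issue the Mali ioctl that precedes the `mmap` in the strace above, so reproducing the driver case would still need that setup (the `0x41000` offset appears to come from it).

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

static sigjmp_buf probe_env;
static volatile sig_atomic_t probe_si_code;

/* Record why the access faulted, then jump back into probe_mapping(). */
static void probe_handler(int sig, siginfo_t *info, void *ctx)
{
	(void)sig;
	(void)ctx;
	probe_si_code = info->si_code;
	siglongjmp(probe_env, 1);
}

/*
 * Map `len` bytes of `fd` at `offset` (shared, read/write) and write one byte.
 * Returns 0 if the write succeeded, the SIGSEGV si_code (e.g. SEGV_MAPERR)
 * if the first touch faulted, or -1 if mmap() or sigaction() failed.
 */
static int probe_mapping(int fd, off_t offset, size_t len)
{
	struct sigaction sa, old;
	void *map;
	int result = 0;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
	if (map == MAP_FAILED)
		return -1;

	sa.sa_sigaction = probe_handler;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_SIGINFO;
	if (sigaction(SIGSEGV, &sa, &old) != 0) {
		munmap(map, len);
		return -1;
	}

	if (sigsetjmp(probe_env, 1) == 0)
		*(volatile unsigned char *)map = 0;  /* first touch triggers the page fault */
	else
		result = probe_si_code;              /* reached via siglongjmp() from the handler */

	sigaction(SIGSEGV, &old, NULL);
	munmap(map, len);
	return result;
}
```

Called on an ordinary file of at least `len` bytes this should return 0. A return equal to `SEGV_MAPERR` would match the `si_code` in the strace output.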

mrfixit2001 commented 1 year ago

Thank you for keeping us updated!! I’m excited you’re able to reproduce and am hopeful you’ll find a fix soon. I don’t mind patching DRM instead of mali if that’s needed. Looking forward to your reply.

mrfixit2001 commented 1 year ago

@giuliobenetti any progress? Would you mind sharing the full trace/dump so any of us interested can try and help? Side note - is the KDS DMA patch included in this repo required for proper midgard functionality in the modern kernel?

giuliobenetti commented 1 year ago

> @giuliobenetti any progress? Would you mind sharing the full trace/dump so any of us interested can try and help? Side note - is the KDS DMA patch included in this repo required for proper midgard functionality in the modern kernel?

I still haven't had time to start debugging, so I don't have the full trace either. I have to get my hands on this soon. What do you mean by the KDS DMA patch? Can you elaborate?

mrfixit2001 commented 1 year ago

> > @giuliobenetti any progress? Would you mind sharing the full trace/dump so any of us interested can try and help? Side note - is the KDS DMA patch included in this repo required for proper midgard functionality in the modern kernel?
>
> I still haven't had time to start debugging, so I don't have the full trace either. I have to get my hands on this soon. What do you mean by the KDS DMA patch? Can you elaborate?

Thanks for the update, looking forward to hearing back. Regarding the KDS DMA patch, I'm referring to this: https://github.com/bootlin/mali-driver/blob/master/r8p0/patches/integrate_kds_with_dma_buf.patch

mrfixit2001 commented 1 year ago

I was able to get around the shown error by fixing the way the vma flags are cleared in kbase_mmap. The << 4 is no longer correct. Now there's a DMA fence issue:


virtual address 0000000000000010
[   60.771002] Mem abort info:
[   60.771248]   ESR = 0x0000000096000007
[   60.771578]   EC = 0x25: DABT (current EL), IL = 32 bits
[   60.772044]   SET = 0, FnV = 0
[   60.772335]   EA = 0, S1PTW = 0
[   60.772612]   FSC = 0x07: level 3 translation fault
[   60.773040] Data abort info:
[   60.773293]   ISV = 0, ISS = 0x00000007, ISS2 = 0x00000000
[   60.773773]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[   60.774216]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[   60.774682] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000f33be000
[   60.775244] [0000000000000010] pgd=0800000006c09003, p4d=0800000006c09003, pud=080000001160a003, pmd=0800000012943003, pte=0000000000000000
[   60.776356] Internal error: Oops: 0000000096000007 [#1] SMP
[   60.776847] Modules linked in: 8021q btsdio hci_uart btqca btusb btrtl btbcm btintel bluetooth ecdh_generic ecc ir_rcmm_decoder ir_imon_decoder ir_xmp_decoder ir_mce_kbd_decoder ir_sharp_decoder ir_sanyo_decoder ir_sony_decoder ir_jvc_decoder ir_rc6_decoder ir_nec_decoder ir_rc5_decoder fusb302 tcpm rk_crypto pwm_fan spi_rockchip rk3399_dmc crypto_engine
[   60.779635] CPU: 0 PID: 2691 Comm: mali-cmar-backe Not tainted 6.5.0 #14
[   60.780223] Hardware name: Pine64 RockPro64 v2.1 (DT)
[   60.780666] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   60.781277] pc : dma_resv_add_fence+0x7c/0x21c
[   60.781679] lr : kbase_dma_fence_wait+0x170/0x3d4
[   60.782097] sp : ffff800084af3ab0
[   60.782388] x29: ffff800084af3ab0 x28: ffff800084651000 x27: ffff000007e33600
[   60.783018] x26: 0000000103115001 x25: 0000000000000000 x24: 0000000000000000
[   60.783647] x23: 0000000000000001 x22: ffff0000032c4300 x21: 0000000000000000
[   60.784276] x20: 0000000000000000 x19: ffff0000061b3100 x18: 0000000000000000
[   60.784904] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000002
[   60.785533] x14: 0000000000000001 x13: 00000000000da51e x12: 0000000000000048
[   60.786161] x11: 00000000000007e8 x10: ffff800084bedaa8 x9 : ffff0000025bbf00
[   60.786789] x8 : ffff0000032c4340 x7 : 0000000000000000 x6 : 0000000000000000
[   60.787418] x5 : ffff0000032c4310 x4 : 0000000000000001 x3 : 0000000000000000
[   60.788047] x2 : ffff80008105f5e0 x1 : ffff80008106ce30 x0 : ffff80008106ce80
[   60.788677] Call trace:
[   60.788894]  dma_resv_add_fence+0x7c/0x21c
[   60.789256]  kbase_dma_fence_wait+0x170/0x3d4
[   60.789640]  jd_submit_atom+0x888/0x9a4
[   60.789981]  kbase_jd_submit+0x214/0x348
[   60.790328]  kbase_ioctl+0xb6c/0x157c
[   60.790655]  __arm64_compat_sys_ioctl+0x140/0x160
[   60.791074]  invoke_syscall+0x44/0x108
[   60.791411]  el0_svc_common.constprop.0+0x40/0xd8
[   60.791827]  do_el0_svc_compat+0x18/0x38
[   60.792175]  el0_svc_compat+0x14/0x48
[   60.792505]  el0t_32_sync_handler+0x88/0x114
[   60.792881]  el0t_32_sync+0x150/0x154
[   60.793210] Code: fa401044 54000041 d4210000 f9401678 (b9401319)
[   60.793744] ---[ end trace 0000000000000000 ]---

giuliobenetti commented 1 year ago

Hi @mrfixit2001,

> The << 4 is no longer correct.

can you elaborate more? Is there a Linux commit that requires that shift to be changed? If yes please open a PR documenting the reason. Thank you!

mrfixit2001 commented 1 year ago

@giuliobenetti I've created the PR for you. Feel free to edit if you'd like ofc. The change I've made should also be more future-proof without the shift.

https://github.com/bootlin/mali-driver/pull/8

I would appreciate any insight you may have into the new error I'm getting, the NULL pointer dereference.

mrfixit2001 commented 1 year ago

@giuliobenetti FYI - the new null reference error is being thrown because dma_resv_fences_list is returning NULL... which ultimately means __rcu_dereference_check is returning null. When I add a check to dma-resv that checks for this NULL I can bypass the error but then either the card locks up without throwing a panic or the mali driver throws errors about job hard stops, failures, and faults.

giuliobenetti commented 1 year ago

> @giuliobenetti FYI - the new null reference error is being thrown because dma_resv_fences_list is returning NULL... which ultimately means __rcu_dereference_check is returning null. When I add a check to dma-resv that checks for this NULL I can bypass the error but then either the card locks up without throwing a panic or the mali driver throws errors about job hard stops, failures, and faults.

@mrfixit2001 Can you please quickly try the 2 changes below and let me know the result? Enable CONFIG_DRM_FBDEV_LEAK_PHYS_SMEM, and pass `drm_kms_helper.drm_leak_fbdev_smem=1` in the Linux bootargs. I haven't started debugging yet, so I'm working pretty blind. Feel free to check my previous commits too; I may have introduced a regression that causes that NULL return.

mrfixit2001 commented 1 year ago

@giuliobenetti Thanks for the idea, but unfortunately that does not change the results; the same error is observed. Please let me know if you have any other ideas. For now I am testing with CONFIG_MALI_DMA_FENCE disabled, but I am then getting IOMMU errors in the VOP. So I expect the DMA fence is going to be required and must be fixed.

On a completely different topic, this will also need to be adjusted for newer kernels: https://github.com/bootlin/mali-driver/blob/master/r8p0/drivers/gpu/arm/midgard/ipa/mali_kbase_ipa.c#L577

After this commit: https://github.com/torvalds/linux/commit/615510fe13bd2434610193f1acab53027d5146d6

giuliobenetti commented 1 year ago

> @giuliobenetti Thanks for the idea, but unfortunately that does not change the results; the same error is observed. Please let me know if you have any other ideas. For now I am testing with CONFIG_MALI_DMA_FENCE disabled, but I am then getting IOMMU errors in the VOP. So I expect the DMA fence is going to be required and must be fixed.

OK, it was only a quick try.

> On a completely different topic, this will also need to be adjusted for newer kernels: https://github.com/bootlin/mali-driver/blob/master/r8p0/drivers/gpu/arm/midgard/ipa/mali_kbase_ipa.c#L577
>
> After this commit: torvalds/linux@615510f

Can you open a PR for that? Can you also write a commit log like I've done for previous commits? Check here: https://github.com/bootlin/mali-driver/commit/c90627f78d58567a2acb7cbf77d565e03a131294

Thank you!

mrfixit2001 commented 1 year ago

@giuliobenetti

I have refactored and adjusted the DMA fence code to always call dma_resv_reserve_fences for both read and write DMA reservations, and I've updated kbase_dma_fence_lock_reservations to use dma resv locks instead of ww mutexes (this mirrors changes done to the DRM drivers). This resolves the null error... but now I'm still stuck with iommu errors. And if the board doesn't lock up afterwards then I'm given a bunch of mali errors.
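
For readers following along, here is a toy user-space C model of the ordering described above: a slot has to be reserved before a fence is added, otherwise there is no fence list to add to. This is only a sketch under assumptions and not the kernel's dma-resv code or the actual patch. Every `toy_*` name is hypothetical, fences are plain integers, and where the trace earlier in this thread shows a crash inside dma_resv_add_fence, the toy version returns an error instead.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Toy stand-in for the fence list inside a reservation object. */
struct toy_fence_list {
    unsigned int num_fences; /* slots currently holding a fence */
    unsigned int max_fences; /* slots reserved so far */
    int *fences;             /* hypothetical fence handles */
};

/* Toy stand-in for a reservation object; no list exists until a reserve. */
struct toy_resv {
    struct toy_fence_list *list;
};

/* Reserve room for `num` more fences (the role dma_resv_reserve_fences plays). */
static int toy_resv_reserve_fences(struct toy_resv *resv, unsigned int num)
{
    struct toy_fence_list *list;
    unsigned int needed;
    int *slots;

    assert(resv != NULL);
    if (num == 0)
        return -EINVAL; /* toy choice: reject empty reservations */

    list = resv->list;
    if (!list) {
        list = calloc(1, sizeof(*list));
        if (!list)
            return -ENOMEM;
    }

    needed = list->num_fences + num;
    if (needed > list->max_fences) {
        slots = realloc(list->fences, needed * sizeof(*slots));
        if (!slots) {
            if (!resv->list)
                free(list); /* list was allocated by this call */
            return -ENOMEM;
        }
        list->fences = slots;
        list->max_fences = needed;
    }

    resv->list = list;
    return 0;
}

/*
 * Add a fence to a previously reserved slot. With no list, or no free
 * slot, the toy model reports -EINVAL rather than dereferencing NULL.
 */
static int toy_resv_add_fence(struct toy_resv *resv, int fence)
{
    struct toy_fence_list *list;

    assert(resv != NULL);
    list = resv->list;
    if (!list || list->num_fences >= list->max_fences)
        return -EINVAL;

    list->fences[list->num_fences++] = fence;
    return 0;
}

/* Release everything the toy reservation object owns. */
static void toy_resv_fini(struct toy_resv *resv)
{
    assert(resv != NULL);
    if (resv->list) {
        free(resv->list->fences);
        free(resv->list);
        resv->list = NULL;
    }
}
```

In this model, calling `toy_resv_add_fence` on a fresh `struct toy_resv` fails, while calling `toy_resv_reserve_fences(&resv, 1)` first makes the same add succeed. That is the ordering the refactor enforces for both read and write reservations.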

Here's the next error to be investigated...

[   20.339631] rk_iommu ff903f00.iommu: Enable stall request timed out, status: 0x000001
[   20.341404] rk_iommu ff903f00.iommu: Disable paging request timed out, status: 0x000001
[   20.348044] ------------[ cut here ]------------
[   20.348888] WARNING: CPU: 1 PID: 672 at drivers/iommu/iommu.c:122 iommu_detach_device+0xb4/0xbc
[   20.349662] Modules linked in: btsdio hci_uart btqca btusb btrtl btbcm btintel bluetooth ecdh_generic ecc ir_rcmm_decoder ir_imon_decoder ir_xmp_decoder ir_mce_kbd_decoder ir_sharp_decoder ir_sanyo_decoder ir_sony_decoder ir_jvc_decoder ir_rc6_decoder ir_nec_decoder ir_rc5_decoder fusb302 rk_crypto tcpm pwm_fan spi_rockchip rk3399_dmc crypto_engine
[   20.352406] CPU: 1 PID: 672 Comm: ffplay Not tainted 6.5.0 #26
[   20.352918] Hardware name: Pine64 RockPro64 v2.1 (DT)
[   20.353361] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   20.353972] pc : iommu_detach_device+0xb4/0xbc
[   20.354366] lr : iommu_detach_device+0x80/0xbc
[   20.354758] sp : ffff80008517b900
[   20.355050] x29: ffff80008517b900 x28: 0000000000000000 x27: ffff800081038c58
[   20.355679] x26: ffff00000ba6b400 x25: ffff800081a5c25f x24: ffff800081478e50
[   20.356308] x23: 0000000000000038 x22: ffff0000003dd800 x21: ffff000002f51468
[   20.356937] x20: ffff00000327b6a8 x19: ffff000002f51400 x18: 0000000000000030
[   20.357565] x17: 7574617473202c74 x16: 756f2064656d6974 x15: 2074736575716572
[   20.358194] x14: ffff800081940dd8 x13: 0000000000000684 x12: 000000000000022c
[   20.358821] x11: 202c74756f206465 x10: ffff800081998dd8 x9 : 00000000fffff000
[   20.359450] x8 : ffff800081940dd8 x7 : ffff800081998dd8 x6 : 0000000000000001
[   20.360078] x5 : ffff0000025b99c0 x4 : 0000000000000000 x3 : 0000000000000001
[   20.360706] x2 : 0000000000000001 x1 : 0000000000000005 x0 : 00000000ffffff92
[   20.361336] Call trace:
[   20.361552]  iommu_detach_device+0xb4/0xbc
[   20.361916]  rockchip_drm_dma_detach_device+0x18/0x24
[   20.362367]  vop_crtc_atomic_disable+0x264/0x388
[   20.362774]  disable_outputs+0x22c/0x338
[   20.363122]  drm_atomic_helper_commit_tail_rpm+0x20/0x98
[   20.363590]  commit_tail+0x9c/0x164
[   20.363900]  drm_atomic_helper_commit+0x144/0x170
[   20.364315]  drm_atomic_commit+0xa4/0x100
[   20.364673]  drm_atomic_helper_set_config+0x9c/0xec
[   20.365102]  drm_mode_setcrtc+0x1a8/0x6c4
[   20.365457]  drm_ioctl_kernel+0xbc/0x164
[   20.365805]  drm_ioctl+0x214/0x4bc
[   20.366107]  drm_compat_ioctl+0x10c/0x120
[   20.366463]  __arm64_compat_sys_ioctl+0x140/0x160
[   20.366881]  invoke_syscall+0x44/0x108
[   20.367217]  el0_svc_common.constprop.0+0x40/0xd8
[   20.367633]  do_el0_svc_compat+0x18/0x38
[   20.367982]  el0_svc_compat+0x14/0x48
[   20.368311]  el0t_32_sync_handler+0x88/0x114
[   20.368690]  el0t_32_sync+0x150/0x154
[   20.369016] ---[ end trace 0000000000000000 ]---

mrfixit2001 commented 1 year ago

@giuliobenetti

Additional detail - the above iommu error is thrown after attempting to exit ffplay which uses SDL2. The audio plays but the video is blank and then errors out after.

I get a completely different error when I try to start KODI - which does NOT use SDL2 at all - it sends all its display output directly to drm / gbm. The below basically repeats over and over:

[  775.614898] mali ff9a0000.gpu: AS_ACTIVE bit stuck
[  775.615339] mali ff9a0000.gpu: Flush for GPU page table update did not complete. Issueing GPU soft-reset to recover
[  775.616264] mali ff9a0000.gpu: Preparing to soft-reset GPU: Waiting (upto 3000 ms) for all jobs to complete soft-stop
[  775.737509] mali ff9a0000.gpu: AS_ACTIVE bit stuck
[  775.757568] mali ff9a0000.gpu: AS_ACTIVE bit stuck
[  775.777625] mali ff9a0000.gpu: AS_ACTIVE bit stuck
[  775.778047] mali ff9a0000.gpu: Flush for GPU page table update did not complete. Issueing GPU soft-reset to recover
[  775.898797] mali ff9a0000.gpu: AS_ACTIVE bit stuck
[  775.918893] mali ff9a0000.gpu: AS_ACTIVE bit stuck
[  775.938984] mali ff9a0000.gpu: AS_ACTIVE bit stuck
[  775.939405] mali ff9a0000.gpu: Flush for GPU page table update did not complete. Issueing GPU soft-reset to recover
[  778.617254] mali ff9a0000.gpu: Resetting GPU (allowing up to 500 ms)
[  778.617821] mali ff9a0000.gpu: Register state:
[  778.618212] mali ff9a0000.gpu:   GPU_IRQ_RAWSTAT=0x00000200 GPU_STATUS=0x00000009
[  778.618867] mali ff9a0000.gpu:   JOB_IRQ_RAWSTAT=0x00000000 JOB_IRQ_JS_STATE=0x00000002
[  778.619580] mali ff9a0000.gpu:   JS0_STATUS=0x00000000      JS0_HEAD_LO=0x00000000
[  778.620244] mali ff9a0000.gpu:   JS1_STATUS=0x00000008      JS1_HEAD_LO=0xf63c8500
[  778.620906] mali ff9a0000.gpu:   JS2_STATUS=0x00000000      JS2_HEAD_LO=0x00000000
[  778.621568] mali ff9a0000.gpu:   MMU_IRQ_RAWSTAT=0x00000000 GPU_FAULTSTATUS=0x00000000
[  778.622260] mali ff9a0000.gpu:   GPU_IRQ_MASK=0x00000000    JOB_IRQ_MASK=0x00000000     MMU_IRQ_MASK=0x00000000
[  778.623150] mali ff9a0000.gpu:   PWR_OVERRIDE0=0x00000000   PWR_OVERRIDE1=0x00000000
[  778.623827] mali ff9a0000.gpu:   SHADER_CONFIG=0x00010000   L2_MMU_CONFIG=0x00000000
[  778.624504] mali ff9a0000.gpu:   TILER_CONFIG=0x00000001    JM_CONFIG=0x00000038
[  778.625175] mali ff9a0000.gpu: t6xx: GPU fault 0x4002 from job slot 1
[  779.125182] mali ff9a0000.gpu: Failed to soft-reset GPU (timed out after 500 ms), now attempting a hard reset
[  779.126104] mali ff9a0000.gpu: Reset complete
[  779.126550] mali ff9a0000.gpu: t6xx: GPU fault 0x4002 from job slot 0
[  784.265095] mali ff9a0000.gpu: JS: Job Hard-Stopped (took more than 50 ticks at 100 ms/tick)
[  784.765857] mali ff9a0000.gpu: JS: Job has been on the GPU for too long (JS_RESET_TICKS_SS/DUMPING timeout hit). Issueing GPU soft-reset to resolve.

Any insight or ideas are welcome.

mrfixit2001 commented 1 year ago

I'll give you one more error to reference as well... Let me know if you have any ideas based on any of this...

I tried increasing KBASE_AS_INACTIVE_MAX_LOOPS so it wouldn't think the gpu was stuck, but that either had no effect or caused the board to simply lock up without throwing an error...

But if I first trigger the iommu error with ffplay, and THEN try and start kodi, I get yet a completely different error :)

[  106.562177] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[  106.562968] Mem abort info:
[  106.563214]   ESR = 0x0000000096000007
[  106.563544]   EC = 0x25: DABT (current EL), IL = 32 bits
[  106.564008]   SET = 0, FnV = 0
[  106.564277]   EA = 0, S1PTW = 0
[  106.564554]   FSC = 0x07: level 3 translation fault
[  106.564982] Data abort info:
[  106.565235]   ISV = 0, ISS = 0x00000007, ISS2 = 0x00000000
[  106.565715]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[  106.566158]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  106.566623] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000168b0000
[  106.567186] [0000000000000000] pgd=08000000168a8003, p4d=08000000168a8003, pud=0800000007197003, pmd=08000000168bd003, pte=0000000000000000
[  106.568288] Internal error: Oops: 0000000096000007 [#1] SMP
[  106.568776] Modules linked in: 8021q btsdio hci_uart btqca btusb btrtl btbcm btintel bluetooth ecdh_generic ecc fusb302 tcpm rk_crypto spi_rockchip pwm_fan rk3399_dmc crypto_engine
[  106.570218] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G     U  W          6.5.0 #39
[  106.570867] Hardware name: Pine64 RockPro64 v2.1 (DT)
[  106.571310] pstate: 800000c5 (Nzcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  106.571921] pc : timekeeping_advance+0x7c/0x558
[  106.572332] lr : update_wall_time+0x10/0x2c
[  106.572704] sp : ffff800081923b10
[  106.572996] x29: ffff800081923b10 x28: ffff800081931600 x27: 0000000000000400
[  106.573625] x26: ffff800081aeb000 x25: ffff0000f77bff80 x24: 0000000000000000
[  106.574255] x23: ffff0000f7750848 x22: 00000018cf79f264 x21: ffff800081aebc00
[  106.574883] x20: ffff800081926000 x19: 0000000000000001 x18: 0000000000000000
[  106.575511] x17: 0000000000000000 x16: 0000000000000000 x15: 00000000000003f0
[  106.576139] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000001
[  106.576767] x11: 0000000000000002 x10: 0000000000000960 x9 : ffff800081926000
[  106.577395] x8 : 00000000000000c0 x7 : 00000000ffff216e x6 : ffff800081ac9a30
[  106.578025] x5 : 00ffffffffffffff x4 : ffff800081aebd28 x3 : 0000000000000000
[  106.578653] x2 : 0000000000000001 x1 : 0000000000000000 x0 : 0000000000000000
[  106.579283] Call trace:
[  106.579499]  timekeeping_advance+0x7c/0x558
[  106.579871]  update_wall_time+0x10/0x2c
[  106.580211]  tick_do_update_jiffies64+0xe4/0x150
[  106.580623]  tick_irq_enter+0x7c/0xac
[  106.580950]  irq_enter_rcu+0x60/0x74
[  106.581268]  el1_interrupt+0x24/0x4c
[  106.581587]  el1h_64_irq_handler+0x14/0x1c
[  106.581948]  el1h_64_irq+0x64/0x68
[  106.582252]  default_idle_call+0x24/0x34
[  106.582601]  do_idle+0xa4/0xf4
[  106.582873]  cpu_startup_entry+0x24/0x28
[  106.583220]  kernel_init+0x0/0x1cc
[  106.583521]  arch_post_acpi_subsys_init+0x0/0x8
[  106.583925]  start_kernel+0x4a4/0x57c
[  106.584251]  __primary_switched+0xb4/0xbc
[  106.584612] Code: 350020e1 f9409ea0 a90363f7 52000273 (f9400001) 
[  106.585146] ---[ end trace 0000000000000000 ]---
[  106.585551] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[  106.586151] SMP: stopping secondary CPUs
[  107.753162] SMP: failed to stop secondary CPUs 0,2-5
[  107.753598] Kernel Offset: disabled
[  107.753905] CPU features: 0x40000104,1a000000,0800400b
[  107.754356] Memory Limit: none
[  107.754628] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---

mrfixit2001 commented 1 year ago

I am also wondering if there is anything we need to change to integrate with this in the upstream kernel: https://github.com/torvalds/linux/commit/d08d42de6432d5064045159aed060e3db9fa7807

mrfixit2001 commented 12 months ago

@giuliobenetti You can close this issue - I was able to make everything work after my new patches on kernel 6.6 :)

The above mentioned errors were kind of misleading and ultimately the issue was actually related to a change in the kernel power management driver - not having to do with mali at all.

In any event - the initially defined error in this issue was resolved by my PR. I'll leave the rest of the error outputs for future reference in case anyone is googling and this comes up.

When you have time, I encourage you to refactor and adjust the DMA fence code to always call dma_resv_reserve_fences for both read and write DMA reservations, as well as update kbase_dma_fence_lock_reservations to use dma resv locks instead of ww mutexes - because this is what other upstream DRM drivers are doing.

Thanks again for all your help and time!

giuliobenetti commented 11 months ago

> @giuliobenetti You can close this issue - I was able to make everything work after my new patches on kernel 6.6 :)

Very well! Can you please point me to the Linux patches you're referring to? Once I have them, I'll test on my board.

> The above mentioned errors were kind of misleading and ultimately the issue was actually related to a change in the kernel power management driver - not having to do with mali at all.

OK, good to know.

> In any event - the initially defined error in this issue was resolved by my PR. I'll leave the rest of the error outputs for future reference in case anyone is googling and this comes up.

Thanks a lot for providing the fix.

> When you have time, I encourage you to refactor and adjust the DMA fence code to always call dma_resv_reserve_fences for both read and write DMA reservations, as well as update kbase_dma_fence_lock_reservations to use dma resv locks instead of ww mutexes - because this is what other upstream DRM drivers are doing.

Can you please point me to an example? Or, if you have patches, it would be great if you could contribute them, since you've already faced the problem.

> Thanks again for all your help and time!

Thank you too :-)

bootlin / mali-driver

Linux 6.2 issue #7
