PeterSuh-Q3 / tinycore-redpill

GNU General Public License v3.0

Strange freezing with MPT3SAS on SA6400 #22

Closed (kills closed this issue 5 months ago)

kills commented 6 months ago

Hi Peter,

Thank you for this awesome loader. I wanted to let you know that I've run into some strange freezes and lockups while running RAID rebuilds and data scrubbing on the SA6400.

Is it possible that the mpt3sas drivers still have some issues? Below are some of the logs I captured:

2024-01-03T14:46:51+11:00 Beast kernel: [19655.913031] BUG: kernel NULL pointer dereference, address: 0000000000000428
2024-01-03T14:46:51+11:00 Beast kernel: [19655.920157] #PF: supervisor read access in kernel mode
2024-01-03T14:46:51+11:00 Beast kernel: [19655.925382] #PF: error_code(0x0000) - not-present page
2024-01-03T14:46:51+11:00 Beast kernel: [19655.933253] Oops: 0000 [#1] SMP NOPTI
2024-01-03T14:46:51+11:00 Beast kernel: [19655.937018] CPU: 0 PID: 9273 Comm: kworker/u24:3 Tainted: P           OE     5.10.55+ #69057
2024-01-03T14:46:51+11:00 Beast kernel: [19655.945742] Hardware name: System manufacturer System Product Name/ROG STRIX Z370-F GAMING, BIOS 3004 07/12/2021
2024-01-03T14:46:51+11:00 Beast kernel: [19655.956158] Workqueue: raid5wq raid5_do_work [raid456]
2024-01-03T14:46:51+11:00 Beast kernel: [19655.961470] RIP: 0010:handle_stripe_clean_event+0x1ae/0x430 [raid456]
2024-01-03T14:46:51+11:00 Beast kernel: [19655.968129] Code: 87 08 01 00 00 4d 8b b7 30 01 00 00 49 c7 87 30 01 00 00 00 00 00 00 49 89 87 00 01 00 00 4d 85 f6 74 4e 49 8b 87 38 01 00 00 <49> 8b 76 28 48 8b 7b 28 48 8d 50 08 48 39 d6 73 3a 41 8b 46 30 c1
2024-01-03T14:46:51+11:00 Beast kernel: [19655.987458] RSP: 0018:ffffc9000122bbb0 EFLAGS: 00010206
2024-01-03T14:46:51+11:00 Beast kernel: [19655.992837] RAX: 00000003e8c975f0 RBX: ffff88811cdec800 RCX: 0000000000000001
2024-01-03T14:46:51+11:00 Beast kernel: [19656.000027] RDX: 0000000000000002 RSI: ffff88812cc32e80 RDI: ffff88811cdec800
2024-01-03T14:46:51+11:00 Beast kernel: [19656.007331] RBP: ffff88812cc32e80 R08: 0000000000000000 R09: ffff88812cc330f2
2024-01-03T14:46:51+11:00 Beast kernel: [19656.014575] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88812cc32e80
2024-01-03T14:46:51+11:00 Beast kernel: [19656.021877] R13: 0000000000000005 R14: 0000000000000400 R15: ffff88812cc33250
2024-01-03T14:46:51+11:00 Beast kernel: [19656.029158] FS:  0000000000000000(0000) GS:ffff888fe8c00000(0000) knlGS:0000000000000000
2024-01-03T14:46:51+11:00 Beast kernel: [19656.037379] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-01-03T14:46:51+11:00 Beast kernel: [19656.043228] CR2: 0000000000000428 CR3: 000000082aa0a003 CR4: 00000000003706f0
2024-01-03T14:46:51+11:00 Beast kernel: [19656.050427] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2024-01-03T14:46:51+11:00 Beast kernel: [19656.057721] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
2024-01-03T14:46:51+11:00 Beast kernel: [19656.065033] Call Trace:
2024-01-03T14:46:51+11:00 Beast kernel: [19656.067594]  [<ffffffffa1167ec4>] handle_stripe+0xed4/0x2070 [raid456]
2024-01-03T14:46:51+11:00 Beast kernel: [19656.074289]  [<ffffffff810aecb3>] ? __wake_up_common+0x73/0x130
2024-01-03T14:46:51+11:00 Beast kernel: [19656.080280]  [<ffffffffa1169361>] handle_active_stripes.constprop.0+0x301/0x4a0 [raid456]
2024-01-03T14:46:51+11:00 Beast kernel: [19656.088643]  [<ffffffffa116a07c>] raid5_do_work+0x8c/0x1a0 [raid456]
2024-01-03T14:46:51+11:00 Beast kernel: [19656.095149]  [<ffffffff8108e2f8>] worker_run_work+0xb8/0xd0
2024-01-03T14:46:51+11:00 Beast kernel: [19656.100887]  [<ffffffffa1169ff0>] ? raid5d+0xaf0/0xaf0 [raid456]
2024-01-03T14:46:51+11:00 Beast kernel: [19656.107092]  [<ffffffff81080bd3>] process_one_work+0x1d3/0x3c0
2024-01-03T14:46:51+11:00 Beast kernel: [19656.113032]  [<ffffffff81080e0d>] worker_thread+0x4d/0x3e0
2024-01-03T14:46:51+11:00 Beast kernel: [19656.118632]  [<ffffffff81080dc0>] ? process_one_work+0x3c0/0x3c0
2024-01-03T14:46:51+11:00 Beast kernel: [19656.124716]  [<ffffffff8108684d>] kthread+0x12d/0x150
2024-01-03T14:46:51+11:00 Beast kernel: [19656.129879]  [<ffffffff81086720>] ? __kthread_bind_mask+0x60/0x60
2024-01-03T14:46:51+11:00 Beast kernel: [19656.136181]  [<ffffffff8100174f>] ret_from_fork+0x1f/0x30
2024-01-03T14:46:51+11:00 Beast kernel: [19656.141682] Modules linked in: nfnetlink xfrm_user xfrm_algo xt_conntrack xt_addrtype fuse ip6table_filter ip6_tables xt_MASQUERADE xt_REDIRECT xt_nat iptable_nat nf_nat xt_recent xt_iprange xt_limit xt_state xt_tcpudp xt_multiport xt_LOG nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables tun nvidia_drm(POE) nvidia_modeset(POE) nvidia_uvm(OE) nvidia(OE) udf isofs cmac cifs synoacl_vfs(O) btrfs adt7475 hwmon_vid raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq 8021q vfat fat bonding leds_atmega1608 lrw gf128mul epyc7002_synobios(O) uhci_hcd ehci_pci ehci_hcd be2net bnxt_en(O) bnx2x(O) qede(O) qed(O) mlx5_core(O) mlx4_en(O) mlx4_core(O) atlantic_v2(O) atlantic(O) r8168(O) ixgbe(O) i2c_algo_bit igb(O) dm_crypt essiv sg dm_snapshot dm_bufio crc_itu_t crc_ccitt psnap p8022 llc hfsplus md4 hmac sit tunnel4 ipv6 flashcache_syno(O) syno_flashcache_control(O) dm_mod zstd zstd_compress lzo_rle lzo aesni_intel glue_helper crypto_simd libarc4
2024-01-03T14:46:51+11:00 Beast kernel: [19656.141717]  cryptd ecb authenc des_generic libdes ansi_cprng cts md5 cbc cpufreq_powersave cpufreq_performance vxlan ip6_udp_tunnel udp_tunnel zram ip_tunnel loop sha256_generic libsha256(OE) synorbd(O) synofsbd(O) mei_pxp(OE) mei_hdcp(OE) uas mei_me(OE) usbhid mei(OE) mpt3sas(OE) hid dmabuf(OE) scsi_transport_sas(OE) e1000e(OE) thermal(OE) usb_storage rtc_cmos(OE) button(OE) xhci_pci xhci_hcd usbcore usb_common [last unloaded: veth]
2024-01-03T14:46:51+11:00 Beast kernel: [19656.269766] CR2: 0000000000000428
2024-01-03T14:46:51+11:00 Beast kernel: [19656.273154] ---[ end trace cbba4ee9d6caaffa ]---
2024-01-03T14:46:51+11:00 Beast kernel: [19656.277855] RIP: 0010:handle_stripe_clean_event+0x1ae/0x430 [raid456]
2024-01-03T14:46:51+11:00 Beast kernel: [19656.284421] Code: 87 08 01 00 00 4d 8b b7 30 01 00 00 49 c7 87 30 01 00 00 00 00 00 00 49 89 87 00 01 00 00 4d 85 f6 74 4e 49 8b 87 38 01 00 00 <49> 8b 76 28 48 8b 7b 28 48 8d 50 08 48 39 d6 73 3a 41 8b 46 30 c1
2024-01-03T14:46:51+11:00 Beast kernel: [19656.303495] RSP: 0018:ffffc9000122bbb0 EFLAGS: 00010206
2024-01-03T14:46:51+11:00 Beast kernel: [19656.308806] RAX: 00000003e8c975f0 RBX: ffff88811cdec800 RCX: 0000000000000001
2024-01-03T14:46:51+11:00 Beast kernel: [19656.316050] RDX: 0000000000000002 RSI: ffff88812cc32e80 RDI: ffff88811cdec800
2024-01-03T14:46:51+11:00 Beast kernel: [19656.323302] RBP: ffff88812cc32e80 R08: 0000000000000000 R09: ffff88812cc330f2
2024-01-03T14:46:51+11:00 Beast kernel: [19656.330578] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88812cc32e80
2024-01-03T14:46:51+11:00 Beast kernel: [19656.337838] R13: 0000000000000005 R14: 0000000000000400 R15: ffff88812cc33250
2024-01-03T14:46:51+11:00 Beast kernel: [19656.345089] FS:  0000000000000000(0000) GS:ffff888fe8c00000(0000) knlGS:0000000000000000
2024-01-03T14:46:51+11:00 Beast kernel: [19656.353328] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-01-03T14:46:51+11:00 Beast kernel: [19656.359185] CR2: 0000000000000428 CR3: 000000082aa0a003 CR4: 00000000003706f0
2024-01-03T14:46:51+11:00 Beast kernel: [19656.366434] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2024-01-03T14:46:51+11:00 Beast kernel: [19656.373704] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
2024-01-03T14:46:51+11:00 Beast kernel: [19656.380967] BUG: sleeping function called from invalid context at include/linux/percpu-rwsem.h:49
2024-01-03T14:46:51+11:00 Beast kernel: [19656.389967] in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 9273, name: kworker/u24:3
2024-01-03T14:46:51+11:00 Beast kernel: [19656.398546] CPU: 0 PID: 9273 Comm: kworker/u24:3 Tainted: P      D    OE     5.10.55+ #69057
2024-01-03T14:46:51+11:00 Beast kernel: [19656.407095] Hardware name: System manufacturer System Product Name/ROG STRIX Z370-F GAMING, BIOS 3004 07/12/2021
2024-01-03T14:46:51+11:00 Beast kernel: [19656.417441] Workqueue: raid5wq raid5_do_work [raid456]
2024-01-03T14:46:51+11:00 Beast kernel: [19656.422668] Call Trace:
2024-01-03T14:46:51+11:00 Beast kernel: [19656.425167]  [<ffffffff81769200>] dump_stack+0x57/0x6a
2024-01-03T14:46:51+11:00 Beast kernel: [19656.430406]  [<ffffffff817597b4>] ___might_sleep.cold+0x8c/0x9d
2024-01-03T14:46:51+11:00 Beast kernel: [19656.436447]  [<ffffffff8107598c>] exit_signals+0x1c/0x1a0
2024-01-03T14:46:51+11:00 Beast kernel: [19656.441948]  [<ffffffff81068628>] do_exit+0xb8/0x9b0
2024-01-03T14:46:51+11:00 Beast kernel: [19656.447008]  [<ffffffff81080dc0>] ? process_one_work+0x3c0/0x3c0
2024-01-03T14:46:51+11:00 Beast kernel: [19656.453125]  [<ffffffff8108684d>] ? kthread+0x12d/0x150
2024-01-03T14:46:51+11:00 Beast kernel: [19656.458445]  [<ffffffff81001777>] rewind_stack_do_exit+0x17/0x20
2024-01-03T14:46:51+11:00 Beast kernel: [19656.464552] RIP: 0000:0x0
2024-01-03T14:46:51+11:00 Beast kernel: [19656.467218] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
2024-01-03T14:46:51+11:00 Beast kernel: [19656.474209] RSP: 0000:0000000000000000 EFLAGS: 00000000 ORIG_RAX: 0000000000000000
2024-01-03T14:46:51+11:00 Beast kernel: [19656.481911] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
2024-01-03T14:46:51+11:00 Beast kernel: [19656.489213] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
2024-01-03T14:46:51+11:00 Beast kernel: [19656.496473] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
2024-01-03T14:46:51+11:00 Beast kernel: [19656.503744] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
2024-01-03T14:46:51+11:00 Beast kernel: [19656.510978] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

I also captured some other messages:

2024-01-03T15:20:17+11:00 Beast kernel: [  202.204672] synostgd-space[27700]: segfault at 4e46 ip 0000000000004e46 sp 00007fff19310d38 error 14 in synostoraged[55ce72fbd000+6000]
2024-01-03T15:20:17+11:00 Beast kernel: [  202.204676] Code: Unable to access opcode bytes at RIP 0x4e1c.
2024-01-03T15:20:17+11:00 Beast coredump[27711]: Process synostgd-space[27700](/usr/syno/sbin/synostoraged) dumped core on signal [11]. Core file [/volume2/@synostgd-space.synology_epyc7002_sa6400.69057.core.gz]. Cmdline [synostgd-space              ]
2024-01-03T15:20:17+11:00 Beast coredump[27711]: Core file [/volume2/@synostgd-space.synology_epyc7002_sa6400.69057.core.gz] size [804463]
2024-01-03T15:21:53+11:00 Beast kernel: [  298.353449] synostgd-space[29481]: segfault at 4e46 ip 0000000000004e46 sp 00007fff19310d38 error 14 in synostoraged[55ce72fbd000+6000]
2024-01-03T15:21:53+11:00 Beast kernel: [  298.365781] Code: Unable to access opcode bytes at RIP 0x4e1c.
2024-01-03T15:21:53+11:00 Beast coredump[29487]: Process synostgd-space[29481](/usr/syno/sbin/synostoraged) dumped core on signal [11]. Core file [/volume2/@synostgd-space.synology_epyc7002_sa6400.69057.core.gz]. Cmdline [synostgd-space              ]
2024-01-03T15:21:53+11:00 Beast coredump[29487]: Core file [/volume2/@synostgd-space.synology_epyc7002_sa6400.69057.core.gz] size [804658]

I have seen that some of the other loaders, RR and ARC, mention that they recently updated their mpt3sas drivers due to issues with the SA6400. Could there be a bug in the current drivers in your tinycore loader?

Appreciate any help you can provide.

Thank you,

Kills

kills commented 6 months ago

Just to add to the above: I created a new bootloader and migrated the data to a DS3622xs, and it's successfully rebuilding and scrubbing without any freezing. I retained the existing configuration when migrating, so there is likely something wrong with the SA6400 configuration.

PeterSuh-Q3 commented 6 months ago

The logs you provided don't seem to contain any clues about a kernel panic directly related to mpt3sas. raid456 shows up often in the trace, so I am suspicious of that module. First, I will check the latest version of rr's mpt3sas module and include it in sa6400 again.
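For readers who want to check which modules a trace actually implicates, here is a minimal sketch that tallies the `[module]` tags on the RIP and call-trace frames. The regular expression and the function name `modules_in_trace` are illustrative assumptions, not part of any loader, and the sample lines are copied from the log above with the syslog prefix trimmed.

```python
import re
from collections import Counter

# Matches frames of the form "symbol+0xOFF/0xLEN [module]", as printed on the
# RIP line and in the call trace. Frames in the core kernel carry no
# "[module]" tag and are therefore not counted.
FRAME_RE = re.compile(r"(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(?P<mod>\w+)\]")


def modules_in_trace(log_text):
    """Count how often each loadable module appears in symbol+offset frames."""
    counts = Counter()
    for line in log_text.splitlines():
        # The "Modules linked in:" line lists every loaded module, so it says
        # nothing about which one faulted; skip it.
        if "Modules linked in:" in line:
            continue
        match = FRAME_RE.search(line)
        if match:
            counts[match.group("mod")] += 1
    return counts


# Sample lines taken from the log in this issue (timestamps shortened).
sample = """\
[19655.956158] Workqueue: raid5wq raid5_do_work [raid456]
[19655.961470] RIP: 0010:handle_stripe_clean_event+0x1ae/0x430 [raid456]
[19656.067594]  [<ffffffffa1167ec4>] handle_stripe+0xed4/0x2070 [raid456]
[19656.074289]  [<ffffffff810aecb3>] ? __wake_up_common+0x73/0x130
[19656.088643]  [<ffffffffa116a07c>] raid5_do_work+0x8c/0x1a0 [raid456]
"""

counts = modules_in_trace(sample)
print(counts)
```

On the sample lines, every tagged frame belongs to raid456 and none to mpt3sas, which matches the observation above. A driver fault can still corrupt memory that another module later trips over, so a tally like this narrows the search rather than proving the cause.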

kills commented 6 months ago

Thank you for your dedication to this project, Peter.

Just to add, the above errors were from my main NAS.

I have built a test NAS on some spare parts I had available, including an Avago mpt3sas raid card with 5 disks attached. Happy to test any beta or potential new drivers if you like.

I was interested to see how RR performs, so I grabbed the latest RR today (24.1.0). It has been running an SA6400 configuration for the past 6 hours with no crashes so far. I will continue to monitor and report back any findings.

Did you end up seeing any differences between the driver versions?

Update: Scrubbing completed successfully without any freezes or errors.

PeterSuh-Q3 commented 6 months ago

I extracted mpt3sas.ko from the following module package, which is included in the latest release of mshell:

https://github.com/PeterSuh-Q3/arpl-modules/releases/download/v1.64/epyc7002-7.2-5.10.55.tgz

I then extracted the same mpt3sas.ko from the rr 24.1.0 integrated modules and compared the two. Unfortunately, there is no difference between them.
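For anyone who wants to repeat this kind of comparison, the sketch below hashes one module from each of two gzip-compressed tar archives and checks whether the copies are byte-identical. It assumes both module sets are available locally as `.tgz` files. The function names are hypothetical, and this is not necessarily how the comparison above was performed.

```python
import hashlib
import os
import tarfile


def module_digest(archive_path, module_name="mpt3sas.ko"):
    """Return the SHA-256 hex digest of module_name inside a .tgz archive."""
    with tarfile.open(archive_path, mode="r:gz") as tar:
        for member in tar:
            # Match on the base name so the module is found regardless of
            # which directory it sits in inside the archive.
            if member.isfile() and os.path.basename(member.name) == module_name:
                data = tar.extractfile(member).read()
                return hashlib.sha256(data).hexdigest()
    raise FileNotFoundError(f"{module_name} not found in {archive_path}")


def same_module(archive_a, archive_b, module_name="mpt3sas.ko"):
    """True when both archives carry a byte-identical copy of module_name."""
    return module_digest(archive_a, module_name) == module_digest(archive_b, module_name)
```

A call such as `same_module("epyc7002-7.2-5.10.55.tgz", "rr-modules.tgz")` would return `True` for identical drivers. The second file name is a placeholder for wherever the rr modules were saved.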

I also have Dell PERC H200 and H310 cards, which use the same chipset as yours. Since it is now confirmed that mshell uses the same module as rr, I will install disks on them and monitor stability for a few days.

PeterSuh-Q3 commented 4 months ago

Today we changed mshell's HBA handling to be almost identical to rr's mechanism, so that HBAs operate stably. HBAs that did not work until now should work. If possible, please test it.