RedPill-TTG / redpill-lkm

Linux kernel module for RedPill
GNU General Public License v3.0

Kernel Panic or Lockup on DS3615xs image #21

Open WiteWulf opened 2 years ago

WiteWulf commented 2 years ago

A number of users on the forum are reporting kernel panics on baremetal installs, or lockups of guests on virtualisation platforms, when running the DS3615xs image specifically. This is typically triggered by running docker, either in general or with certain images, but in some circumstances may also be caused by high IO load. A common factor is database workloads (notably influxdb, mariadb, mysql and elasticsearch), but nginx and jdownloader2 have also been implicated.

This has been observed on baremetal HP Gen7 and Gen8 servers, and on Proxmox and ESXi, with a variety of Xeon CPUs (E3-1265L V2, E3-1270 V2, E3-1241 V3, E3-1220L V2 and E3-1265L V4) as well as Celeron and AMD processors.

Most users are on DSM 7.0.1-RC1, but I also observed this behaviour on DSM 6.2.4.

(edit: also confirmed to affect 7.0 beta and 7.0.1, i.e. not just the release candidate)

Conversely, a number of users with DS918+ images have reported no issues running docker or the known problematic images (in my case influxdb causes a 100% reproducible crash).

On my baremetal HP Gen8 running 6.2.4 I get the following console output before a reboot:

[  191.452302] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
[  191.487637] CPU: 3 PID: 19775 Comm: containerd-shim Tainted: PF          O 3.10.105 #25556
[  191.528112] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[  191.562597]  ffffffff814c904d ffffffff814c8121 0000000000000010 ffff880109ac8d58
[  191.599118]  ffff880109ac8cf0 0000000000000000 0000000000000003 000000000000002c
[  191.634943]  0000000000000003 ffffffff80000001 0000000000000010 ffff880103817c00
[  191.670604] Call Trace:
[  191.682506]  <NMI>  [<ffffffff814c904d>] ? dump_stack+0xc/0x15
[  191.710494]  [<ffffffff814c8121>] ? panic+0xbb/0x1ce
[  191.735108]  [<ffffffff810a0922>] ? watchdog_overflow_callback+0xb2/0xc0
[  191.768203]  [<ffffffff810b152b>] ? __perf_event_overflow+0x8b/0x240
[  191.799789]  [<ffffffff810b02d4>] ? perf_event_update_userpage+0x14/0xf0
[  191.834349]  [<ffffffff81015411>] ? intel_pmu_handle_irq+0x1d1/0x360
[  191.865505]  [<ffffffff81010026>] ? perf_event_nmi_handler+0x26/0x40
[  191.897683]  [<ffffffff81005fa8>] ? do_nmi+0xf8/0x3e0
[  191.922372]  [<ffffffff814cfa53>] ? end_repeat_nmi+0x1e/0x7e
[  191.950899]  <<EOE>>
[  191.961095] Rebooting in 3 seconds..

This is seen by others on baremetal when using docker. Virtualisation platform users see 100% CPU usage on their xpenology guest and it becomes unresponsive, requiring a restart of the guest. The majority of kernel panics cite containerd-shim as being at fault, but sometimes (rarely) it will list a process being run inside a docker container (notably influxdb in my case).
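To compare the forum reports, I tally which process each lockup line blames with a quick script. This is a rough sketch only: it assumes you have a saved serial-console log and GNU grep/sed; the function name is mine, not part of any tooling here.

```shell
#!/bin/sh
# Hypothetical helper: count the processes blamed in "soft lockup" lines of a
# saved console log, e.g.:
#   [ 1167.089552] BUG: soft lockup - CPU#1 stuck for 41s! [runc:19580]
# Assumes GNU grep/sed. Pass the log file as the first argument.
lockup_procs() {
  grep -o 'soft lockup - CPU#[0-9]* stuck for [0-9]*s! \[[^]]*\]' "$1" |
    sed -e 's/.*\[//' -e 's/:[0-9]*\]$//' |   # keep only the comm name
    sort | uniq -c | sort -rn                 # most frequent offender first
}
```

Running it over my logs would show runc/containerd-shim dominating, with the occasional in-container process such as influxd.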

This is notably similar to an issue reported against RHEL a number of years ago, which Red Hat notes was fixed in a subsequent kernel release: https://access.redhat.com/solutions/1354963
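For what it's worth, the hard-lockup panic above is raised by the kernel's NMI watchdog (visible as watchdog_overflow_callback in the trace), and it can be disarmed while debugging so the box stays up long enough to collect logs. A minimal sketch, assuming the standard Linux procfs knob is exposed on these 3.10 DSM kernels (untested there); the optional path argument exists only so the function can be exercised against a fake file:

```shell
#!/bin/sh
# Sketch: read the hard-lockup (NMI) watchdog state. /proc/sys/kernel/nmi_watchdog
# is the standard Linux knob; whether DSM's kernels expose it is an assumption.
# An optional argument overrides the path, purely for testing.
nmi_watchdog_state() {
  cat "${1:-/proc/sys/kernel/nmi_watchdog}"   # prints 1 (armed) or 0 (disabled)
}

# To stop a hard lockup from panicking/rebooting the box while debugging
# (root only; this hides the symptom, it does not fix the lockup):
#   echo 0 > /proc/sys/kernel/nmi_watchdog
```

Disabling the watchdog would only suppress the panic/reboot; the 100% CPU spin itself would presumably remain.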

OrpheeGT commented 2 years ago

Like @WiteWulf, I have the same issue on an ESXi VM on an HP Gen8.

It happens randomly, even when I run only a simple nginx container:

[ 923.411516] device dockerdbd5db9 entered promiscuous mode
[ 923.413414] IPv6: ADDRCONF(NETDEV_UP): dockerdbd5db9: link is not ready
[ 923.533359] <redpill/smart_shim.c:794> Handling ioctl(0x31f) for /dev/sda
[ 923.534667] <redpill/smart_shim.c:624> Got SMART command - looking for feature=0xd0
[ 923.536000] <redpill/smart_shim.c:376> Generating fake SMART values
[ 924.059100] IPv6: ADDRCONF(NETDEV_CHANGE): dockerdbd5db9: link becomes ready
[ 924.060327] docker0: port 1(dockerdbd5db9) entered forwarding state
[ 924.061379] docker0: port 1(dockerdbd5db9) entered forwarding state
[ 924.062448] IPv6: ADDRCONF(NETDEV_CHANGE): docker0: link becomes ready
[ 939.050818] docker0: port 1(dockerdbd5db9) entered forwarding state
[ 975.910411] <redpill/smart_shim.c:794> Handling ioctl(0x31f) for /dev/sda
[ 975.911848] <redpill/smart_shim.c:624> Got SMART command - looking for feature=0xd0
[ 975.913185] <redpill/smart_shim.c:376> Generating fake SMART values
[ 1035.865304] <redpill/smart_shim.c:794> Handling ioctl(0x31f) for /dev/sda
[ 1035.866688] <redpill/smart_shim.c:624> Got SMART command - looking for feature=0xd0
[ 1035.867966] <redpill/smart_shim.c:376> Generating fake SMART values
[ 1095.820121] <redpill/smart_shim.c:794> Handling ioctl(0x31f) for /dev/sda
[ 1095.821465] <redpill/smart_shim.c:624> Got SMART command - looking for feature=0xd0
[ 1095.822764] <redpill/smart_shim.c:376> Generating fake SMART values
[ 1167.089552] BUG: soft lockup - CPU#1 stuck for 41s! [runc:19580]
[ 1167.090581] Modules linked in: nfnetlink xfrm_user xfrm_algo fuse bridge stp aufs macvlan veth xt_conntrack xt_addrtype nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipt_MASQUERADE xt_REDIRECT xt_nat iptable_nat nf_nat_ipv4 nf_nat xt_recent xt_iprange xt_limit xt_state xt_tcpudp xt_multiport xt_LOG nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter ip_tables x_tables 8021q vhost_scsi(O) vhost(O) tcm_loop(O) iscsi_target_mod(O) target_core_ep(O) target_core_multi_file(O) target_core_file(O) target_core_iblock(O) target_core_mod(O) syno_extent_pool(PO) rodsp_ep(O) cdc_acm ftdi_sio ch341(OF) cp210x(OF) usbserial udf isofs loop synoacl_vfs(PO) btrfs zstd_decompress ecryptfs zstd_compress xxhash xor raid6_pq aesni_intel glue_helper lrw gf128mul ablk_helper zram(C) bromolow_synobios(PO) hid_generic usbhid hid usblp bnx2x(O) mdio mlx5_core(O) mlx4_en(O) mlx4_core(O) mlx_compat(O) qede(O) qed(O) atlantic_v2(O) atlantic(O) tn40xx(O) i40e(O) ixgbe(O) be2net(O) i2c_algo_bit igb(O) dca e1000e(O) sg dm_snapshot crc_itu_t crc_ccitt psnap p8022 llc zlib_deflate libcrc32c hfsplus md4 hmac sit tunnel4 ipv6 flashcache_syno(O) flashcache(O) syno_flashcache_control(O) dm_mod crc32c_intel cryptd arc4 sha256_generic sha1_generic ecb aes_x86_64 authenc des_generic ansi_cprng cts md5 cbc cpufreq_powersave cpufreq_performance mperf processor thermal_sys cpufreq_stats freq_table vxlan ip_tunnel vmxnet3(F) etxhci_hcd mpt2sas(O) usb_storage xhci_hcd uhci_hcd ehci_pci ehci_hcd usbcore usb_common redpill(OF) [last unloaded: bromolow_synobios]
[ 1167.115855] CPU: 1 PID: 19580 Comm: runc Tainted: PF C O 3.10.108 #42214
[ 1167.117041] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
[ 1167.118750] task: ffff8800b656c820 ti: ffff880099568000 task.ti: ffff880099568000
[ 1167.119968] RIP: 0010:[] [] generic_exec_single+0x76/0xe0
[ 1167.121380] RSP: 0000:ffff88009956bc20 EFLAGS: 00000202
[ 1167.122253] RAX: 00000000000008fb RBX: 00000037ffffffc8 RCX: 000000000000008a
[ 1167.123402] RDX: 0000000000000008 RSI: 00000000000000fb RDI: ffffffff81606628
[ 1167.124552] RBP: ffff88009956bc60 R08: ffff880134888758 R09: 0000000000000020
[ 1167.125708] R10: 0000000000004041 R11: 0000000000000000 R12: 0000004000000001
[ 1167.126858] R13: 0000000000000000 R14: 0000005000000041 R15: ffffffff81894a98
[ 1167.128009] FS: 00007f17f0044740(0000) GS:ffff88013dd00000(0000) knlGS:0000000000000000
[ 1167.129322] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1167.130253] CR2: 000000c00024ef68 CR3: 00000000aad04000 CR4: 00000000001607e0
[ 1167.131420] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1167.132574] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1167.133726] Stack:
[ 1167.134067] 0000000000000001 ffff88009956bcb0 0000000000000000 ffffffff818a90d0
[ 1167.135366] ffffffff8102fcc0 ffffffff8109007e 0000000000000000 ffffffff818a90d0
[ 1167.136656] ffff88013dc13980 ffff88013dc13980 ffffffff8102fcc0 ffff88009956bcc0
[ 1167.137952] Call Trace:
[ 1167.138364] [] ? do_flush_tlb_all+0x160/0x160
[ 1167.139339] [] ? smp_call_function_single+0x12e/0x150
[ 1167.140431] [] ? do_flush_tlb_all+0x160/0x160
[ 1167.141413] [] ? flush_tlb_page+0x72/0x130
[ 1167.142352] [] ? ptep_clear_flush+0x22/0x30
[ 1167.143304] [] ? do_wp_page+0x2ad/0x8c0
[ 1167.144194] [] ? handle_pte_fault+0x38d/0x9e0
[ 1167.145166] [] ? handle_mm_fault+0x135/0x2e0
[ 1167.146123] [] ? do_page_fault+0x14a/0x500
[ 1167.147094] [] ? vfs_read+0x140/0x170
[ 1167.147956] [] ? SyS_read+0x84/0xb0
[ 1167.148793] [] ? page_fault+0x22/0x30
[ 1167.149660] Code: 89 55 08 48 89 2a e8 8a 78 41 00 4c 39 f3 75 0f 44 89 e7 48 8b 05 fb f1 78 00 e8 96 4d 20 00 f6 45 20 01 74 08 f3 90 f6 45 20 01 <75> f8 5b 5d 41 5c 41 5d 41 5e c3 0f 1f 80 00 00 00 00 4c 8d 6b

(The same soft-lockup trace for [runc:19580] then repeats at roughly 48-second intervals, at ~1215s, ~1263s, ~1311s, ~1359s and ~1407s, until the VM is reset; only the reported offset within generic_exec_single varies slightly between repeats.)

The CPU jumps to 100%, DSM is unresponsive, and I must reset the VM.


WiteWulf commented 2 years ago

There have been a couple of further reports from users with Celeron CPUs (J1900), so this is not limited to the Xeon architecture. All are running the DS3615xs image.

OrpheeGT commented 2 years ago

Hello,

I tried with DSM 6.2.4: DSM hangs as soon as docker starts... I must reset the VM.

BUG: soft lockup - CPU#1 stuck for 41s! [fileindexd:12641]
Modules linked in: hid_generic cifs udf isofs loop tcm_loop(O) iscsi_target_mod(O) target_core_ep(O) target_core_multi_file(O) target_core_file(O) target_core_iblock(O) target_core_mod(O) syno_extent_pool(PO) rodsp_ep(O) usbhid hid usblp bromolow_synobios(PO) exfat(O) btrfs synoacl_vfs(PO) zlib_deflate hfsplus md4 hmac bnx2x(O) libcrc32c mdio mlx5_core(O) mlx4_en(O) mlx4_core(O) mlx_compat(O) compat(O) qede(O) qed(O) atlantic(O) tn40xx(O) i40e(O) ixgbe(O) be2net(O) igb(O) i2c_algo_bit e1000e(O) dca vxlan fuse vfat fat crc32c_intel aesni_intel glue_helper lrw gf128mul ablk_helper arc4 cryptd ecryptfs sha256_generic sha1_generic ecb aes_x86_64 authenc des_generic ansi_cprng cts md5 cbc cpufreq_conservative cpufreq_powersave cpufreq_performance cpufreq_ondemand mperf processor thermal_sys cpufreq_stats freq_table dm_snapshot crc_itu_t crc_ccitt quota_v2 quota_tree psnap p8022 llc sit tunnel4 ip_tunnel ipv6 zram(C) sg etxhci_hcd usb_storage xhci_hcd uhci_hcd ehci_pci ehci_hcd usbcore usb_common redpill(OF) [last unloaded: bromolow_synobios]
CPU: 1 PID: 12641 Comm: fileindexd Tainted: PF C O 3.10.105 #25556
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
task: ffff8801284fc800 ti: ffff880117a70000 task.ti: ffff880117a70000
RIP: 0010:[] [] generic_exec_single+0x68/0xe0
RSP: 0018:ffff880117a73cc0 EFLAGS: 00000202
RAX: 00000000000008fb RBX: ffff8801377dbec0 RCX: 0000000000000002
RDX: ffffffff816057c8 RSI: 00000000000000fb RDI: ffffffff816057c8
RBP: ffff88013dc12a80 R08: ffff88012788d358 R09: 0000000000000000
R10: 00007fc8c029a0a0 R11: ffff880115878ac0 R12: 8000000111e93067
R13: ffffea0003cc1348 R14: ffff8801284fc800 R15: ffffffff810d6cf5
FS: 00007fc8c2508700(0000) GS:ffff88013dd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc8c2407000 CR3: 00000001156e4000 CR4: 00000000001607e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
0000000000000000 ffff880117a73d50 0000000000000001 ffffffff8186ec10
ffffffff8102f9e0 ffffffff81087e55 0000000000000001 0000000000000000
ffff88013dc12a80 ffff88013dc12a80 ffffffff8102f9e0 ffff880117a73d70
Call Trace:
[] ? do_flush_tlb_all+0x170/0x170
[] ? smp_call_function_single+0xd5/0x160
[] ? do_flush_tlb_all+0x170/0x170
[] ? flush_tlb_mm_range+0x22c/0x300
[] ? tlb_flush_mmu.part.66+0x29/0x80
[] ? tlb_finish_mmu+0x3d/0x40
[] ? unmap_region+0xbe/0x100
[] ? show_vfsmnt+0x104/0x140
[] ? vma_rb_erase+0x121/0x260
[] ? do_munmap+0x2ed/0x690
[] ? vm_munmap+0x36/0x50
[] ? SyS_munmap+0x5/0x10
[] ? system_call_fastpath+0x22/0x27
Code: 48 89 c6 48 89 5d 08 4c 89 ef 48 89 2b 48 89 53 08 48 89 1a e8 aa 68 44 00 4c 39 f5 74 6b f6 43 20 01 74 0f 0f 1f 80 00 00 00 00 90 f6 43 20 01 75 f8 5b 5d 41 5c 41 5d 41 5e c3 0f 1f 80 00

Edit:

Actually, it is not only docker.

I installed Moments on 6.2.4 and tried to import 20 files in a row...

The system froze:

[ 1230.717458] BUG: soft lockup - CPU#1 stuck for 41s! [fileindexd:22537]
[ 1230.718657] Modules linked in: cifs udf isofs loop hid_generic tcm_loop(O) iscsi_target_mod(O) target_core_ep(O) target_core_multi_file(O) target_core_file(O) target_core_iblock(O) target_core_mod(O) syno_extent_pool(PO) rodsp_ep(O) usbhid hid usblp bromolow_synobios(PO) exfat(O) btrfs synoacl_vfs(PO) zlib_deflate hfsplus md4 hmac bnx2x(O) libcrc32c mdio mlx5_core(O) mlx4_en(O) mlx4_core(O) mlx_compat(O) compat(O) qede(O) qed(O) atlantic(O) tn40xx(O) i40e(O) ixgbe(O) be2net(O) igb(O) i2c_algo_bit e1000e(O) dca vxlan fuse vfat fat crc32c_intel aesni_intel glue_helper lrw gf128mul ablk_helper arc4 cryptd ecryptfs sha256_generic sha1_generic ecb aes_x86_64 authenc des_generic ansi_cprng cts md5 cbc cpufreq_conservative cpufreq_powersave cpufreq_performance cpufreq_ondemand mperf processor thermal_sys cpufreq_stats freq_table dm_snapshot crc_itu_t crc_ccitt quota_v2 quota_tree psnap p8022 llc sit tunnel4 ip_tunnel ipv6 zram(C) sg etxhci_hcd usb_storage xhci_hcd uhci_hcd ehci_pci ehci_hcd usbcore usb_common redpill(OF) [last unloaded: bromolow_synobios]
[ 1230.736700] CPU: 1 PID: 22537 Comm: fileindexd Tainted: PF C O 3.10.105 #25556
[ 1230.738014] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
[ 1230.739795] task: ffff8800a83bf040 ti: ffff8800a6114000 task.ti: ffff8800a6114000
[ 1230.741052] RIP: 0010:[] [] generic_exec_single+0x6e/0xe0
[ 1230.742562] RSP: 0018:ffff8800a6117cc0 EFLAGS: 00000202
[ 1230.743461] RAX: 00000000000008fb RBX: 0000000000000001 RCX: 00000000000000d8
[ 1230.744658] RDX: ffffffff816057c8 RSI: 00000000000000fb RDI: ffffffff816057c8
[ 1230.745856] RBP: ffff88013dc12a80 R08: ffff88008cff6358 R09: 0000000000000000
[ 1230.747056] R10: 0000000000000022 R11: ffff88013390bf80 R12: ffff88008d58f7d0
[ 1230.748254] R13: ffff880108538788 R14: ffffffff81108d4d R15: ffff8800a6117da0
[ 1230.749452] FS: 00007fa28e90b700(0000) GS:ffff88013dd00000(0000) knlGS:0000000000000000
[ 1230.750805] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1230.751775] CR2: 00007fa2a361e000 CR3: 000000008d470000 CR4: 00000000001607e0
[ 1230.752994] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1230.754159] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1230.755356] Stack:
[ 1230.755708]  0000000000000000 ffff8800a6117d50 0000000000000001 ffffffff8186ec10
[ 1230.757041]  ffffffff8102f9e0 ffffffff81087e55 0000000000000001 0000000000000000
[ 1230.758370]  ffff88013dc12a80 ffff88013dc12a80 ffffffff8102f9e0 ffff8800a6117d70
[ 1230.759697] Call Trace:
[ 1230.760135]  [] ? do_flush_tlb_all+0x170/0x170
[ 1230.761141]  [] ? smp_call_function_single+0xd5/0x160
[ 1230.762246]  [] ? do_flush_tlb_all+0x170/0x170
[ 1230.763252]  [] ? flush_tlb_mm_range+0x22c/0x300
[ 1230.764307]  [] ? tlb_flush_mmu.part.66+0x29/0x80
[ 1230.765354]  [] ? tlb_finish_mmu+0x3d/0x40
[ 1230.766300]  [] ? unmap_region+0xbe/0x100
[ 1230.767236]  [] ? vma_rb_erase+0x121/0x260
[ 1230.768185]  [] ? do_munmap+0x2ed/0x690
[ 1230.769095]  [] ? vm_munmap+0x36/0x50
[ 1230.769972]  [] ? SyS_munmap+0x5/0x10
[ 1230.770859]  [] ? system_call_fastpath+0x22/0x27
[ 1230.771886] Code: 08 4c 89 ef 48 89 2b 48 89 53 08 48 89 1a e8 aa 68 44 00 4c 39 f5 74 6b f6 43 20 01 74 0f 0f 1f 80 00 00 00 00 f3 90 f6 43 20 01 <75> f8 5b 5d 41 5c 41 5d 41 5e c3 0f 1f 80 00 00 00 00 4c 8d 6d

CPU usage is pegged at 100%.

@WiteWulf could you try?

Moments seems to continue importing images, but very slowly, and DSM is unresponsive.

Edit 2: now, whenever the Moments app starts, it hangs DSM... I can't prevent it from starting.

labrouss commented 2 years ago

After uploading a batch of photos with face detection enabled, the same happens on baremetal running DS3615xs 7.0.1 with an AMD CPU. I get consistent hangs/freezes.

Unfortunately I have no serial console to capture the soft lockup output.

I will try increasing the watchdog timer to 60 and see if this helps:

echo 60 > /proc/sys/kernel/watchdog_thresh
echo "kernel.watchdog_thresh = 60" >> /etc/sysctl.conf
sysctl -p /etc/sysctl.conf

sysctl -a | grep -i watch
kernel.watchdog_thresh = 60

WiteWulf commented 2 years ago

After uploading a batch of photos, the same happens on baremetal. I get consistent hangs/freezes.

Can you confirm:

...and can you post the serial console output when you get a hang/freeze?

Hang/freezes have only been observed so far on virtualisation platforms (such as ESXi and Proxmox), with kernel panics on baremetal.

WiteWulf commented 2 years ago

@WiteWulf could you try ?

Moments seems to be continuing to import images, but very slowly, and DSM is unresponsive.

Edit 2 : now once moments app start, it hangs DSM... I can't block it to start.

I'm on 7.0.1-RC1, so don't have Moments, but I installed Photos (I wasn't using it previously) and uploaded ~250 images to it. CPU usage barely got above 20% (it looks like it's single-threaded), but disk writes were much higher than when I've seen kernel panics from the docker workloads that crashed (which makes me think this isn't related to high disk throughput). The server did not kernel panic or lock up.

OrpheeGT commented 2 years ago

@WiteWulf I confirm Synology Photos does not suffer the same issue... My DSM 7.0.1 RC doesn't have the issue either. It happened on DSM 6.2.4 with Synology Moments.

labrouss commented 2 years ago

@OrpheeGT You get a soft lockup on fileindexd, which is invoked at various times by various applications for indexing.

[ 1230.717458] BUG: soft lockup - CPU#1 stuck for 41s! [fileindexd:22537]

Increase the watchdog threshold to 60 (the maximum, in seconds).

WiteWulf commented 2 years ago

I noticed it was actually fileindexd that was named in the lockup output you posted, so I pointed Universal Search at a ~2TB music folder on my NAS to see if that crashes.

And...BOOM! synoelasticd (another database process) kernel panics:

[77556.729069] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 4
[77556.764160] CPU: 4 PID: 14678 Comm: synoelasticd Tainted: PF          O 3.10.108 #42214
[77556.803544] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[77556.838036]  ffffffff814a2759 ffffffff814a16b1 0000000000000010 ffff880409b08d60
[77556.874402]  ffff880409b08cf8 0000000000000000 0000000000000004 0000000000000001
[77556.910661]  0000000000000004 ffffffff80000001 0000000000000030 ffff8803f4cd4c00
[77556.947092] Call Trace:
[77556.958946]  <NMI>  [<ffffffff814a2759>] ? dump_stack+0xc/0x15
[77556.986979]  [<ffffffff814a16b1>] ? panic+0xbb/0x1df
[77557.011271]  [<ffffffff810a9eb8>] ? watchdog_overflow_callback+0xa8/0xb0
[77557.043922]  [<ffffffff810db7d3>] ? __perf_event_overflow+0x93/0x230
[77557.075077]  [<ffffffff810da612>] ? perf_event_update_userpage+0x12/0xf0
[77557.107901]  [<ffffffff810152a4>] ? intel_pmu_handle_irq+0x1b4/0x340
[77557.138530]  [<ffffffff814a9d06>] ? perf_event_nmi_handler+0x26/0x40
[77557.169348]  [<ffffffff814a944e>] ? do_nmi+0xfe/0x440
[77557.194810]  [<ffffffff814a8a53>] ? end_repeat_nmi+0x1e/0x7e
[77557.222568]  <<EOE>> 
[77557.233041] Rebooting in 3 seconds..
labrouss commented 2 years ago

@WiteWulf Yes, the last thing I saw while running htop was synoelasticd as well.

WiteWulf commented 2 years ago

So far I've seen kernel panics mostly related to:

Elasticsearch (which seems to be the back-end used by the Synology Universal Indexer); the others have been installed by users in docker containers.

OrpheeGT commented 2 years ago

So the title can be renamed: it is not only docker... @WiteWulf, you made it crash on DSM 7?

WiteWulf commented 2 years ago

Yes, Universal Search crashed on DSM7 while indexing a large folder (many large files).

Changed title to reflect not just docker causing issues.

OrpheeGT commented 2 years ago

Synology Photos must work differently, because I added about 20 images to it in a row, as I did with Moments, but had no issue with it...

OrpheeGT commented 2 years ago

I may have done it wrong, but I removed the register_pmu_shim line that @ttg-public asked @WiteWulf about,

then I tried to import 20 files in a row with the Synology Moments app:

[ 20.206998] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0 [ 20.210142] <redpill/smart_shim.c:388> Generating fake SMART values [ 20.215051] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there [ 20.377729] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there [ 20.754611] BTRFS: device label 2021.09.22-14:16:50 v25556 devid 1 transid 383 /dev/md2 [ 20.758828] BTRFS info (device md2): enabling auto syno reclaim space [ 20.761395] BTRFS info (device md2): use ssd allocation scheme [ 20.763715] BTRFS info (device md2): using free space tree [ 20.766100] BTRFS: has skinny extents [ 21.534082] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0 [ 21.535867] <redpill/smart_shim.c:388> Generating fake SMART values [ 21.891207] usbcore: registered new interface driver usblp [ 22.162279] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0 [ 22.163660] <redpill/smart_shim.c:388> Generating fake SMART values [ 22.560571] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0 [ 22.562195] Synotify use 16384 event queue size [ 22.562213] Synotify use 16384 event queue size [ 22.564978] <redpill/smart_shim.c:388> Generating fake SMART values [ 22.567083] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0 [ 22.569133] <redpill/smart_shim.c:388> Generating fake SMART values [ 22.571174] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd5 [ 22.573226] <redpill/smart_shim.c:514> Generating fake WIN_SMART log=0 entries [ 22.653817] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there [ 22.751817] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0 [ 22.753898] <redpill/smart_shim.c:388> Generating fake SMART values [ 22.756290] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd1 [ 22.758341] 
<redpill/smart_shim.c:455> Generating fake SMART thresholds [ 22.764050] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there [ 22.776168] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0 [ 22.778231] <redpill/smart_shim.c:388> Generating fake SMART values [ 22.971983] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there [ 23.136111] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0 [ 23.137513] <redpill/smart_shim.c:388> Generating fake SMART values [ 23.145039] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd1 [ 23.146412] <redpill/smart_shim.c:455> Generating fake SMART thresholds [ 23.152186] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there [ 23.165414] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0 [ 23.166767] <redpill/smart_shim.c:388> Generating fake SMART values [ 23.237426] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there [ 23.239306] iSCSI:target_core_rodsp_server.c:1027:rodsp_server_init RODSP server started, login_key(001132417efd). 
[ 23.316261] iSCSI:extent_pool.c:766:ep_init syno_extent_pool successfully initialized [ 23.355192] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0 [ 23.356547] <redpill/smart_shim.c:388> Generating fake SMART values [ 23.366572] iSCSI:target_core_device.c:617:se_dev_align_max_sectors Rounding down aligned max_sectors from 4294967295 to 4294967288 [ 23.368681] iSCSI:target_core_lunbackup.c:361:init_io_buffer_head 512 buffers allocated, total 2097152 bytes successfully [ 23.373280] Synotify use 16384 event queue size [ 23.381171] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd1 [ 23.382530] <redpill/smart_shim.c:455> Generating fake SMART thresholds [ 23.398635] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there [ 23.412314] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0 [ 23.413674] <redpill/smart_shim.c:388> Generating fake SMART values [ 23.429148] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there [ 23.582194] iSCSI:target_core_file.c:146:fd_attach_hba RODSP plugin for fileio is enabled. [ 23.583646] iSCSI:target_core_file.c:153:fd_attach_hba ODX Token Manager is enabled. [ 23.585700] iSCSI:target_core_multi_file.c:91:fd_attach_hba RODSP plugin for multifile is enabled. [ 23.587375] iSCSI:target_core_ep.c:786:ep_attach_hba RODSP plugin for epio is enabled. [ 23.588717] iSCSI:target_core_ep.c:793:ep_attach_hba ODX Token Manager is enabled. 
[ 23.835435] usbcore: registered new interface driver usbhid [ 23.836477] usbhid: USB HID core driver [ 23.912684] input: VMware VMware Virtual USB Mouse as /devices/pci0000:00/0000:00:11.0/0000:02:00.0/usb2/2-1/2-1:1.0/input/input0 [ 23.914751] hid-generic 0003:0E0F:0003.0001: input: USB HID v1.10 Mouse [VMware VMware Virtual USB Mouse] on usb-0000:02:00.0-1/input0 [ 24.459275] loop: module loaded [ 24.469102] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there [ 24.988731] warning: `nginx' uses 32-bit capabilities (legacy support in use) [ 25.587723] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there [ 25.590445] ata5.00: configured for UDMA/100 [ 25.591176] ata5: EH complete [ 25.592038] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd8 [ 25.593386] <redpill/smart_shim.c:654> Attempted ATA_SMART_ENABLE modification! [ 25.595248] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there

Xpen_624 login: [ 25.623023] ata5.00: configured for UDMA/100 [ 25.623774] ata5: EH complete [ 25.624479] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there [ 29.023868] <redpill/intercept_execve.c:87> Blocked /usr/syno/bin/syno_pstore_collect from running [ 29.232232] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0 [ 29.233668] <redpill/smart_shim.c:388> Generating fake SMART values [ 29.338667] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0 [ 29.340481] <redpill/smart_shim.c:388> Generating fake SMART values [ 29.343761] <redpill/memory_helper.c:18> Disabling memory protection for page(s) at ffffffffa09f4c50+12/1 (<<ffffffffa09f4000) [ 29.346380] <redpill/override_symbol.c:244> Obtaining lock for [ 29.348127] <redpill/override_symbol.c:244> Writing original code to [ 29.349960] <redpill/override_symbol.c:244> Released lock for [ 29.351690] <redpill/override_symbol.c:219> Obtaining lock for [ 29.353406] <redpill/override_symbol.c:219> Writing trampoline code to [ 29.355276] <redpill/override_symbol.c:219> Released lock for [ 29.357033] <redpill/bios_hwcap_shim.c:66> proxying GetHwCapability(id=3)->support => real=1 [org_fout=0, ovs_fout=0] [ 29.449363] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there [ 29.458673] synobios write K to /dev/ttyS1 failed [ 29.474660] <redpill/bios_shims_collection.c:43> mfgBIOS: nullify zero-int for VTK_SET_HDD_ACT_LED [ 29.477850] <redpill/override_symbol.c:244> Obtaining lock for [ 29.479807] <redpill/override_symbol.c:244> Writing original code to [ 29.481891] <redpill/override_symbol.c:244> Released lock for [ 29.483826] <redpill/override_symbol.c:219> Obtaining lock for [ 29.485779] <redpill/override_symbol.c:219> Writing trampoline code to [ 29.487895] <redpill/override_symbol.c:219> Released lock for [ 29.489816] <redpill/bios_hwcap_shim.c:66> proxying GetHwCapability(id=2)->support => real=1 
[org_fout=0, ovs_fout=0] [ 29.501901] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED [ 29.509993] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED [ 29.513433] <redpill/bios_shims_collection.c:42> mfgBIOS: nullify zero-int for VTK_SET_PHY_LED [ 29.515742] <redpill/bios_shims_collection.c:36> mfgBIOS: nullify zero-int for VTK_SET_PWR_LED [ 29.520330] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED [ 29.526123] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED [ 29.537159] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED [ 29.544131] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED [ 29.550320] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED [ 29.556763] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED [ 29.563306] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED [ 29.569841] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED [ 29.579919] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED [ 29.587099] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED [ 29.602319] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED [ 29.627963] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED [ 29.636158] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED [ 30.369451] init: syno-check-disk-compatibility main process (12582) terminated with status 255 [ 30.428262] ip_tables: (C) 2000-2006 Netfilter Core Team [ 30.435840] nf_conntrack version 0.5.0 (16384 buckets, 65536 max) [ 30.507642] ip6_tables: (C) 2000-2006 Netfilter Core Team [ 30.555223] aufs 3.10.x-20141110 [ 30.565205] 
Bridge firewalling registered [ 32.920553] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready [ 37.436982] Synotify use 16384 event queue size [ 37.438642] Synotify use 16384 event queue size [ 38.686780] <redpill/bios_shims_collection.c:44> mfgBIOS: nullify zero-int for VTK_GET_MICROP_ID [ 39.145534] <redpill/override_symbol.c:244> Obtaining lock for [ 39.148066] <redpill/override_symbol.c:244> Writing original code to [ 39.150551] <redpill/override_symbol.c:244> Released lock for [ 39.152899] <redpill/override_symbol.c:219> Obtaining lock for [ 39.155222] <redpill/override_symbol.c:219> Writing trampoline code to [ 39.157757] <redpill/override_symbol.c:219> Released lock for [ 39.159684] <redpill/bios_hwcap_shim.c:66> proxying GetHwCapability(id=3)->support => real=1 [org_fout=0, ovs_fout=0] [ 40.327575] Synotify use 16384 event queue size [ 40.329555] Synotify use 16384 event queue size [ 41.392997] init: synocontentextractd main process ended, respawning [ 43.327169] <redpill/override_symbol.c:244> Obtaining lock for [ 43.329183] <redpill/override_symbol.c:244> Writing original code to [ 43.331258] <redpill/override_symbol.c:244> Released lock for [ 43.333175] <redpill/override_symbol.c:219> Obtaining lock for [ 43.335111] <redpill/override_symbol.c:219> Writing trampoline code to [ 43.337222] <redpill/override_symbol.c:219> Released lock for [ 43.339135] <redpill/bios_hwcap_shim.c:66> proxying GetHwCapability(id=3)->support => real=1 [org_fout=0, ovs_fout=0] [ 48.883979] Synotify use 16384 event queue size [ 51.178339] <redpill/bios_shims_collection.c:44> mfgBIOS: nullify zero-int for VTK_GET_MICROP_ID [ 51.536832] <redpill/override_symbol.c:244> Obtaining lock for [ 51.539283] <redpill/override_symbol.c:244> Writing original code to [ 51.541795] <redpill/override_symbol.c:244> Released lock for [ 51.544112] <redpill/override_symbol.c:219> Obtaining lock for [ 51.546456] <redpill/override_symbol.c:219> Writing trampoline code to [ 51.549007] 
<redpill/override_symbol.c:219> Released lock for [ 51.551336] <redpill/bios_hwcap_shim.c:66> proxying GetHwCapability(id=3)->support => real=1 [org_fout=0, ovs_fout=0] [ 82.539367] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0 [ 82.543593] <redpill/smart_shim.c:388> Generating fake SMART values [ 139.927989] BUG: soft lockup - CPU#1 stuck for 41s! [fileindexd:12879] [ 139.929144] Modules linked in: bridge stp aufs macvlan veth xt_conntrack xt_addrtype nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipt_MASQUERADE xt_REDIRECT xt_nat iptable_nat nf_nat_ipv4 nf_nat xt_recent xt_iprange xt_limit xt_state xt_tcpudp xt_multiport xt_LOG nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter ip_tables x_tables cifs udf isofs loop hid_generic tcm_loop(O) iscsi_target_mod(O) target_core_ep(O) target_core_multi_file(O) target_core_file(O) target_core_iblock(O) target_core_mod(O) syno_extent_pool(PO) rodsp_ep(O) usbhid hid usblp bromolow_synobios(PO) exfat(O) btrfs synoacl_vfs(PO) zlib_deflate hfsplus md4 hmac bnx2x(O) libcrc32c mdio mlx5_core(O) mlx4_en(O) mlx4_core(O) mlx_compat(O) compat(O) qede(O) qed(O) atlantic(O) tn40xx(O) i40e(O) ixgbe(O) be2net(O) igb(O) i2c_algo_bit e1000e(O) dca vxlan fuse vfat fat crc32c_intel aesni_intel glue_helper lrw gf128mul ablk_helper arc4 cryptd ecryptfs sha256_generic sha1_generic ecb aes_x86_64 authenc des_generic ansi_cprng cts md5 cbc cpufreq_conservative cpufreq_powersave cpufreq_performance cpufreq_ondemand mperf processor thermal_sys cpufreq_stats freq_table dm_snapshot crc_itu_t crc_ccitt quota_v2 quota_tree psnap p8022 llc sit tunnel4 ip_tunnel ipv6 zram(C) sg etxhci_hcd usb_storage xhci_hcd uhci_hcd ehci_pci ehci_hcd usbcore usb_common redpill(OF) [last unloaded: bromolow_synobios] [ 139.952060] CPU: 1 PID: 12879 Comm: fileindexd Tainted: PF C O 3.10.105 #25556 [ 139.953333] Hardware name: VMware, Inc. 
VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020 [ 139.955060] task: ffff8801327bd800 ti: ffff88011e8b4000 task.ti: ffff88011e8b4000 [ 139.956274] RIP: 0010:[] [] generic_exec_single+0x6e/0xe0 [ 139.957725] RSP: 0018:ffff88011e8b7cc0 EFLAGS: 00000202 [ 139.958592] RAX: 00000000000008fb RBX: 0000000000000001 RCX: 0000000000000014 [ 139.959747] RDX: ffffffff816057c8 RSI: 00000000000000fb RDI: ffffffff816057c8 [ 139.960903] RBP: ffff88013dc12a80 R08: ffff88011448fe58 R09: 0000000000000000 [ 139.962058] R10: 0000000000000022 R11: ffff8800b49decc0 R12: ffff8800b0596890 [ 139.963214] R13: ffff88011a1c4170 R14: ffffffff81108d4d R15: ffff88011e8b7da0 [ 139.964374] FS: 00007fba3cf4c700(0000) GS:ffff88013dd00000(0000) knlGS:0000000000000000 [ 139.965669] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 139.966602] CR2: 00007fba3ce4b000 CR3: 000000011b332000 CR4: 00000000001607e0 [ 139.967762] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 139.968916] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 139.970047] Stack: [ 139.970381] 0000000000000000 ffff88011e8b7d50 0000000000000001 ffffffff8186ec10 [ 139.971675] ffffffff8102f9e0 ffffffff81087e55 0000000000000001 0000000000000000 [ 139.972974] ffff88013dc12a80 ffff88013dc12a80 ffffffff8102f9e0 ffff88011e8b7d70 [ 139.974283] Call Trace: [ 139.974715] [] ? do_flush_tlb_all+0x170/0x170 [ 139.975699] [] ? smp_call_function_single+0xd5/0x160 [ 139.976778] [] ? do_flush_tlb_all+0x170/0x170 [ 139.977758] [] ? flush_tlb_mm_range+0x22c/0x300 [ 139.978772] [] ? tlb_flush_mmu.part.66+0x29/0x80 [ 139.979793] [] ? tlb_finish_mmu+0x3d/0x40 [ 139.980718] [] ? unmap_region+0xbe/0x100 [ 139.981630] [] ? vma_rb_erase+0x121/0x260 [ 139.982555] [] ? do_munmap+0x2ed/0x690 [ 139.983437] [] ? vm_munmap+0x36/0x50 [ 139.984293] [] ? SyS_munmap+0x5/0x10 [ 139.985165] [] ? 
system_call_fastpath+0x22/0x27 [ 139.986173] Code: 08 4c 89 ef 48 89 2b 48 89 53 08 48 89 1a e8 aa 68 44 00 4c 39 f5 74 6b f6 43 20 01 74 0f 0f 1f 80 00 00 00 00 f3 90 f6 43 20 01 <75> f8 5b 5d 41 5c 41 5d 41 5e c3 0f 1f 80 00 00 00 00 4c 8d 6d [ 191.885952] BUG: soft lockup - CPU#1 stuck for 44s! [fileindexd:12879] [ 191.887102] Modules linked in: bridge stp aufs macvlan veth xt_conntrack xt_addrtype nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipt_MASQUERADE xt_REDIRECT xt_nat iptable_nat nf_nat_ipv4 nf_nat xt_recent xt_iprange xt_limit xt_state xt_tcpudp xt_multiport xt_LOG nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter ip_tables x_tables cifs udf isofs loop hid_generic tcm_loop(O) iscsi_target_mod(O) target_core_ep(O) target_core_multi_file(O) target_core_file(O) target_core_iblock(O) target_core_mod(O) syno_extent_pool(PO) rodsp_ep(O) usbhid hid usblp bromolow_synobios(PO) exfat(O) btrfs synoacl_vfs(PO) zlib_deflate hfsplus md4 hmac bnx2x(O) libcrc32c mdio mlx5_core(O) mlx4_en(O) mlx4_core(O) mlx_compat(O) compat(O) qede(O) qed(O) atlantic(O) tn40xx(O) i40e(O) ixgbe(O) be2net(O) igb(O) i2c_algo_bit e1000e(O) dca vxlan fuse vfat fat crc32c_intel aesni_intel glue_helper lrw gf128mul ablk_helper arc4 cryptd ecryptfs sha256_generic sha1_generic ecb aes_x86_64 authenc des_generic ansi_cprng cts md5 cbc cpufreq_conservative cpufreq_powersave cpufreq_performance cpufreq_ondemand mperf processor thermal_sys cpufreq_stats freq_table dm_snapshot crc_itu_t crc_ccitt quota_v2 quota_tree psnap p8022 llc sit tunnel4 ip_tunnel ipv6 zram(C) sg etxhci_hcd usb_storage xhci_hcd uhci_hcd ehci_pci ehci_hcd usbcore usb_common redpill(OF) [last unloaded: bromolow_synobios] [ 191.910526] CPU: 1 PID: 12879 Comm: fileindexd Tainted: PF C O 3.10.105 #25556 [ 191.911840] Hardware name: VMware, Inc. 
VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020 [ 191.913618] task: ffff8801327bd800 ti: ffff88011e8b4000 task.ti: ffff88011e8b4000 [ 191.914869] RIP: 0010:[] [] generic_exec_single+0x6a/0xe0 [ 191.916330] RSP: 0018:ffff88011e8b7cc0 EFLAGS: 00000202 [ 191.917225] RAX: 00000000000008fb RBX: 0000000000000001 RCX: 0000000000000014 [ 191.918416] RDX: ffffffff816057c8 RSI: 00000000000000fb RDI: ffffffff816057c8 [ 191.919609] RBP: ffff88013dc12a80 R08: ffff88011448fe58 R09: 0000000000000000 [ 191.920803] R10: 0000000000000022 R11: ffff8800b49decc0 R12: ffff8800b0596890 [ 191.921996] R13: ffff88011a1c4170 R14: ffffffff81108d4d R15: ffff88011e8b7da0 [ 191.923191] FS: 00007fba3cf4c700(0000) GS:ffff88013dd00000(0000) knlGS:0000000000000000 [ 191.924546] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 191.925512] CR2: 00007fba3ce4b000 CR3: 000000011b332000 CR4: 00000000001607e0 [ 191.926717] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 191.927917] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 191.929108] Stack: [ 191.929461] 0000000000000000 ffff88011e8b7d50 0000000000000001 ffffffff8186ec10 [ 191.930797] ffffffff8102f9e0 ffffffff81087e55 0000000000000001 0000000000000000 [ 191.932131] ffff88013dc12a80 ffff88013dc12a80 ffffffff8102f9e0 ffff88011e8b7d70 [ 191.933483] Call Trace: [ 191.933912] [] ? do_flush_tlb_all+0x170/0x170 [ 191.934921] [] ? smp_call_function_single+0xd5/0x160 [ 191.936030] [] ? do_flush_tlb_all+0x170/0x170 [ 191.937038] [] ? flush_tlb_mm_range+0x22c/0x300 [ 191.938076] [] ? tlb_flush_mmu.part.66+0x29/0x80 [ 191.939129] [] ? tlb_finish_mmu+0x3d/0x40 [ 191.940083] [] ? unmap_region+0xbe/0x100 [ 191.941023] [] ? vma_rb_erase+0x121/0x260 [ 191.941975] [] ? do_munmap+0x2ed/0x690 [ 191.942886] [] ? vm_munmap+0x36/0x50 [ 191.943767] [] ? SyS_munmap+0x5/0x10 [ 191.944648] [] ? 
system_call_fastpath+0x22/0x27 [ 191.945684] Code: c6 48 89 5d 08 4c 89 ef 48 89 2b 48 89 53 08 48 89 1a e8 aa 68 44 00 4c 39 f5 74 6b f6 43 20 01 74 0f 0f 1f 80 00 00 00 00 f3 90 43 20 01 75 f8 5b 5d 41 5c 41 5d 41 5e c3 0f 1f 80 00 00 00

So it still behaves the same for me...

labrouss commented 2 years ago

I confirm that Universal Search on the DS3615 image freezes the system on both physical and virtual machines, even with the latest code. It looks like the redpill module is crashing, as I get ATA disk and btrfs errors.

On DS918, it does not have the same effect.

WiteWulf commented 2 years ago

@labrouss can you confirm that your baremetal/physical system is freezing and not kernel panicking? All other reports I've had indicate kernel panic followed by reboot on baremetal, and freeze/lockup on virtual. It's important to be clear what you're experiencing here.

WiteWulf commented 2 years ago

As per @ttg-public's suggestion on the forum ("Can you try deleting the line with "register_pmu_shim" from redpillmain.c (in the init(void) function) and rebuilding the kernel module?"), I rebuilt the loader without the PMU shim and am still seeing kernel panics when loading my influxdb docker container:
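For anyone else wanting to reproduce this test, the edit amounts to deleting one call before rebuilding. A minimal sketch, demonstrated on a scratch copy (the file name and the surrounding code here are illustrative, not the actual repo contents; adjust to your checkout):

```shell
# Hypothetical demo file standing in for the real source; the actual
# file/line in the repo may differ.
cat > /tmp/redpill_main_demo.c <<'EOF'
int init(void) {
    register_pmu_shim();   /* line to delete */
    return 0;
}
EOF

# Drop every line containing the register_pmu_shim call.
sed -i '/register_pmu_shim/d' /tmp/redpill_main_demo.c

# Verify the call is gone (grep -c prints 0 when nothing matches).
grep -c register_pmu_shim /tmp/redpill_main_demo.c
```

After making the same change in the real source, rebuild the module and repack the loader as you normally would.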

[  338.055690] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 6
[  338.091670] CPU: 6 PID: 21097 Comm: containerd-shim Tainted: PF          O 3.10.108 #42214
[  338.132114] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[  338.168045]  ffffffff814a2759 ffffffff814a16b1 0000000000000010 ffff880409b88d60
[  338.205031]  ffff880409b88cf8 0000000000000000 0000000000000006 0000000000000001
[  338.241507]  0000000000000006 ffffffff80000001 0000000000000030 ffff8803f4d4dc00
[  338.278173] Call Trace:
[  338.290006]  <NMI>  [<ffffffff814a2759>] ? dump_stack+0xc/0x15
[  338.318839]  [<ffffffff814a16b1>] ? panic+0xbb/0x1df
[  338.342727]  [<ffffffff810a9eb8>] ? watchdog_overflow_callback+0xa8/0xb0
[  338.375043]  [<ffffffff810db7d3>] ? __perf_event_overflow+0x93/0x230
[  338.405804]  [<ffffffff810da612>] ? perf_event_update_userpage+0x12/0xf0
[  338.438356]  [<ffffffff810152a4>] ? intel_pmu_handle_irq+0x1b4/0x340
[  338.469218]  [<ffffffff814a9d06>] ? perf_event_nmi_handler+0x26/0x40
[  338.500130]  [<ffffffff814a944e>] ? do_nmi+0xfe/0x440
[  338.525060]  [<ffffffff814a8a53>] ? end_repeat_nmi+0x1e/0x7e
[  338.552408]  <<EOE>>
[  338.562333] Rebooting in 3 seconds..
ilovepancakes95 commented 2 years ago

I confirm that the universal search on the DS3615 image freezes the system on both physical and virtual even with the latest code. Looks like the redpill module is crashing as i get ATA disk and btrfs errors.

On DS918, It does not have the same effect.

For the record, I am not experiencing this issue. Using Universal Search and having it index the entire home folder does not result in any issues. I don't have a lot of data in the home folder, as this is a test VM, so I'm not sure if your problem only occurs with higher I/O load and lots of data to index? I am running DSM 7 41222 RedPill 3615xs on ESXi v7.

OrpheeGT commented 2 years ago

I confirm that the universal search on the DS3615 image freezes the system on both physical and virtual even with the latest code. Looks like the redpill module is crashing as i get ATA disk and btrfs errors. On DS918, It does not have the same effect.

For the record, I am not experiencing this issue. Using universal search/having it index the entire home folder does not result in any issues. I don't have a lot of data in home folder, as this is a test VM so not sure if your problem only occurs with higher I/O load and lots of data to index? I am running DSM 7 41222 RedPill 3615xs on ESXi v7.

Could you just try to install docker and an influxdb container on it?

WiteWulf commented 2 years ago

For the record, I am not experiencing this issue. Using universal search/having it index the entire home folder does not result in any issues. I don't have a lot of data in home folder, as this is a test VM so not sure if your problem only occurs with higher I/O load and lots of data to index? I am running DSM 7 41222 RedPill 3615xs on ESXi v7.

I found that simply indexing a few photos didn't cause problems, but pointing it at a very large music folder (2TB, 160k files) caused a kernel panic after a few minutes.

labrouss commented 2 years ago

Both my baremetal and virtual systems are not panicking and not rebooting, but are hitting soft lockups. The issue with Universal Search is consistent and has been verified numerous times.

I have a VM running an old image of the DS3615 with RedPill v0.5-git-23578eb. Universal Search does not have the same effect there; it works as expected.

You can find your version of the redpill module by running: dmesg | grep RedPill

You can also do a "git checkout 23578eb", recompile, inject the module into your rd.gz, and check.
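If you just want to extract the version/commit cleanly from the dmesg line, something like this works (the sample log line below is made up for the demo; on a real box you'd pipe dmesg itself):

```shell
# Illustrative: pull the RedPill version tag out of a dmesg-style line.
# On a real system you would run:  dmesg | grep -i redpill
sample='[    5.123456] <redpill> RedPill v0.5-git-23578eb loading'
echo "$sample" | grep -o 'v[0-9.]*-git-[0-9a-f]*'   # → v0.5-git-23578eb
```

The commit hash it prints is what you'd compare against the commits being discussed here (e.g. 23578eb vs 3474d9b).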

WiteWulf commented 2 years ago

You can also do a "git checkout 23578eb" , recompile, inject the module into your rd.gz and check

Interesting that that commit is related to the PMU shim, notably bug fixes and enabling it by default.

So, to review:

Thanks for the info!

WiteWulf commented 2 years ago

The latest redpill code (3474d9b) actually seems more prone to kernel panicking than previous versions. My influxdb container was guaranteed to crash the system every time I started it, but the others I had were typically stable and non-problematic. I'm now seeing immediate kernel panics when starting a mysql container that previously didn't cause any problems.

I've gone back to a slightly older build (021ed51), and mysql (operating as a database backend for librenms) starts and runs without problems.

(FYI, I'm not using 021ed51 for any reason other than that I already had a USB stick with it on. I don't perceive it to be more stable than any other commit.)

WiteWulf commented 2 years ago

As an experiment, I took the HDDs out of my Gen8, put an old spare drive in, booted it from the stick with redpill 021ed51 on it, and did a fresh install. I installed a load of docker containers (mysql, mariadb, influxdb and nginx) and pretty soon mysqld triggered a kernel panic:

Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 4
[ 1421.492280] CPU: 4 PID: 28085 Comm: mysqld Tainted: PF        C O 3.10.108 #42214
[ 1421.529738] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[ 1421.564106]  ffffffff814a2759 ffffffff814a16b1 0000000000000010 ffff880409b08d60
[ 1421.599651]  ffff880409b08cf8 0000000000000000 0000000000000004 0000000000000001
[ 1421.635065]  0000000000000004 ffffffff80000001 0000000000000030 ffff8803f4cd4c00
[ 1421.671546] Call Trace:
[ 1421.683853]  <NMI>  [<ffffffff814a2759>] ? dump_stack+0xc/0x15
[ 1421.712185]  [<ffffffff814a16b1>] ? panic+0xbb/0x1df
[ 1421.737554]  [<ffffffff810a9eb8>] ? watchdog_overflow_callback+0xa8/0xb0
[ 1421.771289]  [<ffffffff810db7d3>] ? __perf_event_overflow+0x93/0x230
[ 1421.802524]  [<ffffffff810da612>] ? perf_event_update_userpage+0x12/0xf0
[ 1421.838746]  [<ffffffff810152a4>] ? intel_pmu_handle_irq+0x1b4/0x340
[ 1421.870141]  [<ffffffff814a9d06>] ? perf_event_nmi_handler+0x26/0x40
[ 1421.901388]  [<ffffffff814a944e>] ? do_nmi+0xfe/0x440
[ 1421.925556]  [<ffffffff814a8a53>] ? end_repeat_nmi+0x1e/0x7e
[ 1421.953482]  <<EOE>> 
[ 1422.998713] Shutting down cpus with NMI
[ 1423.017980] Rebooting in 3 seconds..

I basically wanted to establish whether or not the crashes were related to data or configuration carried over from my previous 6.2.3 install, and it doesn't look like it is.

WiteWulf commented 2 years ago

Quick follow-up on the above: I had to push the system to make it crash. Influxdb didn't do it straight away as it does on my "live" server, nor did Universal Search indexing a few GBs of data. Once I booted back into my live system it crashed again while starting docker, then came back up scrubbing the disks. While the disks were scrubbing, any attempt to start docker caused another kernel panic, but after leaving the scrub to finish overnight I was able to start everything except the influxdb container without trouble.

This suggests to me that the problem is related to load somehow. It's nothing as obvious as CPU load, or disk throughput, as I can quite happily stream 4k video and transcode 1080p content with Plex on this system. It's something harder to pin down.
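For what it's worth, every trace in this thread points the same way: the panic is raised by the kernel's hard-lockup detector (the NMI watchdog), which programs a PMU performance counter to deliver NMIs, hence the watchdog_overflow_callback, __perf_event_overflow and intel_pmu_handle_irq frames. That would fit a PMU-related cause rather than raw CPU or disk load. A read-only way to check the watchdog's settings on a box like this (standard procfs paths on 3.10-era kernels, guarded in case a knob isn't exposed):

```shell
# Print the lockup watchdog knobs if this kernel exposes them.
for knob in nmi_watchdog watchdog_thresh softlockup_panic; do
    f="/proc/sys/kernel/$knob"
    if [ -r "$f" ]; then
        printf '%s = %s\n' "$knob" "$(cat "$f")"
    else
        printf '%s: not exposed on this kernel\n' "$knob"
    fi
done
```

Note that setting nmi_watchdog to 0 disables the detector entirely, which would mask the symptom rather than fix whatever the loader is tripping over, so treat these as diagnostics, not a workaround.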

labrouss commented 2 years ago

I see you are running 7.0.1 42214; can you try with a 7.0 loader instead? Note that, as I found when testing, it's not possible to recover to an earlier version, so you will have to reinstall.

labrouss commented 2 years ago

The universal search issue seems to be related specifically to 7.0.1 and 3615xs. I'm now testing on 7.0-41222 and everything looks more stable.

WiteWulf commented 2 years ago

I’ve had kernel panics related to docker in 6.2.4 and 7.0.1-RC1, so I see no reason to assume it would be different on 7.0. What would be interesting to test is a redpill version of 6.2.3, as I had no problems at all on that release with Jun’s loader.

labrouss commented 2 years ago

Since only 7.0-41222 and 6.2.4 are directly linked to the redpill loader, those are the versions tested by TTG, so better to stick with 7.0-41222 for the moment.

WiteWulf commented 2 years ago

I see your point, but Synology never took 7.0 out of beta for ds3615xs, whereas 7.0.1 has a release candidate already. I’m trying to balance what TTG supports with the quality of Synology’s own code. I don’t think there will ever be a production version of 7.0 for ds3615xs from Synology.

But this afternoon I'll try putting the 7.0 beta on this system and see if I can crash it.

WiteWulf commented 2 years ago

There you go, mysqld kernel panic on 7.0-beta (41222) with the latest redpill code (3474d9b). This is with barely any load on the system, it's a fresh install with no data.

Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 7
[  691.815225] CPU: 7 PID: 24049 Comm: mysqld Tainted: PF        C O 3.10.108 #41222
[  691.851145] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[  691.885067]  ffffffff8149d161 ffffffff8149c0b9 0000000000000010 ffff880409bc8d68
[  691.920497]  ffff880409bc8d00 0000000000000000 0000000000000007 0000000000000001
[  691.955407]  0000000000000007 ffffffff80000001 0000000000000030 ffff8803f4dafc00
[  691.990786] Call Trace:
[  692.002456]  <NMI>  [<ffffffff8149d161>] ? dump_stack+0xc/0x15
[  692.030418]  [<ffffffff8149c0b9>] ? panic+0xbb/0x1df
[  692.054250]  [<ffffffff810a95f8>] ? watchdog_overflow_callback+0xa8/0xb0
[  692.086330]  [<ffffffff810dabb3>] ? __perf_event_overflow+0x93/0x230
[  692.116773]  [<ffffffff810d9a02>] ? perf_event_update_userpage+0x12/0xf0
[  692.148655]  [<ffffffff81015194>] ? intel_pmu_handle_irq+0x1b4/0x340
[  692.178890]  [<ffffffff814a4cc6>] ? perf_event_nmi_handler+0x26/0x40
[  692.209301]  [<ffffffff814a444e>] ? do_nmi+0xfe/0x400
[  692.233623]  [<ffffffff814a3a53>] ? end_repeat_nmi+0x1e/0x7e
[  692.260552]  <<EOE>> 
[  693.307534] Shutting down cpus with NMI
[  693.326761] Rebooting in 3 seconds..
labrouss commented 2 years ago

Wow! That's bad news then...

ilovepancakes95 commented 2 years ago

Could you just try to install docker and an influxdb container on it?

No problem so far. Installed Docker, ran it, created and ran an influxdb container, and it has been running with no crashes for over a day. DSM still shows an uptime of 11 days, so that didn't crash either.

labrouss commented 2 years ago

Can you please list which modules you all have loaded? I'm suspecting that the tg3 and libphy modules I'm loading might cause the panic in my case.

WiteWulf commented 2 years ago

I'm not loading any additional modules, and people on ESXi/proxmox/etc. won't be, either.

However, here you go:

$ lsmod
Module                  Size  Used by
nfnetlink               3645  1 
cmac                    2437  0 
bridge                 55125  0 
stp                     1533  1 bridge
aufs                  179875  424 
macvlan                 8936  0 
veth                    4087  0 
xt_conntrack            2953  1 
xt_addrtype             2733  1 
ipt_MASQUERADE          2002  6 
xt_REDIRECT             1838  0 
xt_nat                  1785  4 
fuse                   77817  0 
xt_policy               2362  1 
xfrm6_mode_transport     1258  0 
xfrm4_mode_transport     1194  0 
xfrm6_mode_tunnel       1584  0 
xfrm4_mode_tunnel       2276  0 
xfrm6_mode_beet         1706  0 
xfrm4_mode_beet         1755  0 
deflate                 1799  0 
authencesn              7075  0 
ipcomp6                 1908  0 
ipcomp                  1908  0 
xfrm6_tunnel            3247  1 ipcomp6
xfrm4_tunnel            1737  0 
tunnel6                 1968  1 xfrm6_tunnel
esp6                    5721  0 
esp4                    6069  0 
ah6                     5056  0 
ah4                     5872  0 
xfrm_ipcomp             3676  2 ipcomp,ipcomp6
af_key                 25944  0 
xfrm_user              21964  3 
xfrm_algo               4424  7 ah4,ah6,esp4,esp6,af_key,xfrm_user,xfrm_ipcomp
l2tp_ppp               14700  0 
l2tp_core              19555  1 l2tp_ppp
ppp_deflate             3714  0 
bsd_comp                4914  0 
ppp_mppe                6347  0 
xt_TCPMSS               3319  1 
iptable_mangle          1432  1 
pppox                   1586  1 l2tp_ppp
ppp_async               6970  0 
ppp_generic            17405  6 l2tp_ppp,pppox,bsd_comp,ppp_mppe,ppp_async,ppp_deflate
slhc                    5235  1 ppp_generic
8021q                  16612  0 
vhost_scsi             29730  1 
vhost                  28410  1 vhost_scsi
tcm_loop               17867  1 
iscsi_target_mod      331057  1 
target_core_ep         56513  2 
target_core_multi_file    37241  1 
target_core_file       58183  1 
target_core_iblock     24741  1 
target_core_mod       959025  21 target_core_iblock,target_core_multi_file,vhost,iscsi_target_mod,target_core_ep,target_core_file,vhost_scsi,tcm_loop
syno_extent_pool     1369025  0 
rodsp_ep               87591  3 target_core_multi_file,syno_extent_pool,target_core_file
udf                    80121  0 
isofs                  32684  0 
loop                   16680  0 
quota_v2                3759  2 
quota_tree              7681  1 quota_v2
synoacl_vfs            17877  1 
raid456               102351  2 
async_raid6_recov       6182  1 raid456
async_memcpy            1726  2 raid456,async_raid6_recov
async_pq                4708  2 raid456,async_raid6_recov
async_xor               3585  3 async_pq,raid456,async_raid6_recov
xor                    10776  1 async_xor
async_tx                2038  5 async_pq,raid456,async_xor,async_memcpy,async_raid6_recov
raid6_pq               98028  3 async_pq,raid456,async_raid6_recov
raid0                  11872  1 
iptable_nat             3078  1 
nf_nat_ipv4             3192  1 iptable_nat
nf_nat                 13397  5 ipt_MASQUERADE,nf_nat_ipv4,xt_nat,xt_REDIRECT,iptable_nat
nf_conntrack_ipv6       7491  0 
nf_defrag_ipv6         24937  1 nf_conntrack_ipv6
ip6table_filter         1308  1 
ip6_tables             14356  1 ip6table_filter
xt_recent               8340  0 
xt_iprange              2384  0 
xt_limit                1761  5 
xt_state                1143  0 
xt_tcpudp               3023  23 
xt_multiport            2278  0 
xt_LOG                 11696  0 
nf_conntrack_ipv4      12492  2 
nf_defrag_ipv4          1219  1 nf_conntrack_ipv4
nf_conntrack           58213  8 ipt_MASQUERADE,nf_nat,xt_state,nf_nat_ipv4,xt_conntrack,iptable_nat,nf_conntrack_ipv4,nf_conntrack_ipv6
iptable_filter          1368  1 
ip_tables              14742  3 iptable_filter,iptable_mangle,iptable_nat
x_tables               16556  19 ip6table_filter,xt_iprange,xt_policy,xt_recent,ip_tables,xt_tcpudp,ipt_MASQUERADE,xt_limit,xt_state,xt_conntrack,xt_LOG,xt_nat,xt_multiport,iptable_filter,xt_TCPMSS,xt_REDIRECT,iptable_mangle,ip6_tables,xt_addrtype
aesni_intel            43853  0 
glue_helper             4073  1 aesni_intel
lrw                     3349  1 aesni_intel
gf128mul                5386  1 lrw
ablk_helper             1756  1 aesni_intel
bromolow_synobios      70388  0 
hid_generic             1097  0 
usbhid                 26503  0 
hid                    79391  2 hid_generic,usbhid
usblp                  12378  0 
bnx2x                1412964  0 
mdio                    3373  1 bnx2x
mlx5_core             518875  0 
mlx4_en               114745  0 
mlx4_core             287202  1 mlx4_en
mlx_compat              6499  3 mlx4_en,mlx4_core,mlx5_core
qede                  115603  0 
qed                   797543  1 qede
atlantic_v2           175762  0 
atlantic              167366  0 
tn40xx                 31053  0 
i40e                  350790  0 
ixgbe                 274157  0 
be2net                117652  0 
i2c_algo_bit            5168  0 
igb                   180080  0 
dca                     4576  2 igb,ixgbe
e1000e                212512  0 
sg                     25890  0 
dm_snapshot            26701  0 
crc_itu_t               1275  3 udf,atlantic_v2,atlantic
crc_ccitt               1275  1 ppp_async
psnap                   1725  0 
p8022                    987  0 
llc                     3409  4 stp,p8022,psnap,bridge
zlib_deflate           19996  2 deflate,ppp_deflate
libcrc32c                946  1 bnx2x
hfsplus                91482  0 
md4                     3729  0 
hmac                    2705  0 
sit                    14636  0 
tunnel4                 2069  2 sit,xfrm4_tunnel
ipv6                  303677  172 ah6,sit,esp6,rodsp_ep,xfrm6_mode_beet,nf_defrag_ipv6,l2tp_core,xfrm6_tunnel,xfrm6_mode_tunnel,ipcomp6,nf_conntrack_ipv6
flashcache_syno       227637  1 
flashcache             78870  0 
syno_flashcache_control      943  2 flashcache_syno,flashcache
dm_mod                 70288  11 flashcache_syno,flashcache,dm_snapshot
crc32c_intel           14022  1 
cryptd                  7165  2 aesni_intel,ablk_helper
arc4                    1880  0 
sha256_generic          9668  0 
sha1_generic            2150  0 
ecb                     1857  0 
aes_x86_64              7279  1 aesni_intel
authenc                 6228  0 
des_generic            15891  0 
ansi_cprng              3748  0 
cts                     4040  0 
md5                     2481  0 
cbc                     2456  0 
cpufreq_powersave       1158  0 
cpufreq_performance     1194  8 
acpi_cpufreq            9670  0 
mperf                   1083  1 acpi_cpufreq
processor              29199  9 acpi_cpufreq
thermal_sys            18317  1 processor
cpufreq_stats           3441  0 
freq_table              4692  2 cpufreq_stats,acpi_cpufreq
vxlan                  17560  0 
ip_tunnel              11368  2 sit,vxlan
etxhci_hcd            135146  0 
usb_storage            48966  0 
xhci_hcd               85621  0 
uhci_hcd               23950  0 
ehci_pci                3904  0 
ehci_hcd               47493  1 ehci_pci
usbcore               208589  8 etxhci_hcd,usblp,uhci_hcd,usb_storage,ehci_hcd,ehci_pci,usbhid,xhci_hcd
usb_common              1560  1 usbcore
redpill               135333  0 
OrpheeGT commented 2 years ago

Hello, I installed the latest DSM 7.0.1-42218 release with @jumkey's repository. I started an influxdb docker container and it's currently running.

Seems to be stable (screenshot attached).


No particular logs in the serial console:

Xpen_70 login: [   14.474604] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[   14.475965] <redpill/smart_shim.c:388> Generating fake SMART values
[   14.477197] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd1
[   14.478536] <redpill/smart_shim.c:455> Generating fake SMART thresholds
[   14.480448] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed *no* SMART support - pretending it's there
[   14.482388] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[   14.483751] <redpill/smart_shim.c:388> Generating fake SMART values
[   14.485367] loop: module loaded
[   14.521218] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed *no* SMART support - pretending it's there
[   14.874540] Synotify use 16384 event queue size
[   14.875386] Synotify use 16384 event queue size
[   15.182006] warning: `nginx' uses 32-bit capabilities (legacy support in use)
[   15.182300] warning: `nginx' uses 32-bit capabilities (legacy support in use)
[   16.120927] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed *no* SMART support - pretending it's there
[   16.123476] ata5.00: configured for UDMA/100
[   16.124401] ata5: EH complete
[   16.125104] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd8
[   16.126435] <redpill/smart_shim.c:654> Attempted ATA_SMART_ENABLE modification!
[   16.168629] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed *no* SMART support - pretending it's there
[   16.571387] usbcore: registered new interface driver usbserial
[   16.577185] usbcore: registered new interface driver ftdi_sio
[   16.578205] usbserial: USB Serial support registered for FTDI USB Serial Device
[   16.621996] usbcore: registered new interface driver cdc_acm
[   16.622986] cdc_acm: USB Abstract Control Model driver for USB modems and ISDN adapters
[   16.630496] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[   16.631867] <redpill/smart_shim.c:388> Generating fake SMART values
[   16.719511] ata5.00: configured for UDMA/100
[   16.720273] ata5: EH complete
[   16.725986] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed *no* SMART support - pretending it's there
[   16.727804] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed *no* SMART support - pretending it's there
[   17.762999] iSCSI:target_core_rodsp_server.c:1027:rodsp_server_init RODSP server started, login_key(001132417efd).
[   17.767843] iSCSI:extent_pool.c:766:ep_init syno_extent_pool successfully initialized
[   17.776250] iSCSI:target_core_device.c:613:se_dev_align_max_sectors Rounding down aligned max_sectors from 4294967295 to 4294967288
[   17.778270] iSCSI:target_core_configfs.c:5446:target_init_dbroot db_root: cannot open: /etc/target
[   17.780250] iSCSI:target_core_lunbackup.c:361:init_io_buffer_head 512 buffers allocated, total 2097152 bytes successfully
[   17.790888] iSCSI:target_core_file.c:149:fd_attach_hba RODSP plugin for fileio is enabled.
[   17.792367] iSCSI:target_core_file.c:156:fd_attach_hba ODX Token Manager is enabled.
[   17.793682] iSCSI:target_core_multi_file.c:91:fd_attach_hba RODSP plugin for multifile is enabled.
[   17.795351] iSCSI:target_core_ep.c:795:ep_attach_hba RODSP plugin for epio is enabled.
[   17.796685] iSCSI:target_core_ep.c:802:ep_attach_hba ODX Token Manager is enabled.
[   17.850713] workqueue: max_active 1024 requested for vhost_scsi is out of range, clamping between 1 and 512
[   18.174273] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed *no* SMART support - pretending it's there
[   18.227924] <redpill/intercept_execve.c:87> Blocked /usr/syno/bin/syno_pstore_collect from running
[   18.233306] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[   18.234686] <redpill/smart_shim.c:388> Generating fake SMART values
[   18.255711] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[   18.257082] <redpill/smart_shim.c:388> Generating fake SMART values
[   18.345062] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed *no* SMART support - pretending it's there
[   18.425319] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[   18.426682] <redpill/smart_shim.c:388> Generating fake SMART values
[   18.434181] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[   18.435539] <redpill/smart_shim.c:388> Generating fake SMART values
[   18.494775] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[   18.496131] <redpill/smart_shim.c:388> Generating fake SMART values
[   18.497552] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed *no* SMART support - pretending it's there
[   18.505804] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[   18.507155] <redpill/smart_shim.c:388> Generating fake SMART values
[   18.514408] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[   18.515801] <redpill/smart_shim.c:388> Generating fake SMART values
[   18.740019] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[   18.741374] <redpill/smart_shim.c:388> Generating fake SMART values
[   18.743670] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed *no* SMART support - pretending it's there
[   18.889183] 8021q: 802.1Q VLAN Support v1.8
[   18.889963] 8021q: adding VLAN 0 to HW filter on device eth0
[   18.896645] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[   18.898005] <redpill/smart_shim.c:388> Generating fake SMART values
[   18.899392] <redpill/override_symbol.c:244> Obtaining lock for <ffffffffa09315b0>
[   18.900674] <redpill/override_symbol.c:244> Writing original code to <ffffffffa09315b0>
[   18.902035] <redpill/override_symbol.c:244> Released lock for <ffffffffa09315b0>
[   18.903288] <redpill/override_symbol.c:219> Obtaining lock for <ffffffffa09315b0>
[   18.904558] <redpill/override_symbol.c:219> Writing trampoline code to <ffffffffa09315b0>
[   18.905941] <redpill/override_symbol.c:219> Released lock for <ffffffffa09315b0>
[   18.907196] <redpill/bios_hwcap_shim.c:66> proxying GetHwCapability(id=3)->support => real=1 [org_fout=0, ovs_fout=0]
[   18.911384] <redpill/pmu_shim.c:310> Got 2 bytes from PMU: reason=1 hex={2d 72} ascii="-r"
[   18.912812] <redpill/pmu_shim.c:239> Executing cmd OUT_SCHED_UP_OFF handler cmd_shim_noop+0x0/0x30 [redpill]
[   18.914462] <redpill/pmu_shim.c:45> vPMU received OUT_SCHED_UP_OFF using 1 bytes - NOOP
[   18.951793] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed *no* SMART support - pretending it's there
[   19.023872] <redpill/pmu_shim.c:310> Got 2 bytes from PMU: reason=1 hex={2d 4b} ascii="-K"
[   19.025314] <redpill/pmu_shim.c:239> Executing cmd OUT_10G_LED_OFF handler cmd_shim_noop+0x0/0x30 [redpill]
[   19.026958] <redpill/pmu_shim.c:45> vPMU received OUT_10G_LED_OFF using 1 bytes - NOOP
[   19.042923] <redpill/bios_shims_collection.c:43> mfgBIOS: nullify zero-int for VTK_SET_HDD_ACT_LED
[   19.054273] <redpill/override_symbol.c:244> Obtaining lock for <ffffffffa09315b0>
[   19.055556] <redpill/override_symbol.c:244> Writing original code to <ffffffffa09315b0>
[   19.056921] <redpill/override_symbol.c:244> Released lock for <ffffffffa09315b0>
[   19.058181] <redpill/override_symbol.c:219> Obtaining lock for <ffffffffa09315b0>
[   19.059456] <redpill/override_symbol.c:219> Writing trampoline code to <ffffffffa09315b0>
[   19.060842] <redpill/override_symbol.c:219> Released lock for <ffffffffa09315b0>
[   19.062096] <redpill/bios_hwcap_shim.c:66> proxying GetHwCapability(id=2)->support => real=1 [org_fout=0, ovs_fout=0]
[   19.066492] <redpill/pmu_shim.c:310> Got 2 bytes from PMU: reason=1 hex={2d 38} ascii="-8"
[   19.067933] <redpill/pmu_shim.c:239> Executing cmd OUT_STATUS_LED_ON_GREEN handler cmd_shim_noop+0x0/0x30 [redpill]
[   19.069692] <redpill/pmu_shim.c:45> vPMU received OUT_STATUS_LED_ON_GREEN using 1 bytes - NOOP
[   19.083246] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[   19.086115] <redpill/bios_shims_collection.c:42> mfgBIOS: nullify zero-int for VTK_SET_PHY_LED
[   19.096788] <redpill/bios_shims_collection.c:36> mfgBIOS: nullify zero-int for VTK_SET_PWR_LED
[   19.102305] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[   19.116580] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[   19.125610] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[   19.130418] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[   19.135286] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[   19.139731] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[   19.143309] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[   19.146996] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[   19.150641] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[   19.154152] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[   19.157939] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[   19.161723] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[   19.165220] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[   19.168984] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[   20.308969] ip_tables: (C) 2000-2006 Netfilter Core Team
[   20.321433] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
[   20.565485] ip6_tables: (C) 2000-2006 Netfilter Core Team
[   20.602739] aufs 3.10.x-20141110
[   20.614133] Bridge firewalling registered
[   20.628952] cgroup: systemd (1) created nested cgroup for controller "blkio" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.
[   21.271179] <redpill/pmu_shim.c:310> Got 2 bytes from PMU: reason=1 hex={2d 33} ascii="-3"
[   21.272823] <redpill/pmu_shim.c:239> Executing cmd OUT_BUZ_LONG handler cmd_shim_noop+0x0/0x30 [redpill]
[   21.274423] <redpill/pmu_shim.c:45> vPMU received OUT_BUZ_LONG using 1 bytes - NOOP
[   21.928329] Synotify use 16384 event queue size
[   21.929175] Synotify use 16384 event queue size
[   22.804317] fuse init (API version 7.22)
[   22.910500] Initializing XFRM netlink socket
[   22.915779] Netfilter messages via NETLINK v0.30.
[   23.009904] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready
[   23.271436] <redpill/pmu_shim.c:310> Got 2 bytes from PMU: reason=1 hex={2d 38} ascii="-8"
[   23.272890] <redpill/pmu_shim.c:239> Executing cmd OUT_STATUS_LED_ON_GREEN handler cmd_shim_noop+0x0/0x30 [redpill]
[   23.274647] <redpill/pmu_shim.c:45> vPMU received OUT_STATUS_LED_ON_GREEN using 1 bytes - NOOP
[   35.574578] Synotify use 16384 event queue size
[   37.357255] Synotify use 16384 event queue size
[   78.478758] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[   78.480172] <redpill/smart_shim.c:388> Generating fake SMART values
[  138.433640] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[  138.435019] <redpill/smart_shim.c:388> Generating fake SMART values
[  198.388692] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[  198.390080] <redpill/smart_shim.c:388> Generating fake SMART values
[  258.343400] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[  258.344788] <redpill/smart_shim.c:388> Generating fake SMART values
[  318.298220] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[  318.299624] <redpill/smart_shim.c:388> Generating fake SMART values
[  323.767502] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[  323.768946] <redpill/smart_shim.c:388> Generating fake SMART values
[  378.253167] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[  378.254551] <redpill/smart_shim.c:388> Generating fake SMART values
[  438.208073] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[  438.209535] <redpill/smart_shim.c:388> Generating fake SMART values
[  498.163226] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[  498.165127] <redpill/smart_shim.c:388> Generating fake SMART values
[  564.099814] <redpill/bios_shims_collection.c:44> mfgBIOS: nullify zero-int for VTK_GET_MICROP_ID
[  564.341764] <redpill/override_symbol.c:244> Obtaining lock for <ffffffffa09315b0>
[  564.343092] <redpill/override_symbol.c:244> Writing original code to <ffffffffa09315b0>
[  564.344466] <redpill/override_symbol.c:244> Released lock for <ffffffffa09315b0>
[  564.345747] <redpill/override_symbol.c:219> Obtaining lock for <ffffffffa09315b0>
[  564.347059] <redpill/override_symbol.c:219> Writing trampoline code to <ffffffffa09315b0>
[  564.348532] <redpill/override_symbol.c:219> Released lock for <ffffffffa09315b0>
[  564.349847] <redpill/bios_hwcap_shim.c:66> proxying GetHwCapability(id=3)->support => real=1 [org_fout=0, ovs_fout=0]
[  581.768664] <redpill/override_symbol.c:244> Obtaining lock for <ffffffffa09315b0>
[  581.769995] <redpill/override_symbol.c:244> Writing original code to <ffffffffa09315b0>
[  581.771359] <redpill/override_symbol.c:244> Released lock for <ffffffffa09315b0>
[  581.772621] <redpill/override_symbol.c:219> Obtaining lock for <ffffffffa09315b0>
[  581.773912] <redpill/override_symbol.c:219> Writing trampoline code to <ffffffffa09315b0>
[  581.775295] <redpill/override_symbol.c:219> Released lock for <ffffffffa09315b0>
[  581.776545] <redpill/bios_hwcap_shim.c:66> proxying GetHwCapability(id=3)->support => real=1 [org_fout=0, ovs_fout=0]
[  595.051070] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[  595.052407] <redpill/smart_shim.c:388> Generating fake SMART values
[  613.236593] device docker2684dda entered promiscuous mode
[  613.238195] IPv6: ADDRCONF(NETDEV_UP): docker2684dda: link is not ready
[  614.129203] IPv6: ADDRCONF(NETDEV_CHANGE): docker2684dda: link becomes ready
[  614.130448] docker0: port 1(docker2684dda) entered forwarding state
[  614.131505] docker0: port 1(docker2684dda) entered forwarding state
[  614.132576] IPv6: ADDRCONF(NETDEV_CHANGE): docker0: link becomes ready
[  618.168951] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[  618.170317] <redpill/smart_shim.c:388> Generating fake SMART values
[  625.768224] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[  625.769652] <redpill/smart_shim.c:388> Generating fake SMART values
[  629.156896] docker0: port 1(docker2684dda) entered forwarding state
[  678.089459] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[  678.090850] <redpill/smart_shim.c:388> Generating fake SMART values
[  681.275202] <redpill/bios_shims_collection.c:44> mfgBIOS: nullify zero-int for VTK_GET_MICROP_ID
[  738.052874] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[  738.054274] <redpill/smart_shim.c:388> Generating fake SMART values
[  798.001343] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[  798.002739] <redpill/smart_shim.c:388> Generating fake SMART values
[  857.954479] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[  857.955892] <redpill/smart_shim.c:388> Generating fake SMART values
[  917.909508] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[  917.910895] <redpill/smart_shim.c:388> Generating fake SMART values
[  926.085597] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[  926.087018] <redpill/smart_shim.c:388> Generating fake SMART values
[  977.864580] <redpill/smart_shim.c:644> Got SMART *command* - looking for feature=0xd0
[  977.865996] <redpill/smart_shim.c:388> Generating fake SMART values
WiteWulf commented 2 years ago

By "latest release" do you mean the final release of 7.0.1-42218, rather than the Release Candidate (42214)?

Interestingly I had Plex Media Server kernel panic while doing some indexing last night:

[299932.497289] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
[299932.532970] CPU: 3 PID: 29878 Comm: Plex Media Serv Tainted: PF          O 3.10.108 #42214
[299932.574053] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[299932.609116]  ffffffff814a2759 ffffffff814a16b1 0000000000000010 ffff880409ac8d60
[299932.645485]  ffff880409ac8cf8 0000000000000000 0000000000000003 0000000000000001
[299932.682316]  0000000000000003 ffffffff80000001 0000000000000030 ffff8803f4c78c00
[299932.719162] Call Trace:
[299932.731603]  <NMI>  [<ffffffff814a2759>] ? dump_stack+0xc/0x15
[299932.759887]  [<ffffffff814a16b1>] ? panic+0xbb/0x1df
[299932.784686]  [<ffffffff810a9eb8>] ? watchdog_overflow_callback+0xa8/0xb0
[299932.818219]  [<ffffffff810db7d3>] ? __perf_event_overflow+0x93/0x230
[299932.850254]  [<ffffffff810da612>] ? perf_event_update_userpage+0x12/0xf0
[299932.883769]  [<ffffffff810152a4>] ? intel_pmu_handle_irq+0x1b4/0x340
[299932.915418]  [<ffffffff814a9d06>] ? perf_event_nmi_handler+0x26/0x40
[299932.946675]  [<ffffffff814a944e>] ? do_nmi+0xfe/0x440
[299932.971692]  [<ffffffff814a8a53>] ? end_repeat_nmi+0x1e/0x7e
[299932.999870]  <<EOE>> 
[299933.010701] Rebooting in 3 seconds.. 

I'll give the 42218 build a go and see how I get on. I've caught up with your forum posts now and it looks promising 👍

OrpheeGT commented 2 years ago

Seems to be quite stable (screenshot attached).

WiteWulf commented 2 years ago

Not for me :(

It booted up and auto-started my domoticz, librenms and mysql containers, but as soon as I started influxdb it crashed again:

Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 5
[  339.858696] CPU: 5 PID: 26338 Comm: containerd-shim Tainted: PF          O 3.10.108 #42218
[  339.899976] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[  339.935073]  ffffffff814a2759 ffffffff814a16b1 0000000000000010 ffff880409b48d60
[  339.970487]  ffff880409b48cf8 0000000000000000 0000000000000005 0000000000000001
[  340.006961]  0000000000000005 ffffffff80000001 0000000000000030 ffff8803f4d2ec00
[  340.042684] Call Trace:
[  340.054515]  <NMI>  [<ffffffff814a2759>] ? dump_stack+0xc/0x15
[  340.082943]  [<ffffffff814a16b1>] ? panic+0xbb/0x1df
[  340.107169]  [<ffffffff810a9eb8>] ? watchdog_overflow_callback+0xa8/0xb0
[  340.139983]  [<ffffffff810db7d3>] ? __perf_event_overflow+0x93/0x230
[  340.171295]  [<ffffffff810da612>] ? perf_event_update_userpage+0x12/0xf0
[  340.204707]  [<ffffffff810152a4>] ? intel_pmu_handle_irq+0x1b4/0x340
[  340.235469]  [<ffffffff814a9d06>] ? perf_event_nmi_handler+0x26/0x40
[  340.266764]  [<ffffffff814a944e>] ? do_nmi+0xfe/0x440
[  340.291868]  [<ffffffff814a8a53>] ? end_repeat_nmi+0x1e/0x7e
[  340.319670]  <<EOE>> 
[  340.329598] Rebooting in 3 seconds..

OrpheeGT commented 2 years ago

After seeing your test, I tried rebooting my DSM.

Initializing XFRM netlink socket
Netfilter messages via NETLINK v0.30.
IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready
device dockerba315b2 entered promiscuous mode
IPv6: ADDRCONF(NETDEV_UP): dockerba315b2: link is not ready
<redpill/override_symbol.c:244> Obtaining lock for <ffffffffa09385b0>
<redpill/override_symbol.c:244> Writing original code to <ffffffffa09385b0>
<redpill/override_symbol.c:244> Released lock for <ffffffffa09385b0>
<redpill/override_symbol.c:219> Obtaining lock for <ffffffffa09385b0>
<redpill/override_symbol.c:219> Writing trampoline code to <ffffffffa09385b0>
<redpill/override_symbol.c:219> Released lock for <ffffffffa09385b0>
<redpill/bios_hwcap_shim.c:66> proxying GetHwCapability(id=3)->support => real=1 [org_fout=0, ovs_fout=0]
IPv6: ADDRCONF(NETDEV_CHANGE): dockerba315b2: link becomes ready
docker0: port 1(dockerba315b2) entered forwarding state
docker0: port 1(dockerba315b2) entered forwarding state
IPv6: ADDRCONF(NETDEV_CHANGE): docker0: link becomes ready
BUG: soft lockup - CPU#0 stuck for 41s! [runc:12734]
Modules linked in: nfnetlink xfrm_user xfrm_algo fuse bridge stp aufs macvlan veth xt_conntrack xt_addrtype nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipt_MASQUERADE xt_REDIRECT xt_nat iptable_nat nf_nat_ipv4
nf_nat xt_recent xt_iprange xt_limit xt_state xt_tcpudp xt_multiport xt_LOG nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter ip_tables x_tables 8021q vhost_scsi(O) vhost(O) tcm_loop(O) iscsi_target_mod(O) target_core_ep(O) target_core_multi_file(O) target_core_file(O) target_core_iblock(O) target_core_mod(O) syno_extent_pool(PO) rodsp_ep(O) cdc_acm ftdi_sio usbserial udf isofs loop synoacl_vfs(PO) btrfs zstd_decompress ecryptfs zstd_compress xxhash xor raid6_pq adt7475 i2c_i801 aesni_intel glue_helper lrw gf128mul ablk_helper bromolow_synobios(PO) hid_generic zram(C) usbhid hid usblp bnx2x(O) mdio mlx5_core(O) mlx4_en(O) mlx4_core(O) mlx_compat(O) qede(O) qed(O) atlantic_v2(O) atlantic(O) tn40xx(O) i40e(O) ixgbe(O) be2net(O) i2c_algo_bit igb(O) dca e1000e(O) sg dm_snapshot crc_itu_t crc_ccitt psnap p8022 llc zlib_deflate libcrc32c hfsplus md4 hmac sit tunnel4 ipv6 flashcache_syno(O) flashcache(O) syno_flashcache_control(O) dm_mod crc32c_intel cryptd arc4 sha256_generic sha1_generic ecb aes_x86_64 authenc des_generic ansi_cprng cts md5 cbc cpufreq_powersave cpufreq_performance mperf processor thermal_sys cpufreq_stats freq_table vxlan ip_tunnel etxhci_hcd usb_storage xhci_hcd uhci_hcd ehci_pci ehci_hcd usbcore usb_common redpill(OF) [last unloaded: adt7475]
CPU: 0 PID: 12734 Comm: runc Tainted: PF        C O 3.10.108 #42218
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
task: ffff880111e7b040 ti: ffff8800367e0000 task.ti: ffff8800367e0000
RIP: 0010:[<ffffffff8108fee6>]  [<ffffffff8108fee6>] generic_exec_single+0x76/0xe0
RSP: 0000:ffff8800367e3c20  EFLAGS: 00000202
RAX: 00000000000008fb RBX: 00000037ffffffc8 RCX: 0000000000000038
RDX: 0000000000000010 RSI: 00000000000000fb RDI: ffffffff81606630
RBP: ffff8800367e3c60 R08: ffff8801363ba358 R09: 0000000000000020
R10: 0000000000004042 R11: 0000000000000000 R12: 0000004000000001
R13: 0000000000000000 R14: 0000005000000041 R15: ffffffff81894a98
FS:  00007f18986ea740(0000) GS:ffff88013dc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1898549618 CR3: 00000000b6890000 CR4: 00000000001607f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 0000000000000000 ffff8800367e3cb0 0000000000000001 ffffffff818a90d0
 ffffffff8102fcc0 ffffffff8109007e 0000000000000001 ffffffff818a90d0
 ffff88013dd13980 ffff88013dd13980 ffffffff8102fcc0 ffff8800367e3cc0
Call Trace:
 [<ffffffff8102fcc0>] ? do_flush_tlb_all+0x160/0x160
 [<ffffffff8109007e>] ? smp_call_function_single+0x12e/0x150
 [<ffffffff8102fcc0>] ? do_flush_tlb_all+0x160/0x160
 [<ffffffff81030372>] ? flush_tlb_page+0x72/0x130
 [<ffffffff81117572>] ? ptep_clear_flush+0x22/0x30
 [<ffffffff81106a0d>] ? do_wp_page+0x2ad/0x8c0
 [<ffffffff81107e6d>] ? handle_pte_fault+0x38d/0x9e0
 [<ffffffff81108775>] ? handle_mm_fault+0x135/0x2e0
 [<ffffffff8112c6e2>] ? do_sync_read+0x82/0xb0
 [<ffffffff814aba0a>] ? __do_page_fault+0x14a/0x500
 [<ffffffff8112d590>] ? vfs_read+0x140/0x170
 [<ffffffff8112ee34>] ? SyS_read+0x84/0xb0
 [<ffffffff814a8592>] ? page_fault+0x22/0x30
Code: 89 55 08 48 89 2a e8 8a 78 41 00 4c 39 f3 75 0f 44 89 e7 48 8b 05 fb f1 78 00 e8 96 4d 20 00 f6 45 20 01 74 08 f3 90 f6 45 20 01 <75> f8 5b 5d 41 5c 41 5d 41 5e c3 0f 1f 80 00 00 00 00 4c 8d 6b

It froze at boot... Sorry for the false hope...

WiteWulf commented 2 years ago

Getting a new and interesting kernel panic now! :-)

[  593.024485] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
[  593.060023] CPU: 3 PID: 18971 Comm: containerd-shim Tainted: PF          O 3.10.108 #42218
[  593.100803] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[  593.135423]  ffffffff814a2759 ffffffff814a16b1 0000000000000010 ffff880409ac8d60
[  593.172094]  ffff880409ac8cf8 0000000000000000 0000000000000003 0000000000000001
[  593.208074]  0000000000000003 ffffffff80000001 0000000000000030 ffff8803f4c78c00
[  593.243754] Call Trace:
[  593.255367]  <NMI>  [<ffffffff814a2759>] ? dump_stack+0xc/0x15
[  593.284098]  [<ffffffff814a16b1>] ? panic+0xbb/0x1df
[  593.308381]  [<ffffffff810a9eb8>] ? watchdog_overflow_callback+0xa8/0xb0
[  593.340891]  [<ffffffff810db7d3>] ? __perf_event_overflow+0x93/0x230
[  593.371816]  [<ffffffff810da612>] ? perf_event_update_userpage+0x12/0xf0
[  593.404774]  [<ffffffff810152a4>] ? intel_pmu_handle_irq+0x1b4/0x340
[  593.435740]  [<ffffffff814a9d06>] ? perf_event_nmi_handler+0x26/0x40
[  593.467106]  [<ffffffff814a944e>] ? do_nmi+0xfe/0x440
[  593.492121]  [<ffffffff814a8a53>] ? end_repeat_nmi+0x1e/0x7e
[  593.520361]  [<ffffffff8108fee6>] ? generic_exec_single+0x76/0xe0
[  593.550697]  [<ffffffff8108fee6>] ? generic_exec_single+0x76/0xe0
[  593.580730]  [<ffffffff8108fee6>] ? generic_exec_single+0x76/0xe0
[  593.610802]  <<EOE>>  [<ffffffff8102fcc0>] ? do_flush_tlb_all+0x160/0x160
[  593.643849]  [<ffffffff8109007e>] ? smp_call_function_single+0x12e/0x150
[  593.676891]  [<ffffffff8102fcc0>] ? do_flush_tlb_all+0x160/0x160
[  593.706277]  [<ffffffff81030372>] ? flush_tlb_page+0x72/0x130
[  593.734648]  [<ffffffff81117572>] ? ptep_clear_flush+0x22/0x30
[  593.763265]  [<ffffffff81106a0d>] ? do_wp_page+0x2ad/0x8c0
[  593.790617]  [<ffffffff81107e6d>] ? handle_pte_fault+0x38d/0x9e0
[  593.820103]  [<ffffffff81108775>] ? handle_mm_fault+0x135/0x2e0
[  593.848715]  [<ffffffff814aba0a>] ? __do_page_fault+0x14a/0x500
[  593.877749]  [<ffffffff8110f430>] ? do_mmap_pgoff+0x250/0x410
[  593.905799]  [<ffffffff810fc2bb>] ? vm_mmap_pgoff+0x9b/0xc0
[  593.933458]  [<ffffffff814a8592>] ? page_fault+0x22/0x30
[  594.996310] Shutting down cpus with NMI
[  595.015070] Rebooting in 3 seconds..

...and another Plex kernel panic after the reboot:

[  468.368528] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
[  468.403728] CPU: 3 PID: 18919 Comm: Plex Media Serv Tainted: PF          O 3.10.108 #42218
[  468.445354] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[  468.480629]  ffffffff814a2759 ffffffff814a16b1 0000000000000010 ffff880409ac8d60
[  468.517148]  ffff880409ac8cf8 0000000000000000 0000000000000003 0000000000000001
[  468.553582]  0000000000000003 ffffffff80000001 0000000000000030 ffff8803f4c78c00
[  468.588426] Call Trace:
[  468.600328]  <NMI>  [<ffffffff814a2759>] ? dump_stack+0xc/0x15
[  468.629180]  [<ffffffff814a16b1>] ? panic+0xbb/0x1df
[  468.653539]  [<ffffffff810a9eb8>] ? watchdog_overflow_callback+0xa8/0xb0
[  468.685695]  [<ffffffff810db7d3>] ? __perf_event_overflow+0x93/0x230
[  468.716421]  [<ffffffff810da612>] ? perf_event_update_userpage+0x12/0xf0
[  468.748905]  [<ffffffff810152a4>] ? intel_pmu_handle_irq+0x1b4/0x340
[  468.779646]  [<ffffffff814a9d06>] ? perf_event_nmi_handler+0x26/0x40
[  468.810436]  [<ffffffff814a944e>] ? do_nmi+0xfe/0x440
[  468.835257]  [<ffffffff814a8a53>] ? end_repeat_nmi+0x1e/0x7e
[  468.863158]  <<EOE>> 
[  468.873772] Rebooting in 3 seconds..

This was shortly after I manually started a docker container, so may be connected.

pocopico commented 2 years ago

A little more detail on CPU lockups, which I found at the following link:

https://access.redhat.com/solutions/1354963

More info: https://kernel.googlesource.com/pub/scm/linux/kernel/git/arm64/linux/+/v3.1-rc3/Documentation/nmi_watchdog.txt

WiteWulf commented 2 years ago

Thanks @pocopico, I linked to that Red Hat article in the first post at the top of this page :)

As you say, though, increasing the threshold or disabling the watchdog isn't a fix for the kernel panics: you're just delaying or suppressing the OS's response to a problem it has already detected.

FWIW, though, I've tried triggering a kernel panic using dd as described in that article and it's never crashed.
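(For reference, the threshold and disable toggles being discussed are standard Linux sysctls. A minimal read-only inspection sketch is below; the `-w` values in the comments are example workarounds, not recommendations, and are not a fix for the underlying panic:)

```shell
# Inspect the NMI watchdog knobs discussed above (standard Linux sysctls).
# Fails silently on systems without them (e.g. inside a container).
sysctl kernel.nmi_watchdog kernel.watchdog_thresh 2>/dev/null || true

# Workarounds only, not fixes (need root); the values here are assumptions:
#   sysctl -w kernel.watchdog_thresh=30   # widen the lockup window from the default 10s
#   sysctl -w kernel.nmi_watchdog=0       # disable hard-lockup detection entirely
```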

pocopico commented 2 years ago

Sometimes before the panic I get:

perf interrupt took too long (5084 > 10000), lowering kernel.perf_event_max_sample_rate to 25000

All the panics I've seen are related to __perf_event_overflow.

For the record, I've tested prod.v7 and it panics as well.
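(That throttling message is the kernel automatically lowering its perf sampling ceiling because the perf NMI handler ran too long. The relevant limits are visible as sysctls; a read-only sketch, using the standard upstream names:)

```shell
# Read the perf sampling limits that the "perf interrupt took too long" message
# adjusts: kernel.perf_event_max_sample_rate is lowered automatically when perf
# NMI handlers exceed kernel.perf_cpu_time_max_percent of CPU time.
sysctl kernel.perf_event_max_sample_rate kernel.perf_cpu_time_max_percent 2>/dev/null || true
```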

WiteWulf commented 2 years ago
[ 4596.653389] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
[ 4596.688405] Rebooting in 3 seconds..

That was a very terse crash!

Since moving to 7.0.1-42218 and redpill-lkm 3474d9b I'm seeing repeated kernel panics from Plex Media Server (I've stopped running docker now as it was crashing with every container I ran).

[  769.604912] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
[  769.640940] CPU: 3 PID: 29366 Comm: Plex Media Serv Tainted: PF          O 3.10.108 #42218
[  769.681438] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[  769.716091]  ffffffff814a2759 ffffffff814a16b1 0000000000000010 ffff880409ac8d60
[  769.752376]  ffff880409ac8cf8 0000000000000000 0000000000000003 0000000000000001
[  769.789236]  0000000000000003 ffffffff80000001 0000000000000030 ffff8803f4c78c00
[  769.824999] Call Trace:
[  769.837281]  <NMI>  [<ffffffff814a2759>] ? dump_stack+0xc/0x15
[  769.865862]  [<ffffffff814a16b1>] ? panic+0xbb/0x1df
[  769.890602]  [<ffffffff810a9eb8>] ? watchdog_overflow_callback+0xa8/0xb0
[  769.923516]  [<ffffffff810db7d3>] ? __perf_event_overflow+0x93/0x230
[  769.954562]  [<ffffffff810da612>] ? perf_event_update_userpage+0x12/0xf0
[  769.987207]  [<ffffffff810152a4>] ? intel_pmu_handle_irq+0x1b4/0x340
[  770.017939]  [<ffffffff814a9d06>] ? perf_event_nmi_handler+0x26/0x40
[  770.048170]  [<ffffffff814a944e>] ? do_nmi+0xfe/0x440
[  770.073824]  [<ffffffff814a8a53>] ? end_repeat_nmi+0x1e/0x7e
[  770.101708]  <<EOE>> 
[  770.111895] Rebooting in 3 seconds.. 

The Plex crashes weren't happening before on 7.0.1-42218 (RC1) with redpill-lkm 3474d9b or 021ed51 (the older, more stable version I'd been using as a comparison).

Edit: I've had to stop Plex from running on this machine now, as it was crashing the server constantly, even with nothing playing. As mentioned on the forum, my server is now stable with docker and Plex disabled, just not very useful :)

ceozero commented 2 years ago

On a Dell R740xd with ESXi 7.0.2, everything is normal using 918-DSM7.0.1. You cannot install Docker with 3615-DSM7.0.1, otherwise it will crash when you reboot it. Hope this is fixed soon. image

txb2d commented 2 years ago

gen8, 8 x Intel(R) Xeon(R) CPU E3-1265L V2 @ 2.50GHz, PVE system. Running docer and synologyphoto face recognition, the system will crash!!!

WiteWulf commented 2 years ago

Via Google translate, for the benefit of the non-Chinese speaking:

Run docer and synologyphoto face recognition, the system will crash! ! !

ceozero commented 2 years ago

@WiteWulf Did you report this bug on the forum? The team doesn't seem to reply to other posts. https://xpenology.com/forum/topic/45795-redpill-the-new-loader-for-624-discussion/