LINBIT / drbd

LINBIT DRBD kernel module
https://docs.linbit.com/docs/users-guide-9.0/
GNU General Public License v2.0

Kernel Panic with 9.2.8 #86

Closed by AleksZimin 7 months ago

AleksZimin commented 8 months ago

Hello,

We are experiencing a critical issue with DRBD 9.2.8 in a 5-node cluster. Occasionally, several servers in the cluster reboot unexpectedly; in one instance, all servers rebooted simultaneously. During the most recent reboots, we were able to capture the full call traces shown below.

First Incident Call Trace:

Mar 17 01:38:20 offine-stand-stor-0  [ 4333.309042] drbd pvc-daaa4fed-540a-4bb8-ad50-d6ba07126479 offline-stand-stor-2: Preparing remote state change 2720219251
Mar 17 01:38:20 offine-stand-stor-0  [ 4333.329658] drbd pvc-daaa4fed-540a-4bb8-ad50-d6ba07126479 offline-stand-stor-2: Committing remote state change 2720219251 (primary_nodes=0)
Mar 17 01:38:20 offine-stand-stor-0  [ 4333.337403] drbd pvc-daaa4fed-540a-4bb8-ad50-d6ba07126479/0 drbd1070 offline-stand-stor-2: pdsk( UpToDate -> Detaching ) [remote]
Mar 17 01:38:20 offine-stand-stor-0  [ 4333.361074] drbd pvc-daaa4fed-540a-4bb8-ad50-d6ba07126479: Preparing cluster-wide state change 2146026655 (1->-1 7680/1024)
Mar 17 01:38:20 offine-stand-stor-0  [ 4333.368190] drbd pvc-daaa4fed-540a-4bb8-ad50-d6ba07126479: State change 2146026655: primary_nodes=0, weak_nodes=0
Mar 17 01:38:20 offine-stand-stor-0  [ 4333.373618] drbd pvc-daaa4fed-540a-4bb8-ad50-d6ba07126479: Committing cluster-wide state change 2146026655 (12ms)
Mar 17 01:38:20 offine-stand-stor-0  [ 4333.381997] drbd pvc-daaa4fed-540a-4bb8-ad50-d6ba07126479/0 drbd1070 offline-stand-stor-2: pdsk( Detaching -> Diskless ) [peer-state]
Mar 17 01:38:20 offine-stand-stor-0  [ 4333.390526] drbd pvc-daaa4fed-540a-4bb8-ad50-d6ba07126479/0 drbd1070: disk( UpToDate -> Detaching ) [detach]
Mar 17 01:38:20 offine-stand-stor-0  [ 4333.407933] drbd pvc-daaa4fed-540a-4bb8-ad50-d6ba07126479/0 drbd1070: Would lose quorum, but using tiebreaker logic to keep
Mar 17 01:38:20 offine-stand-stor-0  [ 4333.410349] drbd pvc-daaa4fed-540a-4bb8-ad50-d6ba07126479/0 drbd1070: disk( Detaching -> Diskless ) [go-diskless]
Mar 17 01:38:20 offine-stand-stor-0  [ 4333.430930] drbd pvc-daaa4fed-540a-4bb8-ad50-d6ba07126479/0 drbd1070: drbd_bm_resize called with capacity == 0
Mar 17 01:38:20 offine-stand-stor-0  [ 4333.495886] eth0: renamed from tmp1ef17
Mar 17 01:38:20 offine-stand-stor-0  [ 4333.531000] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Mar 17 01:38:20 offine-stand-stor-0  [ 4333.531916] IPv6: ADDRCONF(NETDEV_CHANGE): lxce958c5d240e4: link becomes ready
Mar 17 01:38:21 offine-stand-stor-0  [ 4333.876144] drbd pvc-daaa4fed-540a-4bb8-ad50-d6ba07126479: ASSERTION context->flags & CS_SERIALIZE FAILED in change_cluster_wide_state
Mar 17 01:38:21 offine-stand-stor-0  [ 4333.877717] drbd pvc-daaa4fed-540a-4bb8-ad50-d6ba07126479: State change failed: State change was refused by peer node (-10)
Mar 17 01:38:21 offine-stand-stor-0  [ 4333.878487] drbd pvc-daaa4fed-540a-4bb8-ad50-d6ba07126479: Failed: susp-io( no -> quorum ) [del-minor]
Mar 17 01:38:21 offine-stand-stor-0  [ 4333.879278] drbd /unregistered/pvc-daaa4fed-540a-4bb8-ad50-d6ba07126479/0 drbd1070: Failed: quorum( yes -> no ) [del-minor]
Mar 17 01:38:21 offine-stand-stor-0  [ 4333.880013] drbd pvc-daaa4fed-540a-4bb8-ad50-d6ba07126479/0 drbd1070 offline-stand-stor-3: Failed: pdsk( Diskless -> DUnknown ) repl( Established -> Off ) [del-minor]
Mar 17 01:38:21 offine-stand-stor-0  [ 4333.882034] drbd pvc-daaa4fed-540a-4bb8-ad50-d6ba07126479: ASSERTION context->flags & CS_SERIALIZE FAILED in change_cluster_wide_state
Mar 17 01:38:21 offine-stand-stor-0  [ 4333.884237] drbd pvc-daaa4fed-540a-4bb8-ad50-d6ba07126479: State change failed: State change was refused by peer node (-10)
Mar 17 01:38:21 offine-stand-stor-0  [ 4333.885172] drbd pvc-daaa4fed-540a-4bb8-ad50-d6ba07126479: Failed: susp-io( no -> quorum ) [del-minor]
Mar 17 01:38:21 offine-stand-stor-0  [ 4333.886105] drbd /unregistered/pvc-daaa4fed-540a-4bb8-ad50-d6ba07126479/0 drbd1070: Failed: quorum( yes -> no ) [del-minor]
Mar 17 01:38:21 offine-stand-stor-0  [ 4333.887037] drbd pvc-daaa4fed-540a-4bb8-ad50-d6ba07126479/0 drbd1070 offline-stand-stor-2: Failed: pdsk( Diskless -> DUnknown ) repl( Established -> Off ) [del-minor]
Mar 17 01:38:21 offine-stand-stor-0  [ 4334.013360] WARNING: chroot access!
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.082850] WARNING: chroot access!
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.693626] drbd pvc-daaa4fed-540a-4bb8-ad50-d6ba07126479 offline-stand-stor-3: conn( Connected -> Disconnecting ) peer( Secondary -> Unknown ) [del-peer]
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.703867] drbd pvc-daaa4fed-540a-4bb8-ad50-d6ba07126479 offline-stand-stor-3: Terminating sender thread
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.706924] drbd pvc-daaa4fed-540a-4bb8-ad50-d6ba07126479 offline-stand-stor-3: Starting sender thread (from drbd_r_pvc-daaa [13435])
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.762465] BUG: kernel NULL pointer dereference, address: 000000000000078c
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.763253] #PF: supervisor write access in kernel mode
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.763993] #PF: error_code(0x0002) - not-present page
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.764865] PGD 0 P4D 0 
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.765591] Oops: 0002 [#1] SMP PTI
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.766259] CPU: 5 PID: 13435 Comm: drbd_r_pvc-daaa Kdump: loaded Tainted: G           OE     5.15.0-83-generic #astra1+ci14
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.767117] Hardware name: Supermicro SYS-5039MS-H8TRF/X11SSD-F, BIOS 2.3 12/20/2019
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.767795] RIP: 0010:_raw_spin_lock_irq+0x17/0x40
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.768501] Code: cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 fa 66 0f 1f 44 00 00 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 06 5d c3 cc cc cc cc 89 c6 e8 d6 c5 42 ff 66 90 5d
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.769906] RSP: 0018:ffffa24d3e1e3c48 EFLAGS: 00010046
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.770660] RAX: 0000000000000000 RBX: ffff902fc347e780 RCX: 0000000000000000
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.771336] RDX: 0000000000000001 RSI: ffffa24d3e1e3ca0 RDI: 000000000000078c
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.771997] RBP: ffffa24d3e1e3c48 R08: ffff90306da773e0 R09: ffff90306da773e0
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.772757] R10: ffff90306da773e0 R11: ffff90306da773e0 R12: 0000000000000001
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.773396] R13: ffff902e8a073000 R14: 000000000000078c R15: ffff90306da77000
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.774032] FS:  0000000000000000(0000) GS:ffff9035d7b40000(0000) knlGS:0000000000000000
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.774712] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.775423] CR2: 000000000000078c CR3: 000000058e410005 CR4: 00000000003706e0
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.776071] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.776690] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.777303] Call Trace:
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.777930]  <TASK>
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.778568]  ? show_regs.cold.16+0x1a/0x1f
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.779240]  ? __die_body+0x1f/0x70
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.779861]  ? __die+0x2a/0x35
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.780448]  ? page_fault_oops+0x136/0x2b0
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.781054]  ? do_user_addr_fault+0x33e/0x660
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.781667]  ? finish_task_switch+0x81/0x2a0
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.782385]  ? exc_page_fault+0x7e/0x170
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.783023]  ? asm_exc_page_fault+0x27/0x30
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.783604]  ? _raw_spin_lock_irq+0x17/0x40
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.784143]  drbd_free_peer_req+0xa9/0x240 [drbd]
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.784688]  drbd_finish_peer_reqs+0xc2/0x180 [drbd]
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.785211]  drain_resync_activity+0x579/0xdc0 [drbd]
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.785720]  ? wake_up_q+0x4e/0x90
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.786204]  ? __mutex_unlock_slowpath.isra.24+0x9c/0x110
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.786691]  ? mutex_unlock+0x26/0x30
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.787162]  conn_disconnect+0x1b3/0xa40 [drbd]
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.787643]  drbd_receiver+0x5ef/0x990 [drbd]
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.788103]  ? drbd_unplug_all_devices+0x50/0x50 [drbd]
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.788593]  drbd_thread_setup+0x85/0x1e0 [drbd]
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.789081]  ? inc_open_count+0xb0/0xb0 [drbd]
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.789532]  kthread+0x12d/0x150
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.790280]  ? set_kthread_struct+0x50/0x50
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.790903]  ret_from_fork+0x1f/0x30
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.791402]  </TASK>
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.791792] Modules linked in: netconsole(E) drbd_transport_tcp(OE) udp_diag(E) ip_set(E) xt_CT(E) cls_bpf(E) sch_ingress(E) vxlan(E) ip6_udp_tunnel(E) udp_tunnel(E) sch_fq(E) bcache(E) crc64(E) dm_cache(E) dm_writecache(E) xfrm_user(E) xfrm_algo(E) veth(E) nvme_rdma(E) nvme_fabrics(E) nvmet_rdma(E) nvmet(E) nvme_core(E) rdma_cm(E) iw_cm(E) ib_cm(E) nf_tables(E) ib_core(E) nfnetlink(E) xt_socket(E) nf_socket_ipv4(E) nf_socket_ipv6(E) ip6table_raw(E) iptable_raw(E) ip6table_filter(E) ip6table_nat(E) ip6table_mangle(E) ip6_tables(E) xt_MASQUERADE(E) xt_mark(E) iptable_nat(E) nf_nat(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) xt_comment(E) iptable_filter(E) iptable_mangle(E) bpfilter(E) dm_thin_pool(E) dm_persistent_data(E) dm_bio_prison(E) dm_bufio(E) tcp_diag(E) inet_diag(E) aufs(E) overlay(E) intel_rapl_msr(E) intel_rapl_common(E) intel_tcc_cooling(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) crct10dif_pclmul(E)
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.791850]  crc32_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) crypto_simd(E) cryptd(E) rapl(E) intel_cstate(E) ipmi_ssif(E) ast(E) drm_vram_helper(E) drm_ttm_helper(E) ttm(E) drm_kms_helper(E) cec(E) rc_core(E) mei_me(E) drm(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) joydev(E) sysimgblt(E) ee1004(E) mei(E) input_leds(E) intel_pch_thermal(E) ie31200_edac(E) acpi_ipmi(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) acpi_power_meter(E) acpi_pad(E) mac_hid(E) handshake(OE) drbd(OE) lru_cache(E) libcrc32c(E) br_netfilter(E) bridge(E) stp(E) llc(E) parport_pc(E) ppdev(E) lp(E) parport(E) sunrpc(E) ip_tables(E) x_tables(E) autofs4(E) hid_generic(E) usbhid(E) hid(E) i2c_i801(E) i2c_smbus(E) igb(E) intel_ish_ipc(E) xhci_pci(E) i2c_algo_bit(E) xhci_pci_renesas(E) intel_ishtp(E) dca(E) video(E) parsec(OE) digsig_verif(OE) [last unloaded: netconsole]
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.798957] CR2: 000000000000078c
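For readers triaging this trace: the `#PF: error_code(0x0002)` line can be decoded bit by bit. The sketch below uses the standard x86 page-fault error code layout; the function name `decode_pf_error_code` is made up for illustration and is not part of DRBD or the kernel. Note also that the faulting address 0x78c equals both RDI and R14 in the register dump, and RDI carries the first argument (the lock pointer) at the faulting `_raw_spin_lock_irq` instruction. This is consistent with, though it does not prove, a lock embedded at a small offset in a structure reached through a NULL pointer.

```python
# Low bits of the x86 page-fault error code: (mask, meaning if set, meaning if clear).
PF_BITS = [
    (0x01, "protection violation", "not-present page"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "supervisor mode"),
    (0x08, "reserved bit set", None),
    (0x10, "instruction fetch", None),
]


def decode_pf_error_code(code: int) -> list[str]:
    """Return human-readable descriptions for an x86 #PF error code."""
    out = []
    for mask, if_set, if_clear in PF_BITS:
        text = if_set if code & mask else if_clear
        if text is not None:
            out.append(text)
    return out


# Error code taken from the oops above.
print(decode_pf_error_code(0x0002))
# ['not-present page', 'write access', 'supervisor mode']
```

The decoded result matches the kernel's own summary lines ("supervisor write access in kernel mode", "not-present page").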

Second Incident Call Trace:

Mar 17 01:59:55 offine-stand-stor-0  [  460.678989] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a offline-stand-stor-2: Preparing remote state change 3569889254
Mar 17 01:59:55 offine-stand-stor-0  [  460.699329] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a offline-stand-stor-2: Committing remote state change 3569889254 (primary_nodes=0)
Mar 17 01:59:55 offine-stand-stor-0  [  460.702481] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a/0 drbd1054 offline-stand-stor-2: pdsk( UpToDate -> Detaching ) [remote]
Mar 17 01:59:55 offine-stand-stor-0  [  460.719818] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a: Preparing cluster-wide state change 411107339 (0->-1 7680/1024)
Mar 17 01:59:55 offine-stand-stor-0  [  460.721900] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a/0 drbd1054 offline-stand-stor-2: pdsk( Detaching -> Diskless ) [peer-state]
Mar 17 01:59:55 offine-stand-stor-0  [  460.735629] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a: State change 411107339: primary_nodes=0, weak_nodes=0
Mar 17 01:59:55 offine-stand-stor-0  [  460.736683] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a: Committing cluster-wide state change 411107339 (20ms)
Mar 17 01:59:55 offine-stand-stor-0  [  460.737923] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a/0 drbd1054: disk( UpToDate -> Detaching ) [detach]
Mar 17 01:59:55 offine-stand-stor-0  [  460.740155] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a/0 drbd1054: Would lose quorum, but using tiebreaker logic to keep
Mar 17 01:59:55 offine-stand-stor-0  [  460.740886] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a/0 drbd1054: disk( Detaching -> Diskless ) [go-diskless]
Mar 17 01:59:55 offine-stand-stor-0  [  460.758615] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a/0 drbd1054: drbd_bm_resize called with capacity == 0
Mar 17 01:59:56 offine-stand-stor-0  [  461.096657] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a: ASSERTION context->flags & CS_SERIALIZE FAILED in change_cluster_wide_state
Mar 17 01:59:56 offine-stand-stor-0  [  461.098095] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a: State change failed: State change was refused by peer node (-10)
Mar 17 01:59:56 offine-stand-stor-0  [  461.098830] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a: Failed: susp-io( no -> quorum ) [del-minor]
Mar 17 01:59:56 offine-stand-stor-0  [  461.099539] drbd /unregistered/pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a/0 drbd1054: Failed: quorum( yes -> no ) [del-minor]
Mar 17 01:59:56 offine-stand-stor-0  [  461.100281] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a/0 drbd1054 offline-stand-stor-2: Failed: pdsk( Diskless -> DUnknown ) repl( Established -> Off ) [del-minor]
Mar 17 01:59:56 offine-stand-stor-0  [  461.102205] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a: ASSERTION context->flags & CS_SERIALIZE FAILED in change_cluster_wide_state
Mar 17 01:59:56 offine-stand-stor-0  [  461.104149] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a: State change failed: State change was refused by peer node (-10)
Mar 17 01:59:56 offine-stand-stor-0  [  461.105104] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a: Failed: susp-io( no -> quorum ) [del-minor]
Mar 17 01:59:56 offine-stand-stor-0  [  461.106169] drbd /unregistered/pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a/0 drbd1054: Failed: quorum( yes -> no ) [del-minor]
Mar 17 01:59:56 offine-stand-stor-0  [  461.107126] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a/0 drbd1054 offline-stand-stor-1: Failed: pdsk( Diskless -> DUnknown ) repl( Established -> Off ) [del-minor]
Mar 17 01:59:57 offine-stand-stor-0  [  462.212691] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a offline-stand-stor-1: sock was shut down by peer
Mar 17 01:59:57 offine-stand-stor-0  [  462.212741] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a offline-stand-stor-1: meta connection shut down by peer.
Mar 17 01:59:57 offine-stand-stor-0  [  462.213491] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a offline-stand-stor-1: conn( Connected -> BrokenPipe ) peer( Secondary -> Unknown )
Mar 17 01:59:57 offine-stand-stor-0  [  462.215789] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a: Preparing cluster-wide state change 605233927 (0->-1 0/0)
Mar 17 01:59:57 offine-stand-stor-0  [  462.233366] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a offline-stand-stor-1: Terminating sender thread
Mar 17 01:59:57 offine-stand-stor-0  [  462.234120] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a offline-stand-stor-1: Starting sender thread (from drbd_r_pvc-fc9d [12946])
Mar 17 01:59:57 offine-stand-stor-0  [  462.235535] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a: State change 605233927: primary_nodes=0, weak_nodes=0
Mar 17 01:59:57 offine-stand-stor-0  [  462.236317] drbd pvc-fc9def6f-a8c8-442f-beb1-f5a61262fc7a: Committing cluster-wide state change 605233927 (24ms)
Mar 17 01:59:57 offine-stand-stor-0  [  462.263975] BUG: kernel NULL pointer dereference, address: 000000000000038c
Mar 17 01:59:57 offine-stand-stor-0  [  462.264705] #PF: supervisor write access in kernel mode
Mar 17 01:59:57 offine-stand-stor-0  [  462.265422] #PF: error_code(0x0002) - not-present page
Mar 17 01:59:57 offine-stand-stor-0  [  462.266124] PGD 0 P4D 0 
Mar 17 01:59:57 offine-stand-stor-0  [  462.266813] Oops: 0002 [#1] SMP PTI
Mar 17 01:59:57 offine-stand-stor-0  [  462.267525] CPU: 1 PID: 12946 Comm: drbd_r_pvc-fc9d Kdump: loaded Tainted: G           OE     5.15.0-83-generic #astra1+ci14
Mar 17 01:59:57 offine-stand-stor-0  [  462.268226] Hardware name: Supermicro SYS-5039MS-H8TRF/X11SSD-F, BIOS 2.3 12/20/2019
Mar 17 01:59:57 offine-stand-stor-0  [  462.268918] RIP: 0010:_raw_spin_lock_irq+0x17/0x40
Mar 17 01:59:57 offine-stand-stor-0  [  462.269616] Code: cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 fa 66 0f 1f 44 00 00 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 06 5d c3 cc cc cc cc 89 c6 e8 d6 c5 42 ff 66 90 5d
Mar 17 01:59:57 offine-stand-stor-0  [  462.271066] RSP: 0018:ffffb93ebe2b7c48 EFLAGS: 00010046
Mar 17 01:59:57 offine-stand-stor-0  [  462.271840] RAX: 0000000000000000 RBX: ffff9ec8c6fc3e40 RCX: 0000000000000000
Mar 17 01:59:57 offine-stand-stor-0  [  462.272555] RDX: 0000000000000001 RSI: ffffb93ebe2b7ca0 RDI: 000000000000038c
Mar 17 01:59:57 offine-stand-stor-0  [  462.273248] RBP: ffffb93ebe2b7c48 R08: ffff9eca8f6fd3e0 R09: ffff9eca8f6fd3e0
Mar 17 01:59:57 offine-stand-stor-0  [  462.273944] R10: ffff9eca8f6fd3e0 R11: ffff9eca8f6fd3e0 R12: 0000000000000001
Mar 17 01:59:57 offine-stand-stor-0  [  462.274600] R13: ffff9ec9e441b800 R14: 000000000000038c R15: ffff9eca8f6fd000
Mar 17 01:59:57 offine-stand-stor-0  [  462.275252] FS:  0000000000000000(0000) GS:ffff9ed017a40000(0000) knlGS:0000000000000000
Mar 17 01:59:57 offine-stand-stor-0  [  462.275900] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 17 01:59:57 offine-stand-stor-0  [  462.276568] CR2: 000000000000038c CR3: 0000000312c10006 CR4: 00000000003706e0
Mar 17 01:59:57 offine-stand-stor-0  [  462.277207] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 17 01:59:57 offine-stand-stor-0  [  462.277862] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Mar 17 01:59:57 offine-stand-stor-0  [  462.278534] Call Trace:
Mar 17 01:59:57 offine-stand-stor-0  [  462.279259]  <TASK>
Mar 17 01:59:57 offine-stand-stor-0  [  462.279938]  ? show_regs.cold.16+0x1a/0x1f
Mar 17 01:59:57 offine-stand-stor-0  [  462.280548]  ? __die_body+0x1f/0x70
Mar 17 01:59:57 offine-stand-stor-0  [  462.281175]  ? __die+0x2a/0x35
Mar 17 01:59:57 offine-stand-stor-0  [  462.281752]  ? page_fault_oops+0x136/0x2b0
Mar 17 01:59:57 offine-stand-stor-0  [  462.282375]  ? do_user_addr_fault+0x33e/0x660
Mar 17 01:59:57 offine-stand-stor-0  [  462.282953]  ? finish_task_switch+0x81/0x2a0
Mar 17 01:59:57 offine-stand-stor-0  [  462.283551]  ? exc_page_fault+0x7e/0x170
Mar 17 01:59:57 offine-stand-stor-0  [  462.284180]  ? asm_exc_page_fault+0x27/0x30
Mar 17 01:59:57 offine-stand-stor-0  [  462.284738]  ? _raw_spin_lock_irq+0x17/0x40
Mar 17 01:59:57 offine-stand-stor-0  [  462.285355]  drbd_free_peer_req+0xa9/0x240 [drbd]
Mar 17 01:59:57 offine-stand-stor-0  [  462.285905]  drbd_finish_peer_reqs+0xc2/0x180 [drbd]
Mar 17 01:59:57 offine-stand-stor-0  [  462.286463]  drain_resync_activity+0x579/0xdc0 [drbd]
Mar 17 01:59:57 offine-stand-stor-0  [  462.287001]  ? wake_up_q+0x4e/0x90
Mar 17 01:59:57 offine-stand-stor-0  [  462.287483]  ? __mutex_unlock_slowpath.isra.24+0x9c/0x110
Mar 17 01:59:57 offine-stand-stor-0  [  462.288077]  ? mutex_unlock+0x26/0x30
Mar 17 01:59:57 offine-stand-stor-0  [  462.288539]  conn_disconnect+0x1b3/0xa40 [drbd]
Mar 17 01:59:57 offine-stand-stor-0  [  462.289045]  drbd_receiver+0x5ef/0x990 [drbd]
Mar 17 01:59:57 offine-stand-stor-0  [  462.289515]  ? drbd_unplug_all_devices+0x50/0x50 [drbd]
Mar 17 01:59:57 offine-stand-stor-0  [  462.290042]  drbd_thread_setup+0x85/0x1e0 [drbd]
Mar 17 01:59:57 offine-stand-stor-0  [  462.290486]  ? inc_open_count+0xb0/0xb0 [drbd]
Mar 17 01:59:57 offine-stand-stor-0  [  462.290925]  kthread+0x12d/0x150
Mar 17 01:59:57 offine-stand-stor-0  [  462.291348]  ? set_kthread_struct+0x50/0x50
Mar 17 01:59:57 offine-stand-stor-0  [  462.291761]  ret_from_fork+0x1f/0x30
Mar 17 01:59:57 offine-stand-stor-0  [  462.292192]  </TASK>
Mar 17 01:59:57 offine-stand-stor-0  [  462.292583] Modules linked in: drbd_transport_tcp(OE) udp_diag(E) ip_set(E) xt_CT(E) cls_bpf(E) sch_ingress(E) vxlan(E) ip6_udp_tunnel(E) udp_tunnel(E) sch_fq(E) bcache(E) crc64(E) xfrm_user(E) dm_cache(E) xfrm_algo(E) dm_writecache(E) veth(E) nf_tables(E) nfnetlink(E) xt_socket(E) nf_socket_ipv4(E) nf_socket_ipv6(E) ip6table_raw(E) iptable_raw(E) nvme_rdma(E) nvme_fabrics(E) nvmet_rdma(E) nvmet(E) nvme_core(E) rdma_cm(E) iw_cm(E) ib_cm(E) ib_core(E) ip6table_filter(E) ip6table_nat(E) ip6table_mangle(E) ip6_tables(E) xt_MASQUERADE(E) xt_mark(E) iptable_nat(E) nf_nat(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) xt_comment(E) iptable_filter(E) iptable_mangle(E) bpfilter(E) dm_thin_pool(E) dm_persistent_data(E) dm_bio_prison(E) dm_bufio(E) tcp_diag(E) inet_diag(E) aufs(E) overlay(E) intel_rapl_msr(E) intel_rapl_common(E) intel_tcc_cooling(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) crct10dif_pclmul(E) crc32_pclmul(E)
Mar 17 01:59:57 offine-stand-stor-0  [  462.292625]  ghash_clmulni_intel(E) aesni_intel(E) crypto_simd(E) cryptd(E) rapl(E) intel_cstate(E) ipmi_ssif(E) ast(E) drm_vram_helper(E) drm_ttm_helper(E) ttm(E) drm_kms_helper(E) cec(E) rc_core(E) drm(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) input_leds(E) joydev(E) sysimgblt(E) ee1004(E) mei_me(E) acpi_ipmi(E) intel_pch_thermal(E) mei(E) ipmi_si(E) ie31200_edac(E) ipmi_devintf(E) ipmi_msghandler(E) acpi_power_meter(E) acpi_pad(E) mac_hid(E) netconsole(E) handshake(OE) drbd(OE) lru_cache(E) libcrc32c(E) br_netfilter(E) bridge(E) stp(E) llc(E) parport_pc(E) ppdev(E) lp(E) parport(E) sunrpc(E) ip_tables(E) x_tables(E) autofs4(E) hid_generic(E) usbhid(E) hid(E) i2c_i801(E) i2c_smbus(E) igb(E) intel_ish_ipc(E) xhci_pci(E) i2c_algo_bit(E) xhci_pci_renesas(E) intel_ishtp(E) dca(E) video(E) parsec(OE) digsig_verif(OE)
Mar 17 01:59:57 offine-stand-stor-0  [  462.299942] CR2: 000000000000038c
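Both traces above end in the same chain (`drbd_free_peer_req` called from `drbd_finish_peer_reqs`, `drain_resync_activity`, `conn_disconnect`, `drbd_receiver`). A minimal sketch for extracting the reliable frames from such a log, so that several incidents can be compared, might look like the following. The helper name `reliable_frames` and the regular expression are illustrative assumptions that target the log format shown in this issue; frames the kernel prefixes with `?` are skipped because they are unreliable stack guesses.

```python
import re

# Matches e.g. "[ 4334.784143]  drbd_free_peer_req+0xa9/0x240 [drbd]".
# Group 1: optional "? " marker, group 2: symbol, group 3: optional module name.
FRAME_RE = re.compile(
    r"\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?\s*$"
)


def reliable_frames(log_text):
    """Return (symbol, module) pairs for frames not marked with '?'."""
    frames = []
    in_trace = False
    for line in log_text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        if "</TASK>" in line:
            in_trace = False
            continue
        if not in_trace:
            continue
        m = FRAME_RE.search(line)
        if m and m.group(1) is None:
            frames.append((m.group(2), m.group(3)))
    return frames


# Short excerpt copied from the first incident above.
SAMPLE = """\
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.777303] Call Trace:
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.777930]  <TASK>
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.783604]  ? _raw_spin_lock_irq+0x17/0x40
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.784143]  drbd_free_peer_req+0xa9/0x240 [drbd]
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.784688]  drbd_finish_peer_reqs+0xc2/0x180 [drbd]
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.785720]  ? wake_up_q+0x4e/0x90
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.789532]  kthread+0x12d/0x150
Mar 17 01:38:22 offine-stand-stor-0  [ 4334.791402]  </TASK>
"""

for symbol, module in reliable_frames(SAMPLE):
    print(symbol, module or "(core kernel)")
```

Running the same function over each incident and comparing the resulting lists is a quick way to check that two crashes share a code path before treating them as one bug.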

DRBD version:

cat /proc/drbd
version: 9.2.8 (api:2/proto:86-122)
GIT-hash:123456 build by @offine-stand-stor-0, 2024-03-14 14:27:44
Transports (api:20): tcp (9.2.8)

The attached log file contains more detail on the kernel panic: offine-stand-stor-0.log

Thank you in advance for your support.

ksyblast commented 7 months ago

We have a similar issue in Kubernetes (https://github.com/piraeusdatastore/piraeus/issues/178); it looks related.

Philipp-Reisner commented 7 months ago

Fixed with commits 857db82c989b36993ff7a3df3944c9862db1408d and 343e077e9664b203e5ebf8146dacc5c869b80e30.