LINBIT / linstor-server

High Performance Software-Defined Block Storage for container, cloud and virtualisation. Fully integrated with Docker, Kubernetes, Openstack, Proxmox etc.
https://docs.linbit.com/docs/linstor-guide/
GNU General Public License v3.0

CPU stuck when evacuating node #406

Closed 0sorkon closed 6 months ago

0sorkon commented 6 months ago

There is a LINSTOR pool on top of thin LVM on SSD and HDD disks (separate storage pools), with a replication factor of 2. I needed to take one of the three nodes out of service while preserving its resources. When I try to evacuate a node as described in the documentation, the following message appears in the logs:

Apr 26 08:24:59 cloud kernel: [11694193.255222] WARNING: CPU: 36 PID: 4020797 at /var/lib/dkms/drbd/9.2.5-1ppa1~jammy1/build/src/drbd/drbd_bitmap.c:1278 bm_rw_range.constprop.0+0x4d5/0x570 [drbd]
Apr 26 08:24:59 cloud kernel: [11694193.255244] Modules linked in: sctp ip6t_REJECT nf_reject_ipv6 nfsv3 nfs_acl cpuid dm_mirror dm_region_hash dm_log xt_hl ip6_tables ip6t_rt xt_LOG nf_log_syslog vxlan ip6_udp_tunnel udp_tunnel bluetooth ecdh_generic ecc vhost_net vhost vhost_iotlb tap act_police cls_u32 sch_ingress cls_fw sch_sfq sch_htb dm_snapshot ip_set rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache netfs xt_recent dm_writecache nvme_rdma nvmet_rdma nvmet rdma_cm iw_cm ib_cm nvme_fabrics ip_gre ip_tunnel gre 8021q garp mrp bonding ipt_REJECT nf_reject_ipv4 xt_multiport nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nft_counter ipmi_ssif sunrpc nf_tables binfmt_misc nfnetlink intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd dell_wmi ledtrig_audio kvm_amd sparse_keymap video kvm dell_smbios dcdbas rapl wmi_bmof dell_wmi_descriptor joydev input_leds ccp ptdma k10temp acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid
Apr 26 08:24:59 cloud kernel: [11694193.255310] sch_fq_codel drbd_transport_tcp(OE) bcache crc64 drbd(OE) lru_cache br_netfilter bridge stp llc dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear mlx5_ib ib_uverbs ib_core hid_generic dm_cache_smq usbhid hid mgag200 dm_cache i2c_algo_bit drm_kms_helper dm_thin_pool syscopyarea dm_persistent_data sysfillrect sysimgblt fb_sys_fops dm_bio_prison crct10dif_pclmul cec crc32_pclmul dm_bufio ghash_clmulni_intel mlx5_core rc_core mlxfw libcrc32c aesni_intel psample crypto_simd nvme drm ahci xhci_pci tls cryptd tg3 pci_hyperv_intf libahci nvme_core i2c_piix4 megaraid_sas xhci_pci_renesas wmi
Apr 26 08:24:59 cloud kernel: [11694193.255373] CPU: 36 PID: 4020797 Comm: drbdsetup Tainted: G W OE 5.15.0-91-generic #101-Ubuntu
Apr 26 08:24:59 cloud kernel: [11694193.255376] Hardware name: Dell Inc. PowerEdge R6515/035YY8, BIOS 2.12.4 07/27/2023
Apr 26 08:24:59 cloud kernel: [11694193.255378] RIP: 0010:bm_rw_range.constprop.0+0x4d5/0x570 [drbd]
Apr 26 08:24:59 cloud kernel: [11694193.255393] Code: 10 e8 9f 14 a3 de e9 34 ff ff ff 4c 89 ff e8 52 f2 ff ff e9 06 fd ff ff be 03 00 00 00 4c 89 ff e8 80 2f a0 de e9 43 fd ff ff <0f> 0b e9 10 fc ff ff 45 31 f6 e9 bb fc ff ff 48 89 c6 48 c7 c7 40
Apr 26 08:24:59 cloud kernel: [11694193.255395] RSP: 0018:ffffbbcac635f928 EFLAGS: 00010246
Apr 26 08:24:59 cloud kernel: [11694193.255398] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Apr 26 08:24:59 cloud kernel: [11694193.255400] RDX: 00000001ae419f6d RSI: ffffffffc0a50d54 RDI: 0000000000000009
Apr 26 08:24:59 cloud kernel: [11694193.255401] RBP: ffffbbcac635f980 R08: 0000000000000000 R09: 0000000000000009
Apr 26 08:24:59 cloud kernel: [11694193.255402] R10: ffff9402680d8800 R11: 0000000000000118 R12: ffff942aec9ccc00
Apr 26 08:24:59 cloud kernel: [11694193.255404] R13: ffff94147c6bd000 R14: 0000000080010001 R15: ffff942c0a52c000
Apr 26 08:24:59 cloud kernel: [11694193.255406] FS: 00007ff6af297740(0000) GS:ffff942cff100000(0000) knlGS:0000000000000000
Apr 26 08:24:59 cloud kernel: [11694193.255408] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 26 08:24:59 cloud kernel: [11694193.255409] CR2: 00007ffccfa67f98 CR3: 000000390ff88000 CR4: 0000000000350ee0
Apr 26 08:24:59 cloud kernel: [11694193.255411] Call Trace:
Apr 26 08:24:59 cloud kernel: [11694193.255413]
Apr 26 08:24:59 cloud kernel: [11694193.255415] ? show_trace_log_lvl+0x1d6/0x2ea
Apr 26 08:24:59 cloud kernel: [11694193.255421] ? show_trace_log_lvl+0x1d6/0x2ea
Apr 26 08:24:59 cloud kernel: [11694193.255425] ? drbd_bm_write+0x15/0x20 [drbd]
Apr 26 08:24:59 cloud kernel: [11694193.255440] ? show_regs.part.0+0x23/0x29
Apr 26 08:24:59 cloud kernel: [11694193.255443] ? show_regs.cold+0x8/0xd
Apr 26 08:24:59 cloud kernel: [11694193.255446] ? bm_rw_range.constprop.0+0x4d5/0x570 [drbd]
Apr 26 08:24:59 cloud kernel: [11694193.255461] ? warn+0x8c/0x100
Apr 26 08:24:59 cloud kernel: [11694193.255466] ? bm_rw_range.constprop.0+0x4d5/0x570 [drbd]
Apr 26 08:24:59 cloud kernel: [11694193.255483] ? report_bug+0xa4/0xd0
Apr 26 08:24:59 cloud kernel: [11694193.255489] ? handle_bug+0x39/0x90
Apr 26 08:24:59 cloud kernel: [11694193.255493] ? exc_invalid_op+0x19/0x70
Apr 26 08:24:59 cloud kernel: [11694193.255496] ? asm_exc_invalid_op+0x1b/0x20
Apr 26 08:24:59 cloud kernel: [11694193.255502] ? bm_rw_range.constprop.0+0x64/0x570 [drbd]
Apr 26 08:24:59 cloud kernel: [11694193.255520] ? bm_rw_range.constprop.0+0x4d5/0x570 [drbd]
Apr 26 08:24:59 cloud kernel: [11694193.255537] ? bm_rw_range.constprop.0+0x64/0x570 [drbd]
Apr 26 08:24:59 cloud kernel: [11694193.255555] drbd_bm_write+0x15/0x20 [drbd]
Apr 26 08:24:59 cloud kernel: [11694193.255573] clear_peer_slot+0x1e1/0x300 [drbd]
Apr 26 08:24:59 cloud kernel: [11694193.255596] ? kfree+0x1f7/0x250
Apr 26 08:24:59 cloud kernel: [11694193.255600] drbd_adm_peer_device_opts+0x414/0x600 [drbd]
Apr 26 08:24:59 cloud kernel: [11694193.255622] genl_family_rcv_msg_doit+0xe7/0x150
Apr 26 08:24:59 cloud kernel: [11694193.255629] genl_rcv_msg+0xe2/0x1f0
Apr 26 08:24:59 cloud kernel: [11694193.255633] ? drbd_adm_del_minor+0xa0/0xa0 [drbd]
Apr 26 08:24:59 cloud kernel: [11694193.255651] ? genl_get_cmd+0xe0/0xe0
Apr 26 08:24:59 cloud kernel: [11694193.255654] netlink_rcv_skb+0x56/0x100
Apr 26 08:24:59 cloud kernel: [11694193.255659] genl_rcv+0x29/0x40
Apr 26 08:24:59 cloud kernel: [11694193.255663] netlink_unicast+0x223/0x340
Apr 26 08:24:59 cloud kernel: [11694193.255667] netlink_sendmsg+0x24b/0x4c0
Apr 26 08:24:59 cloud kernel: [11694193.255671] sock_sendmsg+0x69/0x70
Apr 26 08:24:59 cloud kernel: [11694193.255677] sock_write_iter+0x93/0xf0
Apr 26 08:24:59 cloud kernel: [11694193.255682] new_sync_write+0x190/0x1a0
Apr 26 08:24:59 cloud kernel: [11694193.255687] vfs_write+0x1d5/0x270
Apr 26 08:24:59 cloud kernel: [11694193.255690] ksys_write+0xb5/0xf0
Apr 26 08:24:59 cloud kernel: [11694193.255694] x64_sys_write+0x19/0x20
Apr 26 08:24:59 cloud kernel: [11694193.255698] do_syscall_64+0x5c/0xc0
Apr 26 08:24:59 cloud kernel: [11694193.255700] ? irqentry_exit_to_user_mode+0x17/0x20
Apr 26 08:24:59 cloud kernel: [11694193.255704] ? irqentry_exit+0x1d/0x30
Apr 26 08:24:59 cloud kernel: [11694193.255707] ? exc_page_fault+0x89/0x170
Apr 26 08:24:59 cloud kernel: [11694193.255711] entry_SYSCALL_64_after_hwframe+0x62/0xcc
Apr 26 08:24:59 cloud kernel: [11694193.255715] RIP: 0033:0x7ff6af3ae887
Apr 26 08:24:59 cloud kernel: [11694193.255718] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
Apr 26 08:24:59 cloud kernel: [11694193.255720] RSP: 002b:00007ffccfa68808 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
Apr 26 08:24:59 cloud kernel: [11694193.255723] RAX: ffffffffffffffda RBX: 0000000000000088 RCX: 00007ff6af3ae887
Apr 26 08:24:59 cloud kernel: [11694193.255725] RDX: 0000000000000088 RSI: 000055eccb76d310 RDI: 0000000000000004
Apr 26 08:24:59 cloud kernel: [11694193.255727] RBP: 000055eccb76d310 R08: 0000000000000001 R09: 0000000000000000
Apr 26 08:24:59 cloud kernel: [11694193.255728] R10: 000055eccacacc50 R11: 0000000000000246 R12: 0000000000000088
Apr 26 08:24:59 cloud kernel: [11694193.255730] R13: 0000000000000004 R14: 00007ffccfa688a0 R15: 000055eccac9e087
Apr 26 08:24:59 cloud kernel: [11694193.255733]
Apr 26 08:24:59 cloud kernel: [11694193.255735] ---[ end trace 02b857fdd52ea5a7 ]---

This message is repeated several times, followed by this message:

Apr 26 08:43:43 cloud kernel: [11695316.513414] watchdog: BUG: soft lockup - CPU#36 stuck for 470s! [rpc-libvirtd:3230760]
Apr 26 08:43:43 cloud kernel: [11695316.513817] Modules linked in: sctp ip6t_REJECT nf_reject_ipv6 nfsv3 nfs_acl cpuid dm_mirror dm_region_hash dm_log xt_hl ip6_tables ip6t_rt xt_LOG nf_log_syslog vxlan ip6_udp_tunnel udp_tunnel bluetooth ecdh_generic ecc vhost_net vhost vhost_iotlb tap act_police cls_u32 sch_ingress cls_fw sch_sfq sch_htb dm_snapshot ip_set rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache netfs xt_recent dm_writecache nvme_rdma nvmet_rdma nvmet rdma_cm iw_cm ib_cm nvme_fabrics ip_gre ip_tunnel gre 8021q garp mrp bonding ipt_REJECT nf_reject_ipv4 xt_multiport nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nft_counter ipmi_ssif sunrpc nf_tables binfmt_misc nfnetlink intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd dell_wmi ledtrig_audio kvm_amd sparse_keymap video kvm dell_smbios dcdbas rapl wmi_bmof dell_wmi_descriptor joydev input_leds ccp ptdma k10temp acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid
Apr 26 08:43:43 cloud kernel: [11695316.513861] sch_fq_codel drbd_transport_tcp(OE) bcache crc64 drbd(OE) lru_cache br_netfilter bridge stp llc dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear mlx5_ib ib_uverbs ib_core hid_generic dm_cache_smq usbhid hid mgag200 dm_cache i2c_algo_bit drm_kms_helper dm_thin_pool syscopyarea dm_persistent_data sysfillrect sysimgblt fb_sys_fops dm_bio_prison crct10dif_pclmul cec crc32_pclmul dm_bufio ghash_clmulni_intel mlx5_core rc_core mlxfw libcrc32c aesni_intel psample crypto_simd nvme drm ahci xhci_pci tls cryptd tg3 pci_hyperv_intf libahci nvme_core i2c_piix4 megaraid_sas xhci_pci_renesas wmi
Apr 26 08:43:43 cloud kernel: [11695316.520516] CPU: 36 PID: 3230760 Comm: rpc-libvirtd Tainted: G D W OEL 5.15.0-91-generic #101-Ubuntu
Apr 26 08:43:43 cloud kernel: [11695316.521012] Hardware name: Dell Inc. PowerEdge R6515/035YY8, BIOS 2.12.4 07/27/2023
Apr 26 08:43:43 cloud kernel: [11695316.521516] RIP: 0010:smp_call_function_single+0xdf/0x120
Apr 26 08:43:43 cloud kernel: [11695316.522012] Code: 25 28 00 00 00 75 5a c9 e9 ce 09 e7 00 48 89 e6 48 89 54 24 18 4c 89 44 24 10 e8 2c fe ff ff 8b 54 24 08 83 e2 01 74 0b f3 90 <8b> 54 24 08 83 e2 01 75 f5 eb c2 9c 58 0f 1f 44 00 00 f6 c4 02 0f
Apr 26 08:43:43 cloud kernel: [11695316.523077] RSP: 0018:ffffbbcacbc53ba0 EFLAGS: 00000202
Apr 26 08:43:43 cloud kernel: [11695316.523603] RAX: 0000000000000000 RBX: 00000037dfefd1a7 RCX: ffff93f571bf99b8
Apr 26 08:43:43 cloud kernel: [11695316.524130] RDX: 0000000000000001 RSI: ffffbbcacbc53ba0 RDI: ffffbbcacbc53ba0
Apr 26 08:43:43 cloud kernel: [11695316.524657] RBP: ffffbbcacbc53be8 R08: ffffffff9ee59e50 R09: 000000000830107a
Apr 26 08:43:43 cloud kernel: [11695316.525186] R10: 0000000000ffff10 R11: 000000000000000f R12: 000000000001fbe0
Apr 26 08:43:43 cloud kernel: [11695316.525718] R13: 0000000000000001 R14: ffff942cfe840000 R15: 0000000000000001
Apr 26 08:43:43 cloud kernel: [11695316.526241] FS: 00007f91c2aeb640(0000) GS:ffff942cff100000(0000) knlGS:0000000000000000
Apr 26 08:43:43 cloud kernel: [11695316.526755] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 26 08:43:43 cloud kernel: [11695316.527270] CR2: 00007fa734a5019c CR3: 000000299bb90000 CR4: 0000000000350ee0
Apr 26 08:43:43 cloud kernel: [11695316.527780] Call Trace:
Apr 26 08:43:43 cloud kernel: [11695316.528267]
Apr 26 08:43:43 cloud kernel: [11695316.528753] ? show_trace_log_lvl+0x1d6/0x2ea
Apr 26 08:43:43 cloud kernel: [11695316.529242] ? show_trace_log_lvl+0x1d6/0x2ea
Apr 26 08:43:43 cloud kernel: [11695316.529740] ? aperfmperf_snapshot_cpu+0x83/0xe0
Apr 26 08:43:43 cloud kernel: [11695316.530225] ? show_regs.part.0+0x23/0x29
Apr 26 08:43:43 cloud kernel: [11695316.530706] ? show_regs.cold+0x8/0xd
Apr 26 08:43:43 cloud kernel: [11695316.531172] ? watchdog_timer_fn+0x1be/0x220
Apr 26 08:43:43 cloud kernel: [11695316.531669] ? lockup_detector_update_enable+0x60/0x60
Apr 26 08:43:43 cloud kernel: [11695316.532119] ? hrtimer_run_queues+0x107/0x230
Apr 26 08:43:43 cloud kernel: [11695316.532566] ? clockevents_program_event+0xad/0x130
Apr 26 08:43:43 cloud kernel: [11695316.533001] ? hrtimer_interrupt+0x101/0x220
Apr 26 08:43:43 cloud kernel: [11695316.533413] watchdog: BUG: soft lockup - CPU#43 stuck for 414s! [rpc-libvirtd:3230763]
Apr 26 08:43:43 cloud kernel: [11695316.533429] ? __sysvec_apic_timer_interrupt+0x61/0xe0
Apr 26 08:43:43 cloud kernel: [11695316.534118] Modules linked in: sctp ip6t_REJECT
Apr 26 08:43:43 cloud kernel: [11695316.534848] ? sysvec_apic_timer_interrupt+0x7b/0x90
Apr 26 08:43:43 cloud kernel: [11695316.535513] nf_reject_ipv6
Apr 26 08:43:43 cloud kernel: [11695316.536204]
Apr 26 08:43:43 cloud kernel: [11695316.536205] nfsv3
Apr 26 08:43:43 cloud kernel: [11695316.536645]
Apr 26 08:43:43 cloud kernel: [11695316.537263] nfs_acl cpuid
Apr 26 08:43:43 cloud kernel: [11695316.537916] ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
Apr 26 08:43:43 cloud kernel: [11695316.538515] dm_mirror
Apr 26 08:43:43 cloud kernel: [11695316.539140] ? aperfmperf_snapshot_cpu+0xe0/0xe0
Apr 26 08:43:43 cloud kernel: [11695316.539742] dm_region_hash
Apr 26 08:43:43 cloud kernel: [11695316.540368] ? smp_call_function_single+0xdf/0x120
Apr 26 08:43:43 cloud kernel: [11695316.540920] dm_log
Apr 26 08:43:43 cloud kernel: [11695316.541470] ? aperfmperf_snapshot_cpu+0xe0/0xe0
Apr 26 08:43:43 cloud kernel: [11695316.542007] xt_hl
Apr 26 08:43:43 cloud kernel: [11695316.542591] aperfmperf_snapshot_cpu+0x83/0xe0
Apr 26 08:43:43 cloud kernel: [11695316.543106] ip6_tables
Apr 26 08:43:43 cloud kernel: [11695316.543709] ? recalibrate_cpu_khz+0x10/0x10
Apr 26 08:43:43 cloud kernel: [11695316.544200] ip6t_rt xt_LOG
Apr 26 08:43:43 cloud kernel: [11695316.544748] aperfmperf_get_khz+0x56/0xa0
Apr 26 08:43:43 cloud kernel: [11695316.545220] nf_log_syslog vxlan
Apr 26 08:43:43 cloud kernel: [11695316.545763] show_cpuinfo+0x400/0x5f0
Apr 26 08:43:43 cloud kernel: [11695316.546206] ip6_udp_tunnel udp_tunnel
Apr 26 08:43:43 cloud kernel: [11695316.546733] ? cpumask_next+0x23/0x30
Apr 26 08:43:43 cloud kernel: [11695316.547162] bluetooth
Apr 26 08:43:43 cloud kernel: [11695316.547660] seq_read_iter+0x2c8/0x4b0
Apr 26 08:43:43 cloud kernel: [11695316.548060] ecdh_generic ecc
Apr 26 08:43:43 cloud kernel: [11695316.548553] proc_reg_read_iter+0x2f/0x90
Apr 26 08:43:43 cloud kernel: [11695316.548925] vhost_net
Apr 26 08:43:43 cloud kernel: [11695316.549377] new_sync_read+0x10d/0x190
Apr 26 08:43:43 cloud kernel: [11695316.549771] vhost
Apr 26 08:43:43 cloud kernel: [11695316.550214] vfs_read+0x103/0x1a0
Apr 26 08:43:43 cloud kernel: [11695316.550594] vhost_iotlb
Apr 26 08:43:43 cloud kernel: [11695316.551021] ksys_read+0x67/0xf0
Apr 26 08:43:43 cloud kernel: [11695316.551370] tap act_police
Apr 26 08:43:43 cloud kernel: [11695316.551791] x64_sys_read+0x19/0x20
Apr 26 08:43:43 cloud kernel: [11695316.552113] cls_u32
Apr 26 08:43:43 cloud kernel: [11695316.552527] do_syscall_64+0x5c/0xc0
Apr 26 08:43:43 cloud kernel: [11695316.552846] sch_ingress
Apr 26 08:43:43 cloud kernel: [11695316.553260] ? exit_to_user_mode_loop+0x10d/0x160
Apr 26 08:43:43 cloud kernel: [11695316.553587] cls_fw
Apr 26 08:43:43 cloud kernel: [11695316.553997] ? exit_to_user_mode_prepare+0x96/0xb0
Apr 26 08:43:43 cloud kernel: [11695316.554345] sch_sfq
Apr 26 08:43:43 cloud kernel: [11695316.554762] ? syscall_exit_to_user_mode+0x35/0x50
Apr 26 08:43:43 cloud kernel: [11695316.555086] sch_htb
Apr 26 08:43:43 cloud kernel: [11695316.555516] ? do_syscall_64+0x69/0xc0
Apr 26 08:43:43 cloud kernel: [11695316.555844] dm_snapshot
Apr 26 08:43:43 cloud kernel: [11695316.556260] ? exit_to_user_mode_prepare+0x37/0xb0
Apr 26 08:43:43 cloud kernel: [11695316.556576] ip_set
Apr 26 08:43:43 cloud kernel: [11695316.556966] ? syscall_exit_to_user_mode+0x35/0x50
Apr 26 08:43:43 cloud kernel: [11695316.557279] rpcsec_gss_krb5
Apr 26 08:43:43 cloud kernel: [11695316.557649] ? x64_sys_newuname+0x12/0x20
Apr 26 08:43:43 cloud kernel: [11695316.557959] auth_rpcgss
Apr 26 08:43:43 cloud kernel: [11695316.558366] ? do_syscall_64+0x69/0xc0
Apr 26 08:43:43 cloud kernel: [11695316.558681] nfsv4
Apr 26 08:43:43 cloud kernel: [11695316.559085] ? do_syscall_64+0x69/0xc0
Apr 26 08:43:43 cloud kernel: [11695316.559420] nfs lockd
Apr 26 08:43:43 cloud kernel: [11695316.559788] ? do_syscall_64+0x69/0xc0
Apr 26 08:43:43 cloud kernel: [11695316.560093] grace
Apr 26 08:43:43 cloud kernel: [11695316.560459] ? do_syscall_64+0x69/0xc0
Apr 26 08:43:43 cloud kernel: [11695316.560764] fscache
Apr 26 08:43:43 cloud kernel: [11695316.561123] entry_SYSCALL_64_after_hwframe+0x62/0xcc
Apr 26 08:43:43 cloud kernel: [11695316.561429] netfs
Apr 26 08:43:43 cloud kernel: [11695316.561788] RIP: 0033:0x7f91c638781c
Apr 26 08:43:43 cloud kernel: [11695316.562097] xt_recent
Apr 26 08:43:43 cloud kernel: [11695316.562465] Code: ec 28 48 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 e9 c1 f7 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 34 44 89 c7 48 89 44 24 08 e8 2f c2 f7 ff 48
Apr 26 08:43:43 cloud kernel: [11695316.562771] dm_writecache
Apr 26 08:43:43 cloud kernel: [11695316.563157] RSP: 002b:00007f91c2aea6e0 EFLAGS: 00000246
Apr 26 08:43:43 cloud kernel: [11695316.563745] nvme_rdma
Apr 26 08:43:43 cloud kernel: [11695316.564189] ORIG_RAX: 0000000000000000
Apr 26 08:43:43 cloud kernel: [11695316.564546] nvmet_rdma
Apr 26 08:43:43 cloud kernel: [11695316.564955] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f91c638781c
Apr 26 08:43:43 cloud kernel: [11695316.565314] nvmet rdma_cm
Apr 26 08:43:43 cloud kernel: [11695316.565751] RDX: 0000000000001000 RSI: 00007f91bc2fc2a0 RDI: 0000000000000022
Apr 26 08:43:43 cloud kernel: [11695316.566123] iw_cm ib_cm
Apr 26 08:43:43 cloud kernel: [11695316.566579] RBP: 00007f91bc2fc2a0 R08: 0000000000000000 R09: 00007f91bc2fc2a0
Apr 26 08:43:43 cloud kernel: [11695316.566959] nvme_fabrics
Apr 26 08:43:43 cloud kernel: [11695316.567417] R10: 00007f91bc001c70 R11: 0000000000000246 R12: 0000000000001000
Apr 26 08:43:43 cloud kernel: [11695316.567812] ip_gre
Apr 26 08:43:43 cloud kernel: [11695316.568252] R13: 0000000000000022 R14: 0000000000000000 R15: 0000000000000000
Apr 26 08:43:43 cloud kernel: [11695316.568653] ip_tunnel
Apr 26 08:43:43 cloud kernel: [11695316.569106]
Apr 26 08:43:43 cloud kernel: [11695316.569528] gre 8021q garp mrp bonding ipt_REJECT nf_reject_ipv4 xt_multiport nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nft_counter ipmi_ssif sunrpc nf_tables binfmt_misc nfnetlink intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd dell_wmi ledtrig_audio kvm_amd sparse_keymap video kvm dell_smbios dcdbas rapl wmi_bmof dell_wmi_descriptor joydev input_leds ccp ptdma k10temp acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid sch_fq_codel drbd_transport_tcp(OE) bcache crc64 drbd(OE) lru_cache br_netfilter bridge stp llc dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear mlx5_ib ib_uverbs ib_core hid_generic dm_cache_smq usbhid hid mgag200 dm_cache i2c_algo_bit drm_kms_helper dm_thin_pool syscopyarea
Apr 26 08:43:43 cloud kernel: [11695316.570454] dm_persistent_data sysfillrect sysimgblt fb_sys_fops dm_bio_prison crct10dif_pclmul cec crc32_pclmul dm_bufio ghash_clmulni_intel mlx5_core rc_core mlxfw libcrc32c aesni_intel psample crypto_simd nvme drm ahci xhci_pci tls cryptd tg3 pci_hyperv_intf libahci nvme_core i2c_piix4 megaraid_sas xhci_pci_renesas wmi
Apr 26 08:43:43 cloud kernel: [11695316.574237] CPU: 43 PID: 3230763 Comm: rpc-libvirtd Tainted: G D W OEL 5.15.0-91-generic #101-Ubuntu
Apr 26 08:43:43 cloud kernel: [11695316.574645] Hardware name: Dell Inc. PowerEdge R6515/035YY8, BIOS 2.12.4 07/27/2023
Apr 26 08:43:43 cloud kernel: [11695316.575031] RIP: 0010:smp_call_function_single+0xdf/0x120
Apr 26 08:43:43 cloud kernel: [11695316.575422] Code: 25 28 00 00 00 75 5a c9 e9 ce 09 e7 00 48 89 e6 48 89 54 24 18 4c 89 44 24 10 e8 2c fe ff ff 8b 54 24 08 83 e2 01 74 0b f3 90 <8b> 54 24 08 83 e2 01 75 f5 eb c2 9c 58 0f 1f 44 00 00 f6 c4 02 0f
Apr 26 08:43:43 cloud kernel: [11695316.576216] RSP: 0018:ffffbbcac8043be0 EFLAGS: 00000202
Apr 26 08:43:43 cloud kernel: [11695316.576613] RAX: 0000000000000000 RBX: 00000045d836a126 RCX: ffff93f571bfe638
Apr 26 08:43:43 cloud kernel: [11695316.577014] RDX: 0000000000000001 RSI: ffffbbcac8043be0 RDI: ffffbbcac8043be0
Apr 26 08:43:43 cloud kernel: [11695316.577419] RBP: ffffbbcac8043c20 R08: ffffffff9ee59e50 R09: 000000000830107a
Apr 26 08:43:43 cloud kernel: [11695316.577825] R10: 0000000000ffff10 R11: 000000000000000f R12: 000000000001fbe0
Apr 26 08:43:43 cloud kernel: [11695316.578236] R13: 0000000000000001 R14: ffff942cfe840000 R15: 0000000000000001
Apr 26 08:43:43 cloud kernel: [11695316.578646] FS: 00007f91c12e8640(0000) GS:ffff942cff2c0000(0000) knlGS:0000000000000000
Apr 26 08:43:43 cloud kernel: [11695316.579060] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 26 08:43:43 cloud kernel: [11695316.579475] CR2: 00007f2a2486f620 CR3: 000000299bb90000 CR4: 0000000000350ee0
Apr 26 08:43:43 cloud kernel: [11695316.579896] Call Trace:
Apr 26 08:43:43 cloud kernel: [11695316.580311]
Apr 26 08:43:43 cloud kernel: [11695316.580725] ? show_trace_log_lvl+0x1d6/0x2ea
Apr 26 08:43:43 cloud kernel: [11695316.581140] ? show_trace_log_lvl+0x1d6/0x2ea
Apr 26 08:43:43 cloud kernel: [11695316.581566] ? aperfmperf_snapshot_cpu+0x83/0xe0
Apr 26 08:43:43 cloud kernel: [11695316.581962] ? show_regs.part.0+0x23/0x29
Apr 26 08:43:43 cloud kernel: [11695316.582375] ? show_regs.cold+0x8/0xd
Apr 26 08:43:43 cloud kernel: [11695316.582764] ? watchdog_timer_fn+0x1be/0x220
Apr 26 08:43:43 cloud kernel: [11695316.583141] ? lockup_detector_update_enable+0x60/0x60
Apr 26 08:43:43 cloud kernel: [11695316.583518] ? __hrtimer_run_queues+0x107/0x230
Apr 26 08:43:43 cloud kernel: [11695316.583896] ? clockevents_program_event+0xad/0x130
Apr 26 08:43:43 cloud kernel: [11695316.584277] ? hrtimer_interrupt+0x101/0x220
Apr 26 08:43:43 cloud kernel: [11695316.584654] ? sysvec_apic_timer_interrupt+0x61/0xe0
Apr 26 08:43:43 cloud kernel: [11695316.585029] ? sysvec_apic_timer_interrupt+0x7b/0x90
Apr 26 08:43:43 cloud kernel: [11695316.585401]
Apr 26 08:43:43 cloud kernel: [11695316.585767]
Apr 26 08:43:43 cloud kernel: [11695316.586127] ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
Apr 26 08:43:43 cloud kernel: [11695316.586490] ? aperfmperf_snapshot_cpu+0xe0/0xe0
Apr 26 08:43:43 cloud kernel: [11695316.586850] ? smp_call_function_single+0xdf/0x120
Apr 26 08:43:43 cloud kernel: [11695316.587207] ? aperfmperf_snapshot_cpu+0xe0/0xe0
Apr 26 08:43:43 cloud kernel: [11695316.587608] aperfmperf_snapshot_cpu+0x83/0xe0
Apr 26 08:43:43 cloud kernel: [11695316.587958] ? recalibrate_cpu_khz+0x10/0x10
Apr 26 08:43:43 cloud kernel: [11695316.588311] aperfmperf_get_khz+0x56/0xa0
Apr 26 08:43:43 cloud kernel: [11695316.588653] show_cpuinfo+0x400/0x5f0
Apr 26 08:43:43 cloud kernel: [11695316.588987] ? cpumask_next+0x23/0x30
Apr 26 08:43:43 cloud kernel: [11695316.589314] seq_read_iter+0x2c8/0x4b0
Apr 26 08:43:43 cloud kernel: [11695316.589642] proc_reg_read_iter+0x2f/0x90
Apr 26 08:43:43 cloud kernel: [11695316.589949] new_sync_read+0x10d/0x190
Apr 26 08:43:43 cloud kernel: [11695316.590246] vfs_read+0x103/0x1a0
Apr 26 08:43:43 cloud kernel: [11695316.590558] ksys_read+0x67/0xf0
Apr 26 08:43:43 cloud kernel: [11695316.590835] x64_sys_read+0x19/0x20
Apr 26 08:43:43 cloud kernel: [11695316.591101] do_syscall_64+0x5c/0xc0
Apr 26 08:43:43 cloud kernel: [11695316.591357] ? switch_fpu_return+0x4e/0xc0
Apr 26 08:43:43 cloud kernel: [11695316.591607] ? exit_to_user_mode_prepare+0x96/0xb0
Apr 26 08:43:43 cloud kernel: [11695316.591870] ? syscall_exit_to_user_mode+0x35/0x50
Apr 26 08:43:43 cloud kernel: [11695316.592142] ? do_syscall_64+0x69/0xc0
Apr 26 08:43:43 cloud kernel: [11695316.592374] ? do_syscall_64+0x69/0xc0
Apr 26 08:43:43 cloud kernel: [11695316.592594] ? x64_sys_write+0x19/0x20
Apr 26 08:43:43 cloud kernel: [11695316.592805] ? do_syscall_64+0x69/0xc0
Apr 26 08:43:43 cloud kernel: [11695316.593009] ? do_syscall_64+0x69/0xc0
Apr 26 08:43:43 cloud kernel: [11695316.593208] ? do_syscall_64+0x69/0xc0
Apr 26 08:43:43 cloud kernel: [11695316.593401] entry_SYSCALL_64_after_hwframe+0x62/0xcc
Apr 26 08:43:43 cloud kernel: [11695316.593598] RIP: 0033:0x7f91c638781c
Apr 26 08:43:43 cloud kernel: [11695316.593790] Code: ec 28 48 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 e9 c1 f7 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 34 44 89 c7 48 89 44 24 08 e8 2f c2 f7 ff 48
Apr 26 08:43:43 cloud kernel: [11695316.594203] RSP: 002b:00007f91c12e76e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Apr 26 08:43:43 cloud kernel: [11695316.594435] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f91c638781c
Apr 26 08:43:43 cloud kernel: [11695316.594649] RDX: 0000000000001000 RSI: 00007f91ac04e0b0 RDI: 0000000000000023
Apr 26 08:43:43 cloud kernel: [11695316.594856] RBP: 00007f91ac04e0b0 R08: 0000000000000000 R09: 00007f91ac04e0b0
Apr 26 08:43:43 cloud kernel: [11695316.595067] R10: 00007f91ac000250 R11: 0000000000000246 R12: 0000000000001000
Apr 26 08:43:43 cloud kernel: [11695316.595283] R13: 0000000000000023 R14: 0000000000000000 R15: 0000000000000000
Apr 26 08:43:43 cloud kernel: [11695316.595539]

In the end, there is no other option but to reboot the node. The same situation has happened several times with other pool members. The only things that helped stabilise the situation were running linstor node restore and deactivating resources with linstor r deact to stop replication, because, in my opinion, it was the active replication of resources that caused this problem. Could you please tell me what could be causing this, and how to properly shut down a server so that the resources deployed on it are migrated correctly?

ghernadi commented 6 months ago

Hello, thanks for the report, although this is more of a DRBD issue than a LINSTOR one.

The first block is "just a warning" that should have been fixed in a more recent version of DRBD. The other stack traces come from libvirtd, so they are from neither DRBD nor LINSTOR, at least not directly. It is possible that those "quite noisy" DRBD warnings contributed somehow to this libvirtd behavior, but that is hard to tell.

I will however close this issue now, since this is not an issue with LINSTOR (i.e. wrong project). Please try upgrading DRBD, and if the issue persists, please open a new issue in the DRBD project.
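For anyone triaging a similar log before filing with the DRBD project, here is a minimal Python sketch that separates the two kinds of kernel messages discussed in this thread: the WARNING entries (here from drbd_bitmap.c) and the soft lockup reports (here from rpc-libvirtd). It is not part of LINSTOR, DRBD, or libvirt. The function name `triage` and both regular expressions are assumptions for illustration, modelled only on the messages quoted in this issue, and may need adjusting for other kernels.

```python
import re
from collections import Counter

# Hypothetical triage helper; not part of LINSTOR, DRBD, or libvirt.
# Both patterns are modelled only on the kernel messages quoted in this issue.
WARNING_RE = re.compile(r"WARNING: CPU: (\d+) PID: \d+ at (\S+)")
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(\d+) stuck for (\d+)s! \[([^:\]]+):\d+\]"
)


def triage(log_text):
    """Return (warning counts per source location, list of soft lockups)."""
    # Count how often each source location (file:line) raised a WARNING.
    warnings = Counter(m.group(2) for m in WARNING_RE.finditer(log_text))
    # Collect every soft lockup with its CPU, duration, and task name.
    lockups = [
        {"cpu": int(m.group(1)), "seconds": int(m.group(2)), "task": m.group(3)}
        for m in LOCKUP_RE.finditer(log_text)
    ]
    return warnings, lockups


if __name__ == "__main__":
    # Two entries taken from the traces in this issue.
    sample = "\n".join([
        "Apr 26 08:24:59 cloud kernel: [11694193.255222] WARNING: CPU: 36 "
        "PID: 4020797 at /var/lib/dkms/drbd/9.2.5-1ppa1~jammy1/build/src/"
        "drbd/drbd_bitmap.c:1278 bm_rw_range.constprop.0+0x4d5/0x570 [drbd]",
        "Apr 26 08:43:43 cloud kernel: [11695316.513414] watchdog: BUG: "
        "soft lockup - CPU#36 stuck for 470s! [rpc-libvirtd:3230760]",
    ])
    warnings, lockups = triage(sample)
    for location, count in warnings.items():
        print(f"{count} x WARNING at {location}")
    for lockup in lockups:
        print(f"soft lockup: CPU {lockup['cpu']}, "
              f"{lockup['seconds']}s, task {lockup['task']}")
```

Feeding it the output of something like journalctl -k saved to a file gives a quick count of how often the DRBD warning fired and which tasks were reported as stuck, which is the kind of summary worth including in a report to the DRBD project.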