LINBIT / drbd

LINBIT DRBD kernel module
https://docs.linbit.com/docs/users-guide-9.0/
GNU General Public License v2.0
574 stars 97 forks source link

drbd with rdma gets stuck when shutting down a resource #58

Open wzrdtales opened 1 year ago

wzrdtales commented 1 year ago

update:

https://github.com/LINBIT/drbd/blob/460cfc1025ba5abb63ac9ed2895d6cec178bb39c/drbd/drbd_transport_rdma.c#L572

this is the line it gets stuck.

drbd with rdma gets stuck when disconnecting a resource in sync

here are the logs we could retrieve:

n 12 15:24:16 os-hv-am-4 kernel: [  232.569733] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9/0 drbd8 os-hv-am-3: helper command: /sbin/drbdadm initial-split-brain
Jun 12 15:24:16 os-hv-am-4 kernel: [  232.571421] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9/0 drbd8 os-hv-am-3: helper command: /sbin/drbdadm initial-split-brain exit code 0
Jun 12 15:24:16 os-hv-am-4 kernel: [  232.571439] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9/0 drbd8 os-hv-am-3: Split-Brain detected but unresolved, dropping connection!
Jun 12 15:24:16 os-hv-am-4 kernel: [  232.572513] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9/0 drbd8 os-hv-am-3: helper command: /sbin/drbdadm split-brain
Jun 12 15:24:16 os-hv-am-4 kernel: [  232.574101] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9/0 drbd8 os-hv-am-3: helper command: /sbin/drbdadm split-brain exit code 0
Jun 12 15:24:16 os-hv-am-4 kernel: [  232.574129] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9: Aborting cluster-wide state change 3531582257 (20ms) rv = -27
Jun 12 15:24:16 os-hv-am-4 kernel: [  232.574141] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: conn( Connecting -> Disconnecting )
Jun 12 15:24:16 os-hv-am-4 kernel: [  232.574164] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: sock_recvmsg returned -4
Jun 12 15:24:16 os-hv-am-4 kernel: [  232.574242] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: Terminating sender thread
Jun 12 15:24:16 os-hv-am-4 kernel: [  232.574260] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: Starting sender thread (from drbd_r_vm-57da0 [3519])
Jun 12 15:24:16 os-hv-am-4 kernel: [  232.576227] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: pp_use = 1307, expected 0
Jun 12 15:24:16 os-hv-am-4 kernel: [  232.576229] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: Connection closed
Jun 12 15:24:16 os-hv-am-4 kernel: [  232.576239] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: helper command: /sbin/drbdadm disconnected
Jun 12 15:24:16 os-hv-am-4 kernel: [  232.577812] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: helper command: /sbin/drbdadm disconnected exit code 0
Jun 12 15:24:16 os-hv-am-4 kernel: [  232.577830] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: conn( Disconnecting -> StandAlone )
Jun 12 15:24:16 os-hv-am-4 kernel: [  232.577834] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: Terminating receiver thread
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.140302] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: conn( StandAlone -> Unconnected )
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.140350] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: Starting receiver thread (from drbd_w_vm-57da0 [3500])
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.140513] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: conn( Unconnected -> Connecting )
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.148270] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: Handshake to peer 1 successful: Agreed network protocol version 121
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.148273] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: Feature flags enabled on protocol level: 0x3f TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES RESYNC_DAGTAG
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.148316] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: Peer authenticated using 20 bytes HMAC
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.157481] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9: Preparing cluster-wide state change 105203894 (0->1 499/145)
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.193492] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9/0 drbd8 os-hv-am-3: drbd_sync_handshake:
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.193497] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9/0 drbd8 os-hv-am-3: self 78ECCE9CE229714C:20F264F5A750B1D5:23B3AB54E1225FE6:940F040BBBAE35D4 bits:1264914 flags:22
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.193500] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9/0 drbd8 os-hv-am-3: peer 5B8C3B4DEBC7DD0A:20F264F5A750B1D4:9AD186D68FBE0520:0000000000000000 bits:1264658 flags:1025
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.193502] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9/0 drbd8 os-hv-am-3: uuid_compare()=split-brain-auto-recover by rule=bitmap-both
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.193512] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9/0 drbd8 os-hv-am-3: helper command: /sbin/drbdadm initial-split-brain
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.195114] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9/0 drbd8 os-hv-am-3: helper command: /sbin/drbdadm initial-split-brain exit code 0
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.195125] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9/0 drbd8 os-hv-am-3: Split-Brain detected, manually solved. Sync from this node
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.195139] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9: State change 105203894: primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFC
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.195142] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9: Committing cluster-wide state change 105203894 (36ms)
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.195155] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.195157] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9/0 drbd8 os-hv-am-3: pdsk( Outdated -> Inconsistent ) repl( Off -> WFBitMapS )
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.200092] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9/0 drbd8 os-hv-am-3: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 494(1), total 494; compression: 100.0%
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.205201] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9/0 drbd8 os-hv-am-3: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 494(1), total 494; compression: 100.0%
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.205214] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9/0 drbd8 os-hv-am-3: helper command: /sbin/drbdadm before-resync-source
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.206764] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9/0 drbd8 os-hv-am-3: helper command: /sbin/drbdadm before-resync-source exit code 0
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.206775] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9/0 drbd8 os-hv-am-3: repl( WFBitMapS -> SyncSource )
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.206824] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9/0 drbd8 os-hv-am-3: Began resync as SyncSource (will sync 5059656 KB [1264914 bits set]).
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207282] ------------[ cut here ]------------
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207293] WARNING: CPU: 114 PID: 0 at kernel/softirq.c:168 __local_bh_enable_ip+0x3a/0x60
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207294] Modules linked in: drbd_transport_rdma(OE) nvmet_rdma nvmet drbd(OE) lru_cache nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter bpfilter aufs overlay nls_iso8859_1 ipmi_ssif amd64_edac_mod rpcrdma edac_mce_amd sunrpc rdma_ucm kvm_amd ib_iser kvm rdma_cm iw_cm crct10dif_pclmul libiscsi ib_umad ghash_clmulni_intel scsi_transport_iscsi ib_ipoib aesni_intel ib_cm crypto_simd cryptd glue_helper dell_smbios dcdbas mgag200 drm_vram_helper dell_wmi_descriptor wmi_bmof ttm drm_kms_helper i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt ccp k10temp ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid bridge stp llc bonding sch_fq_codel ramoops reed_solomon drm efi_pstore ip_tables x_tables autofs4 raid10 raid456
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207334]  async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear mlx4_ib mlx4_en mlx5_ib ib_uverbs raid1 ib_core crc32_pclmul mlx5_core mlx4_core tg3 ahci nvme pci_hyperv_intf libahci tls nvme_core mlxfw i2c_piix4 wmi
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207348] CPU: 114 PID: 0 Comm: swapper/114 Tainted: G        W  OE     5.4.0-139-generic #156-Ubuntu
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207349] Hardware name: Dell Inc. PowerEdge R6515/0R4CNN, BIOS 2.7.3 03/31/2022
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207353] RIP: 0010:__local_bh_enable_ip+0x3a/0x60
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207355] Code: a9 00 00 0f 00 75 23 83 ee 01 f7 de 65 01 35 05 af 37 7e 65 8b 05 fe ae 37 7e a9 00 ff 1f 00 74 0d 5d 65 ff 0d ef ae 37 7e c3 <0f> 0b eb d9 65 66 8b 05 da e9 38 7e 66 85 c0 74 e6 e8 60 ff ff ff
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207357] RSP: 0018:ffffab6e41e18c48 EFLAGS: 00010006
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207358] RAX: 0000000080010200 RBX: ffff8a5c2380c800 RCX: 0000000000000000
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207359] RDX: 0000000000000001 RSI: 0000000000000200 RDI: ffffffffc0d0ab62
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207360] RBP: ffffab6e41e18c48 R08: ffff8a5c18528b68 R09: ffff8a5c18528b90
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207361] R10: 0000000000000000 R11: ffff8a5bfd7e0060 R12: 0000000000000001
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207362] R13: ffff8a5c2380c858 R14: 0000000000000000 R15: ffff8a5b7ebee010
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207364] FS:  0000000000000000(0000) GS:ffff8a5c3f080000(0000) knlGS:0000000000000000
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207365] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207366] CR2: 000055e261248048 CR3: 000000bd5ca66000 CR4: 0000000000340ee0
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207367] Call Trace:
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207368]  <IRQ>
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207374]  _raw_spin_unlock_bh+0x1e/0x20
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207389]  update_peer_seq+0x62/0x80 [drbd]
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207401]  got_BlockAck+0x63/0x2c0 [drbd]
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207412]  drbd_control_data_ready+0x8b/0x1e0 [drbd]
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207417]  dtr_control_data_ready.isra.0+0x9b/0x100 [drbd_transport_rdma]
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207420]  dtr_rx_cq_event_handler+0x2a9/0x640 [drbd_transport_rdma]
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207432]  mlx5_ib_cq_comp+0x26/0x30 [mlx5_ib]
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207457]  mlx5_eq_comp_int+0xa1/0x180 [mlx5_core]
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207461]  notifier_call_chain+0x55/0x80
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207464]  atomic_notifier_call_chain+0x1a/0x20
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207494]  mlx5_irq_int_handler+0x15/0x20 [mlx5_core]
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207496]  __handle_irq_event_percpu+0x42/0x180
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207499]  handle_irq_event_percpu+0x33/0x80
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207501]  handle_irq_event+0x3b/0x60
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207503]  handle_edge_irq+0x93/0x1c0
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207505]  do_IRQ+0x55/0xf0
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207507]  common_interrupt+0xf/0xf
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207508]  </IRQ>
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207511] RIP: 0010:cpuidle_enter_state+0xc5/0x450
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207513] Code: ff e8 1f f7 83 ff 80 7d c7 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 65 03 00 00 31 ff e8 92 0c 8a ff fb 66 0f 1f 44 00 00 <45> 85 ed 0f 88 8f 02 00 00 49 63 cd 4c 8b 7d d0 4c 2b 7d c8 48 8d
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207514] RSP: 0018:ffffab6e4068fe38 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffd9
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207516] RAX: ffff8a5c3f0afe80 RBX: ffffffff83769620 RCX: 000000000000001f
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207516] RDX: 0000000000000000 RSI: 0000000038feef44 RDI: 0000000000000000
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207517] RBP: ffffab6e4068fe78 R08: 0000003c958fd955 R09: 0000003cc5b70acd
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207518] R10: ffff8a5c3f0aeb80 R11: ffff8a5c3f0aeb60 R12: ffff8a5bfea05800
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207519] R13: 0000000000000002 R14: 0000000000000002 R15: ffff8a5bfea05800
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207523]  ? cpuidle_enter_state+0xa1/0x450
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207525]  cpuidle_enter+0x2e/0x40
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207527]  call_cpuidle+0x23/0x40
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207529]  do_idle+0x1dd/0x270
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207532]  cpu_startup_entry+0x20/0x30
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207534]  start_secondary+0x167/0x1c0
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207537]  secondary_startup_64+0xa4/0xb0
Jun 12 15:24:44 os-hv-am-4 kernel: [  260.207539] ---[ end trace 4c646c84e9801deb ]---
Jun 12 15:25:44 os-hv-am-4 kernel: [  320.872034] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: Preparing remote state change 1094725292
Jun 12 15:25:44 os-hv-am-4 kernel: [  320.872097] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: Committing remote state change 1094725292 (primary_nodes=1)
Jun 12 15:25:44 os-hv-am-4 kernel: [  320.872103] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: conn( Connected -> TearDown ) peer( Secondary -> Unknown )
Jun 12 15:25:44 os-hv-am-4 kernel: [  320.872105] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9/0 drbd8 os-hv-am-3: repl( SyncSource -> Off )
Jun 12 15:25:44 os-hv-am-4 kernel: [  320.872185] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9/0 drbd8 os-hv-am-3: Sending resync reply failed
Jun 12 15:25:44 os-hv-am-4 kernel: [  320.872415] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: Terminating sender thread
Jun 12 15:25:44 os-hv-am-4 kernel: [  320.872992] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: Starting sender thread (from drbd_r_vm-57da0 [3539])
Jun 12 15:25:45 os-hv-am-4 kernel: [  321.875133] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 rdma:os-hv-am-3: WARN: not properly disconnected, state = 6
Jun 12 15:25:45 os-hv-am-4 kernel: [  321.875779] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: pp_use = 2609, expected 0
Jun 12 15:25:45 os-hv-am-4 kernel: [  321.875781] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: Connection closed
Jun 12 15:25:45 os-hv-am-4 kernel: [  321.875793] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: helper command: /sbin/drbdadm disconnected
Jun 12 15:25:45 os-hv-am-4 kernel: [  321.877358] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: helper command: /sbin/drbdadm disconnected exit code 0
Jun 12 15:25:45 os-hv-am-4 kernel: [  321.877372] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: conn( TearDown -> Unconnected )
Jun 12 15:25:45 os-hv-am-4 kernel: [  321.877382] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: Restarting receiver thread
Jun 12 15:25:45 os-hv-am-4 kernel: [  321.877386] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: conn( Unconnected -> Connecting )
Jun 12 15:25:52 os-hv-am-4 kernel: [  328.185970] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: conn( Connecting -> Disconnecting )
Jun 12 15:25:52 os-hv-am-4 kernel: [  328.186035] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: Failed to initiate connection, err=-512
Jun 12 15:25:52 os-hv-am-4 kernel: [  328.186062] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: Terminating sender thread
Jun 12 15:25:52 os-hv-am-4 kernel: [  328.186066] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: Starting sender thread (from drbd_r_vm-57da0 [3539])
Jun 12 15:25:52 os-hv-am-4 kernel: [  328.186473] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: pp_use = 2610, expected 0
Jun 12 15:25:52 os-hv-am-4 kernel: [  328.186475] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: Connection closed
Jun 12 15:25:52 os-hv-am-4 kernel: [  328.186484] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: helper command: /sbin/drbdadm disconnected
Jun 12 15:25:52 os-hv-am-4 kernel: [  328.188149] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: helper command: /sbin/drbdadm disconnected exit code 0
Jun 12 15:25:52 os-hv-am-4 kernel: [  328.188176] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: conn( Disconnecting -> StandAlone )
Jun 12 15:25:52 os-hv-am-4 kernel: [  328.188180] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: Terminating receiver thread
Jun 12 15:26:02 os-hv-am-4 kernel: [  338.882721] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9: Preparing cluster-wide state change 2022799093 (0->-1 3/2)
Jun 12 15:26:02 os-hv-am-4 kernel: [  338.882724] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9: Committing cluster-wide state change 2022799093 (0ms)
Jun 12 15:26:02 os-hv-am-4 kernel: [  338.882728] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9: role( Primary -> Secondary )
Jun 12 15:26:02 os-hv-am-4 kernel: [  338.883097] drbd vm-57da08a3-2938-4769-8959-3a11616d6ef9 os-hv-am-3: Terminating sender thread
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615298] INFO: task drbdsetup:3646 blocked for more than 120 seconds.
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615345]       Tainted: G        W  OE     5.4.0-139-generic #156-Ubuntu
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615382] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615424] drbdsetup       D    0  3646      1 0x00004004
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615427] Call Trace:
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615438]  __schedule+0x2e3/0x740
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615441]  schedule+0x42/0xb0
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615450]  dtr_free+0x315/0x350 [drbd_transport_rdma]
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615455]  ? __wake_up_pollfree+0x40/0x40
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615473]  drbd_transport_shutdown+0x64/0xa0 [drbd]
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615487]  drbd_unregister_connection+0x12e/0x220 [drbd]
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615501]  del_connection+0x95/0x150 [drbd]
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615503]  ? _cond_resched+0x19/0x30
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615515]  drbd_adm_down+0xee/0x350 [drbd]
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615518]  ? remove_wait_queue+0x47/0x50
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615523]  genl_family_rcv_msg+0x1b9/0x470
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615526]  genl_rcv_msg+0x4c/0xa0
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615528]  ? _cond_resched+0x19/0x30
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615529]  ? genl_family_rcv_msg+0x470/0x470
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615531]  netlink_rcv_skb+0x50/0x120
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615533]  genl_rcv+0x29/0x40
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615535]  netlink_unicast+0x1a8/0x250
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615537]  netlink_sendmsg+0x240/0x480
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615540]  ? mem_cgroup_commit_charge+0x63/0x490
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615545]  sock_sendmsg+0x65/0x70
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615547]  sock_write_iter+0x93/0xf0
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615551]  new_sync_write+0x125/0x1c0
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615554]  __vfs_write+0x29/0x40
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615556]  vfs_write+0xb9/0x1a0
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615558]  ksys_write+0x67/0xe0
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615560]  __x64_sys_write+0x1a/0x20
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615564]  do_syscall_64+0x57/0x190
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615566]  entry_SYSCALL_64_after_hwframe+0x5c/0xc1
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615569] RIP: 0033:0x7f08fa258077
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615575] Code: Bad RIP value.
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615576] RSP: 002b:00007ffd3fb2c108 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615579] RAX: ffffffffffffffda RBX: 000000000000004c RCX: 00007f08fa258077
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615580] RDX: 000000000000004c RSI: 0000564099f3dd40 RDI: 0000000000000004
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615581] RBP: 0000564099f3dd40 R08: 00007ffd3fb2c184 R09: 0000000000000000
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615582] R10: 0000564099f3b010 R11: 0000000000000246 R12: 000000000000004c
Jun 12 15:28:33 os-hv-am-4 kernel: [  489.615583] R13: 0000000000000004 R14: 00000000ffffffff R15: 0000000000000000
Jun 12 15:28:43 os-hv-am-4 kernel: [  499.911721] sysrq: Emergency Sync
Jun 12 15:28:43 os-hv-am-4 kernel: [  499.913650] Emergency Sync complete
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443007] INFO: task drbdsetup:3646 blocked for more than 241 seconds.
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443056]       Tainted: G        W  OE     5.4.0-139-generic #156-Ubuntu
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443093] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443134] drbdsetup       D    0  3646      1 0x00004004
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443137] Call Trace:
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443147]  __schedule+0x2e3/0x740
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443150]  schedule+0x42/0xb0
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443159]  dtr_free+0x315/0x350 [drbd_transport_rdma]
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443164]  ? __wake_up_pollfree+0x40/0x40
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443182]  drbd_transport_shutdown+0x64/0xa0 [drbd]
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443196]  drbd_unregister_connection+0x12e/0x220 [drbd]
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443209]  del_connection+0x95/0x150 [drbd]
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443212]  ? _cond_resched+0x19/0x30
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443224]  drbd_adm_down+0xee/0x350 [drbd]
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443227]  ? remove_wait_queue+0x47/0x50
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443232]  genl_family_rcv_msg+0x1b9/0x470
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443235]  genl_rcv_msg+0x4c/0xa0
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443237]  ? _cond_resched+0x19/0x30
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443238]  ? genl_family_rcv_msg+0x470/0x470
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443240]  netlink_rcv_skb+0x50/0x120
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443242]  genl_rcv+0x29/0x40
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443244]  netlink_unicast+0x1a8/0x250
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443246]  netlink_sendmsg+0x240/0x480
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443249]  ? mem_cgroup_commit_charge+0x63/0x490
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443254]  sock_sendmsg+0x65/0x70
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443256]  sock_write_iter+0x93/0xf0
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443260]  new_sync_write+0x125/0x1c0
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443263]  __vfs_write+0x29/0x40
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443265]  vfs_write+0xb9/0x1a0
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443267]  ksys_write+0x67/0xe0
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443269]  __x64_sys_write+0x1a/0x20
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443272]  do_syscall_64+0x57/0x190
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443275]  entry_SYSCALL_64_after_hwframe+0x5c/0xc1
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443277] RIP: 0033:0x7f08fa258077
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443285] Code: Bad RIP value.
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443287] RSP: 002b:00007ffd3fb2c108 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443289] RAX: ffffffffffffffda RBX: 000000000000004c RCX: 00007f08fa258077
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443290] RDX: 000000000000004c RSI: 0000564099f3dd40 RDI: 0000000000000004
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443291] RBP: 0000564099f3dd40 R08: 00007ffd3fb2c184 R09: 0000000000000000
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443292] R10: 0000564099f3b010 R11: 0000000000000246 R12: 000000000000004c
Jun 12 15:30:34 os-hv-am-4 kernel: [  610.443293] R13: 0000000000000004 R14: 00000000ffffffff R15: 0000000000000000
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267490] INFO: task drbdsetup:3646 blocked for more than 362 seconds.
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267539]       Tainted: G        W  OE     5.4.0-139-generic #156-Ubuntu
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267576] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267617] drbdsetup       D    0  3646      1 0x00004004
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267621] Call Trace:
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267632]  __schedule+0x2e3/0x740
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267635]  schedule+0x42/0xb0
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267644]  dtr_free+0x315/0x350 [drbd_transport_rdma]
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267648]  ? __wake_up_pollfree+0x40/0x40
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267667]  drbd_transport_shutdown+0x64/0xa0 [drbd]
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267680]  drbd_unregister_connection+0x12e/0x220 [drbd]
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267694]  del_connection+0x95/0x150 [drbd]
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267697]  ? _cond_resched+0x19/0x30
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267710]  drbd_adm_down+0xee/0x350 [drbd]
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267712]  ? remove_wait_queue+0x47/0x50
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267717]  genl_family_rcv_msg+0x1b9/0x470
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267720]  genl_rcv_msg+0x4c/0xa0
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267722]  ? _cond_resched+0x19/0x30
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267724]  ? genl_family_rcv_msg+0x470/0x470
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267725]  netlink_rcv_skb+0x50/0x120
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267728]  genl_rcv+0x29/0x40
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267729]  netlink_unicast+0x1a8/0x250
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267731]  netlink_sendmsg+0x240/0x480
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267735]  ? mem_cgroup_commit_charge+0x63/0x490
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267739]  sock_sendmsg+0x65/0x70
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267742]  sock_write_iter+0x93/0xf0
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267746]  new_sync_write+0x125/0x1c0
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267749]  __vfs_write+0x29/0x40
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267750]  vfs_write+0xb9/0x1a0
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267752]  ksys_write+0x67/0xe0
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267754]  __x64_sys_write+0x1a/0x20
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267757]  do_syscall_64+0x57/0x190
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267760]  entry_SYSCALL_64_after_hwframe+0x5c/0xc1
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267763] RIP: 0033:0x7f08fa258077
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267767] Code: Bad RIP value.
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267769] RSP: 002b:00007ffd3fb2c108 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267771] RAX: ffffffffffffffda RBX: 000000000000004c RCX: 00007f08fa258077
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267772] RDX: 000000000000004c RSI: 0000564099f3dd40 RDI: 0000000000000004
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267773] RBP: 0000564099f3dd40 R08: 00007ffd3fb2c184 R09: 0000000000000000
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267774] R10: 0000564099f3b010 R11: 0000000000000246 R12: 000000000000004c
Jun 12 15:32:35 os-hv-am-4 kernel: [  731.267775] R13: 0000000000000004 R14: 00000000ffffffff R15: 0000000000000000
Jun 12 15:32:45 os-hv-am-4 kernel: [  741.203227] SGI XFS with ACLs, security attributes, realtime, no debug enabled
Jun 12 15:32:45 os-hv-am-4 kernel: [  741.212617] JFS: nTxBlock = 8192, nTxLock = 65536
Jun 12 15:32:45 os-hv-am-4 kernel: [  741.233034] ntfs: driver 2.1.32 [Flags: R/O MODULE].
Jun 12 15:32:45 os-hv-am-4 kernel: [  741.256163] QNX4 filesystem 0.2.3 registered.
Jun 12 15:32:45 os-hv-am-4 kernel: [  741.294546] Btrfs loaded, crc32c=crc32c-intel
Jun 12 15:32:46 os-hv-am-4 kernel: [  742.073017] device-mapper: table: 253:4: linear: Device lookup failed
Jun 12 15:32:46 os-hv-am-4 kernel: [  742.073492] device-mapper: ioctl: error adding target to table
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093603] INFO: task drbdsetup:3646 blocked for more than 483 seconds.
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093654]       Tainted: G        W  OE     5.4.0-139-generic #156-Ubuntu
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093731] drbdsetup       D    0  3646      1 0x00004004
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093735] Call Trace:
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093748]  __schedule+0x2e3/0x740
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093751]  schedule+0x42/0xb0
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093761]  dtr_free+0x315/0x350 [drbd_transport_rdma]
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093767]  ? __wake_up_pollfree+0x40/0x40
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093815]  drbd_transport_shutdown+0x64/0xa0 [drbd]
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093829]  drbd_unregister_connection+0x12e/0x220 [drbd]
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093843]  del_connection+0x95/0x150 [drbd]
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093845]  ? _cond_resched+0x19/0x30
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093858]  drbd_adm_down+0xee/0x350 [drbd]
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093860]  ? remove_wait_queue+0x47/0x50
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093865]  genl_family_rcv_msg+0x1b9/0x470
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093868]  genl_rcv_msg+0x4c/0xa0
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093870]  ? _cond_resched+0x19/0x30
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093872]  ? genl_family_rcv_msg+0x470/0x470
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093874]  netlink_rcv_skb+0x50/0x120
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093876]  genl_rcv+0x29/0x40
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093878]  netlink_unicast+0x1a8/0x250
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093880]  netlink_sendmsg+0x240/0x480
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093883]  ? mem_cgroup_commit_charge+0x63/0x490
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093888]  sock_sendmsg+0x65/0x70
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093891]  sock_write_iter+0x93/0xf0
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093894]  new_sync_write+0x125/0x1c0
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093897]  __vfs_write+0x29/0x40
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093899]  vfs_write+0xb9/0x1a0
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093901]  ksys_write+0x67/0xe0
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093903]  __x64_sys_write+0x1a/0x20
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093907]  do_syscall_64+0x57/0x190
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093910]  entry_SYSCALL_64_after_hwframe+0x5c/0xc1
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093913] RIP: 0033:0x7f08fa258077
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093918] Code: Bad RIP value.
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093920] RSP: 002b:00007ffd3fb2c108 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093922] RAX: ffffffffffffffda RBX: 000000000000004c RCX: 00007f08fa258077
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093923] RDX: 000000000000004c RSI: 0000564099f3dd40 RDI: 0000000000000004
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093924] RBP: 0000564099f3dd40 R08: 00007ffd3fb2c184 R09: 0000000000000000
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093925] R10: 0000564099f3b010 R11: 0000000000000246 R12: 000000000000004c
Jun 12 15:34:36 os-hv-am-4 kernel: [  852.093926] R13: 0000000000000004 R14: 00000000ffffffff R15: 0000000000000000

drbd version: 9.2.2

to reproduce

setup rdma synced disk. setup first node, setup second node. connect them, disconnect them, try to shut any of the two down. They will be stuck forever and only a hard reboot will release this.

wzrdtales commented 1 year ago

edit just up and down is enough to get stuck. so this is a problem with rdma in general as is it seems

wzrdtales commented 1 year ago

tested the other releases, this bug exists since the first release 9.2.0 and was never fixed also not in 9.2.4

wzrdtales commented 1 year ago

I test wise decremented the counter before the event

https://github.com/LINBIT/drbd/blob/460cfc1025ba5abb63ac9ed2895d6cec178bb39c/drbd/drbd_transport_rdma.c#L572

This fixes this issue, but wasn't able yet to debug why this decrement is never called. Indeed it really never does so, we had a server sitting for a week in that state.

wzrdtales commented 1 year ago

maybe @rck can help out from here?

wzrdtales commented 1 year ago

when a connection is made it will again get stuck in the same position, with higher cm_counts. When a connection is cut unexpectedly (we simulated a crash), we end up with the same situation. Those things are not being freed anymore, it is just stuck.

wzrdtales commented 1 year ago

further debugging shows, there are deeper problems with the rdma driver. The whole system will eventually lock up after some while blocking any further file reads.