Open wzrdtales opened 1 year ago
Edit: just up and down is enough to get stuck, so this seems to be a problem with RDMA in general.
Tested the other releases: this bug has existed since the first release, 9.2.0, and was never fixed, not even in 9.2.4.
As a test, I decremented the counter before the event.
This fixes the issue, but I have not yet been able to debug why this decrement is never called. It really never is: we had a server sitting in that state for a week.
maybe @rck can help out from here?
When a connection is made, it gets stuck again in the same position, with higher cm_counts. When a connection is cut unexpectedly (we simulated a crash), we end up in the same situation. Those things are never freed; it just stays stuck.
Further debugging shows there are deeper problems with the RDMA driver. The whole system eventually locks up after a while, blocking any further file reads.
update:
https://github.com/LINBIT/drbd/blob/460cfc1025ba5abb63ac9ed2895d6cec178bb39c/drbd/drbd_transport_rdma.c#L572
This is the line where it gets stuck.
drbd with rdma gets stuck when disconnecting a resource in sync
here are the logs we could retrieve:
drbd version: 9.2.2
To reproduce:
Set up an RDMA-synced disk. Set up the first node, then the second node. Connect them, disconnect them, then try to shut either of the two down. They will be stuck forever, and only a hard reboot releases them.
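
For anyone trying to follow along, the steps above could look roughly like the sketch below. The resource name `r0`, the `RES` and `RUN` variables, and the mapping of each step to a `drbdadm` subcommand are assumptions of mine, not taken from the actual setup. The script only prints the commands unless `RUN=1` is set, since the last step is the one reported to hang.

```shell
#!/bin/sh
# Hypothetical resource name; replace with the RDMA-backed resource from your config.
RES="${RES:-r0}"

# Dry run by default: commands are only printed. Set RUN=1 to really execute them.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

repro_steps() {
    run drbdadm up "$RES"          # bring the resource up (do this on both nodes)
    run drbdadm connect "$RES"     # connect the peers (typically already initiated by "up")
    run drbdadm disconnect "$RES"  # disconnect again
    run drbdadm down "$RES"        # the shutdown step that is reported to get stuck
}

repro_steps
```

Running it without `RUN=1` just lists the four commands in order, so the sequence can be reviewed before trying it on a test machine.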