LINBIT / windrbd

DRBD driver for windows
GNU General Public License v2.0

drbdadm attach stuck when running detach-attach-waitsync loop on Primary (Windows) with I/O #10

Closed johannesthoma closed 4 years ago

johannesthoma commented 4 years ago

After 26 iterations of detach-attach-waitsync.sh, the attach command hangs. The Primary (Windows 7) is doing I/O (over the Secondary, which is Linux, while detached). Once drbdadm attach hangs, follow-up drbdadm commands (such as drbdadm status) also hang.
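For anyone trying to reproduce this, the loop can be sketched roughly as below. This is not the actual detach-attach-waitsync.sh. The resource name `w0`, the `wait-sync` subcommand, the 60 second timeout and the helper names are all assumptions for illustration. The sketch only shows how a per-step timeout turns a silent hang into a reported iteration number.

```python
import subprocess


def run_step(args, timeout_s, runner=subprocess.run):
    """Run one command; return False if it does not finish within timeout_s."""
    try:
        runner(args, timeout=timeout_s, check=True)
        return True
    except subprocess.TimeoutExpired:
        return False


def detach_attach_loop(resource, iterations, timeout_s=60.0, runner=subprocess.run):
    """Repeat detach / attach / wait-sync.

    Returns (iteration, subcommand) for the first step that timed out,
    or None if all iterations completed. The subcommand names are an
    assumption based on the script name, not taken from the script itself.
    """
    for i in range(1, iterations + 1):
        for sub in ("detach", "attach", "wait-sync"):
            if not run_step(["drbdadm", sub, resource], timeout_s, runner):
                return i, sub
    return None


if __name__ == "__main__":
    # "w0" is a guess at the resource name, taken from the log prefix.
    print(detach_attach_loop("w0", iterations=100))
```

Note that a process stuck inside the driver may not be killable, so the timeout itself may not return cleanly on an affected machine.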

Relevant log lines are:

Oct 15 20:15:27 192.168.56.103  U18:15:27.753|05243c60(netlink) #8789 resolve_nt_kernel_link <6>Symbolic link points to \Device\HarddiskVolume4
Oct 15 20:15:27 192.168.56.103  U18:15:27.769|05243c60(netlink) #8790 find_windows_device IoGetDeviceObjectPointer \Device\HarddiskVolume4 succeeded, targetdev is FFFFFA80036CF2B0
Oct 15 20:15:27 192.168.56.103  U18:15:27.769|05243c60(netlink) #8791 blkdev_get_by_path <7>blkdev_get_by_path succeeded FFFFFA8005C59850 windows_device FFFFFA80036CF2B0.
Oct 15 20:15:27 192.168.56.103  U18:15:27.769|05243c60(netlink) #8792 drbd_md_read <6>drbd w0/17 minor 5, ds(Diskless), dvflag(0x80002): meta-data IO uses: blk-bio
Oct 15 20:15:27 192.168.56.103  U18:15:27.769|05243c60(netlink) #8793 print_state_change <6>drbd w0/17 minor 5, ds(Diskless), dvflag(0x80002): disk( Diskless -> Attaching )
Oct 15 20:15:27 192.168.56.103  [last message was in IRQ context or recursive]
Oct 15 20:15:27 192.168.56.103  U18:15:27.769|05243c60(netlink) #8794 drbd_adm_attach <6>drbd w0/17 minor 5, ds(Attaching), dvflag(0x80002): Maximum number of peer devices = 1
Oct 15 20:15:27 192.168.56.103  U18:15:27.769|05243c60(netlink) #8795 drbd_bm_resize <6>drbd w0/17 minor 5, ds(Attaching), dvflag(0x80006): drbd_bm_resize called with capacity == 102400
Oct 15 20:15:27 192.168.56.103  U18:15:27.769|05243c60(netlink) #8796 drbd_bm_resize <6>drbd w0/17 minor 5, ds(Attaching), dvflag(0x80006): resync bitmap: bits=12800 words=200 pages=1
Oct 15 20:15:27 192.168.56.103  U18:15:27.769|05243c60(netlink) #8797 drbd_set_my_capacity <6>drbd w0/17 minor 5, ds(Attaching), dvflag(0x80006): size = 50 MB (51200 KB)
Oct 15 20:15:27 192.168.56.103  U18:15:27.769|05243c60(netlink) #8798 drbd_set_my_capacity got a valid size, unblocking SCSI capacity requests.
Oct 15 20:15:27 192.168.56.103  U18:15:27.769|05243c60(netlink) #8799 drbd_determine_dev_size <6>drbd w0/17 minor 5, ds(Attaching), dvflag(0x80006): size = 50 MB (51200 KB)
Oct 15 20:15:28 192.168.56.103  U18:15:28.066|0536bee0(devicecontrol) #8800 windrbd_process_netlink_packet drbd cmd(DRBD_ADM_GET_RESOURCES:30)
Oct 15 20:15:33 192.168.56.103  U18:15:33.066|0536bee0(devicecontrol) #8801 windrbd_process_netlink_packet failed to acquire the mutex, probably a previous drbd command is stuck.
Oct 15 20:15:38 192.168.56.103  U18:15:38.207|05c59ae0(devicecontrol) #8802 windrbd_process_netlink_packet drbd cmd(DRBD_ADM_GET_RESOURCES:30)
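
The "failed to acquire the mutex" message at #8801 comes exactly 5 seconds after the command at #8800, which suggests that netlink commands are serialized by a mutex taken with a timeout. A minimal sketch of that pattern follows. The class name and structure are hypothetical and the 5 second value is only inferred from the timestamps; this is not the WinDRBD code.

```python
import threading


class CommandGate:
    """Serialize commands; give up if an earlier command holds the lock too long."""

    def __init__(self, timeout_s=5.0):
        # 5.0 is inferred from the gap between log lines #8800 and #8801.
        self._lock = threading.Lock()
        self._timeout_s = timeout_s

    def process(self, handler):
        # Try to take the lock, but do not wait forever for a stuck command.
        if not self._lock.acquire(timeout=self._timeout_s):
            return ("failed to acquire the mutex, "
                    "probably a previous drbd command is stuck")
        try:
            return handler()
        finally:
            self._lock.release()
```

With this pattern a stuck attach does not block later commands forever, but each of them still fails after the timeout, which matches the observation that drbdadm status appears to hang as well.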
johannesthoma commented 4 years ago

Retrying the test without I/O: no hangs so far (120 iterations). Also no hang so far when Primary and not doing I/O (100 iterations). Maybe something is wrong (again) with starting the sync?

johannesthoma commented 4 years ago

Attach also fails (after 20 iterations or so) when the reason for the detach was a disk fault (simulated with the inject-faults mechanism of WinDRBD).

johannesthoma commented 4 years ago

This is (also) due to a bug in the WinDRBD waitqueue implementation. When the wait wakes up after 30 seconds, drbdadm attach also terminates after 30 seconds. The same bug causes drbdadm down to hang in an up (connect) / down loop.
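
To illustrate the general class of problem (not the actual WinDRBD bug, whose details are not given here), the sketch below models a waitqueue whose waiter sleeps with a timeout. If the wake-up never reaches the waiter, the waiter still makes progress, but only once the timeout expires. `WaitQueue`, `wake_up_broken` and `timed_wait` are hypothetical names, and the timeouts are scaled down from 30 seconds to fractions of a second.

```python
import threading
import time


class WaitQueue:
    """Toy waitqueue built on a condition variable."""

    def __init__(self):
        self._cond = threading.Condition()

    def wait_event_timeout(self, condition, timeout_s):
        """Sleep until condition() is true or the timeout expires."""
        with self._cond:
            return self._cond.wait_for(condition, timeout=timeout_s)

    def wake_up(self):
        """Working variant: wake all sleepers so they re-check the condition."""
        with self._cond:
            self._cond.notify_all()

    def wake_up_broken(self):
        """Hypothetical faulty variant: the wake-up is lost, nobody is woken."""
        pass


def timed_wait(wake, timeout_s=0.5):
    """Return (condition_met, seconds_waited) for a given wake function."""
    wq = WaitQueue()
    state = {"done": False}

    def completer():
        state["done"] = True   # the event the waiter is waiting for
        wake(wq)               # should wake the waiter immediately

    threading.Timer(0.05, completer).start()
    start = time.monotonic()
    ok = wq.wait_event_timeout(lambda: state["done"], timeout_s)
    return ok, time.monotonic() - start


if __name__ == "__main__":
    print(timed_wait(WaitQueue.wake_up))         # returns shortly after the event
    print(timed_wait(WaitQueue.wake_up_broken))  # returns only at the timeout
```

In both cases the condition is eventually seen as true; the difference is only how long the waiter sleeps, which is the kind of symptom a command finishing after a fixed 30 seconds would point to.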

johannesthoma commented 4 years ago

Fixed the wake_up call, which fixes this hang. The patch will be included in the upcoming 1.0.0-rc8 release.

johannesthoma commented 4 years ago

Ran for 70+ iterations now without a hang; closing this issue.

johannesthoma commented 4 years ago

100 iterations.