datto / dattobd

kernel module for taking block-level snapshots and incremental backups of Linux block devices
GNU General Public License v2.0

loop device detach (losetup -d) hung #356

Open hongyuntw opened 4 months ago

hongyuntw commented 4 months ago

Hi, I've encountered an issue: using losetup -d to detach a loop device hangs. Here are the steps to reproduce:

  1. Create a loop device:
    dd if=/dev/zero of=./x.img count=400 bs=1M
    LOOP_DEVICE=$(losetup --find --show --partscan ./x.img) && echo $LOOP_DEVICE
    mkfs.ext4 -F $LOOP_DEVICE
    mkdir -p /mnt/tests/ && mount $LOOP_DEVICE /mnt/tests/
  2. Set up a snapshot: dbdctl setup-snapshot $LOOP_DEVICE /mnt/tests/.cow 0
  3. Destroy the snapshot: dbdctl destroy 0
  4. Unmount the device: umount /mnt/tests
  5. Detach the loop device (hangs here): losetup -d $LOOP_DEVICE
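
Because step 5 never returns, a script that automates these steps stalls at that point. Below is a minimal sketch of a wrapper that runs a command in the background and reports whether it finished within a deadline. `run_with_deadline` is a hypothetical helper name, not part of dattobd or util-linux, and the exit code 124 for "still running" is an arbitrary choice.

```shell
#!/bin/sh
# run_with_deadline SECONDS CMD [ARGS...]
# Hypothetical helper: runs CMD in the background and polls once per second.
# Returns CMD's exit status if it finishes in time, or 124 if it is still
# running after SECONDS. The command is deliberately not killed: a process
# blocked inside the kernel may not react to signals anyway.
run_with_deadline() {
    deadline=$1
    shift
    "$@" &
    pid=$!
    elapsed=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$deadline" ]; then
            echo "still running after ${deadline}s: $* (pid $pid)" >&2
            return 124
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    # Collect the exit status of the finished command.
    wait "$pid"
}
```

For example, `run_with_deadline 10 losetup -d "$LOOP_DEVICE"` would let a reproduction script report the hang instead of blocking on it.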

I used gdb to debug the kernel and found that the hang occurs while the loop device is being detached. If no one else is using the device, the kernel (loop_clr_fd in loop.c) calls __loop_clr_fd internally. That function then calls blk_mq_freeze_queue, which is where the hang occurs.
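
For anyone who wants to check this without attaching gdb, the blocked process can also be inspected through /proc. This is a rough sketch under a few assumptions: the helper names `proc_state` and `show_kernel_stack` are hypothetical, /proc/PID/stack is only available when the kernel was built with CONFIG_STACKTRACE and usually needs root, and the hung losetup would likely show up in state D (uninterruptible sleep).

```shell
#!/bin/sh
# proc_state PID
# Print the one-letter scheduler state of a process as reported by
# /proc/PID/status (R running, S sleeping, D uninterruptible sleep, ...).
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

# show_kernel_stack PID
# Print the in-kernel call stack of a process. Needs root and a kernel
# built with CONFIG_STACKTRACE; otherwise the file is missing or unreadable.
show_kernel_stack() {
    cat "/proc/$1/stack"
}
```

With the hung command's PID, for instance from `pgrep losetup`, `proc_state "$pid"` shows the state and `show_kernel_stack "$pid"` should list the functions the process is blocked in.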

The hang is caused by abnormal reference count changes in the request queue of the loop device, as the following screenshot shows:

[image]

In the second red box, the value of lo->lo_queue->q_usage_counter->data inexplicably increases from 1 to 22. This is very strange. I repeated the experiment a few times and found that it sometimes increases to over 100. Because blk_mq_freeze_queue waits for this counter to drain to zero, lo->lo_queue can never be frozen.

I suspect this issue might be related to changes in the kernel loop device. Two commits seem particularly relevant, but I am not sure the root cause is related to them: Commit 1, Commit 2.

Additionally, the hang only occurs when we run setup -> destroy -> umount before detaching. If we follow the sequence setup -> destroy -> detach -> umount, or setup -> umount -> detach -> destroy, the losetup -d command does not hang. This is because the loop device is still in use when losetup -d runs (by the mount or by our module), so loop_clr_fd does not call __loop_clr_fd.

This may affect kernel versions 5.16 and above. I have confirmed it on Fedora 34 (5.16.19 / 5.17.12) and Fedora 38 (6.2).
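
A test script could use the version range above to decide whether to expect the hang. The sketch below compares only the numeric major.minor prefix of a kernel release string. `kernel_at_least` is a hypothetical helper, and the 5.16 threshold is only what was observed here, not a confirmed boundary.

```shell
#!/bin/sh
# kernel_at_least MIN [VERSION]
# Returns 0 if VERSION (default: output of `uname -r`) is at least MIN,
# comparing only the numeric major.minor prefix, e.g. "5.17" of "5.17.12".
kernel_at_least() {
    min=$1
    ver=${2:-$(uname -r)}
    min_major=${min%%.*}
    min_rest=${min#*.}
    min_minor=${min_rest%%[!0-9]*}
    ver_major=${ver%%.*}
    ver_rest=${ver#*.}
    ver_minor=${ver_rest%%[!0-9]*}
    [ "$ver_major" -gt "$min_major" ] ||
        { [ "$ver_major" -eq "$min_major" ] && [ "$ver_minor" -ge "$min_minor" ]; }
}
```

For example, `kernel_at_least 5.16 5.17.12` succeeds, while `kernel_at_least 5.16 5.15.0` fails. Calling `kernel_at_least 5.16` with no second argument checks the running kernel.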

However, this error does not seem to affect physical disks, although I am not sure whether it still affects the reference count of the disk's request queue.

Swistusmen commented 4 months ago

Hi man, thanks for raising this issue. We will look into it. Sorry, the whole team currently has other priorities, but that should change soon, and we will get back to this and to your PR.