elastio / elastio-snap

kernel module for taking block-level snapshots and incremental backups of Linux block devices
GNU General Public License v2.0

Fix kernel panic on snapshot destroy for a partition #160

Closed e-kov closed 2 years ago

e-kov commented 2 years ago

Fixed kernel panic on snapshot destroy for a partition

The kernel panic occurred when multiple snapshot devices existed for multiple partitions of the same disk. The problem affected kernels 5.9+.

On these kernels, the driver replaces the disk's block_device_operations structure with a driver-allocated one that uses the driver's tracing function instead of the original submit_bio. This struct belongs to the disk and is shared by all partitions of the disk.

The issue was a use-after-free: memory was accessed after our tracing struct had been freed.

Now we no longer allocate a new struct when setting up a snapshot for the second or any later partition of the disk, and the struct is freed only when the last snapshot for any partition of the disk has been destroyed.

The driver already behaves the same way with the make_request function in the bio queue on kernel versions before 5.9. It is replaced with the tracking function on the first snapshot setup for any partition of the disk, and the original make_request is restored when the last snapshot device has been destroyed. So this behaviour is now consistent across all Linux kernel versions.
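The lifetime rule described above can be modelled in userspace as a per-disk reference count. This is only an illustrative sketch, not the driver's code: `Disk`, `TracingRegistry`, and the string placeholders for the function pointers are hypothetical names introduced here, and the real driver works on kernel structures in C.

```python
class Disk:
    """Hypothetical stand-in for a disk: one ops table shared by all partitions."""

    def __init__(self, name):
        self.name = name
        # Placeholder for the disk's original operations table.
        self.ops = {"submit_bio": "original_submit_bio"}


class TracingRegistry:
    """Tracks, per disk, how many snapshots currently rely on the tracing ops."""

    def __init__(self):
        self._entries = {}  # disk name -> {"orig_ops": ..., "refs": int}

    def setup_snapshot(self, disk):
        entry = self._entries.get(disk.name)
        if entry is None:
            # First snapshot on this disk: allocate the tracing ops once
            # and remember the original table so it can be restored later.
            entry = {"orig_ops": disk.ops, "refs": 0}
            self._entries[disk.name] = entry
            disk.ops = dict(disk.ops, submit_bio="tracing_submit_bio")
        # Second and later partitions reuse the existing tracing ops.
        entry["refs"] += 1

    def destroy_snapshot(self, disk):
        entry = self._entries[disk.name]
        entry["refs"] -= 1
        if entry["refs"] == 0:
            # Last snapshot on this disk: restore the original ops and
            # drop the tracing table only now.
            disk.ops = entry["orig_ops"]
            del self._entries[disk.name]
```

In this model, destroying the snapshot of one partition while another partition of the same disk still has a snapshot leaves the tracing table in place, which corresponds to the case that previously caused the panic.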

Fixes https://github.com/elastio/elastio-snap/issues/155

Implemented new tests for snaps of partitions of the same disk

The idea is to add a test on a loopback device with 2 or more partitions. This test should reproduce bug #155 as follows: 1) create snapshots for both partitions of the disk (loopback device); 2) destroy the 2nd snapshot device; 3) perform a write to the 2nd partition; 4) wait a bit.

Without the fix, these steps lead to a kernel panic.

Also added 2 other tests: a simple setup test and a test with writes to all partitions.

Fixed tests when running on a machine with '/' on LVM

The test test_setup_2_volumes uses one synthetic device and, as the 2nd device, the root volume of the host machine. This test was failing because an LVM device can have several different names, such as /dev/mapper/ubuntu--vg-ubuntu--lv and /dev/dm-0, which refer to the same device.
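One way to compare two such names is by device identity rather than by path string. The sketch below is an assumption about how a check of this kind could look, not the change made in this PR; `same_device` is a hypothetical helper name.

```python
import os
import stat


def same_device(path_a, path_b):
    """Return True if both paths refer to the same underlying device or file.

    os.stat() follows symlinks, so a /dev/mapper/... symlink and its
    /dev/dm-N target yield the same stat result.
    """
    st_a = os.stat(path_a)
    st_b = os.stat(path_b)
    if stat.S_ISBLK(st_a.st_mode) and stat.S_ISBLK(st_b.st_mode):
        # Two block device nodes are the same device when their
        # major/minor numbers (st_rdev) match, even if the nodes differ.
        return st_a.st_rdev == st_b.st_rdev
    # Fallback for non-block paths: compare filesystem identity.
    return os.path.samestat(st_a, st_b)
```

Comparing `st_rdev` also covers systems where the /dev/mapper entry is a separate device node rather than a symlink.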

e-kov commented 2 years ago

@anelson @vsazhenyuk-softheme since your last review: