elastio / elastio-snap

kernel module for taking block-level snapshots and incremental backups of Linux block devices
GNU General Public License v2.0

Implement the support of direct device IO operations #219

Closed skypodolsky closed 1 year ago

skypodolsky commented 1 year ago

A potentially dangerous scenario has been found in the dormant snapshot functionality.

Some background. While the driver is operating and a device is being snapshotted, it is legitimate to unmount that device to run maintenance or diagnostics. The elastio-snap driver handles this case by switching the snapshot device to the so-called 'dormant' (unmounted) state, in which no I/O is traced and hence no COW takes place. To do this, the driver sets up mount/umount hooks: when the device is unmounted, the driver's handler runs first, and the original system umount function is called afterward. Once the snapshot is dormant, no COW thread is running, so no new bio requests can be processed. The device then waits to be mounted again so that block IO tracing can restart.
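The hook ordering described above can be sketched in userspace as a plain wrapper. This is a toy model in Python, not driver code: `make_umount_hook`, `driver_handler`, and `original_umount` are hypothetical names chosen only for illustration, and the real driver hooks the kernel's umount path rather than a Python function.

```python
def make_umount_hook(driver_handler, original_umount):
    """Return a replacement umount that runs the driver's handler first."""
    def hooked_umount(device):
        driver_handler(device)          # e.g. switch the snapshot to dormant
        return original_umount(device)  # then let the original umount proceed
    return hooked_umount


calls = []  # records the order in which the two steps run


def driver_handler(device):
    # Hypothetical stand-in for the driver's umount handler.
    calls.append(("driver_handler", device))


def original_umount(device):
    # Hypothetical stand-in for the system umount; returns 0 for success.
    calls.append(("original_umount", device))
    return 0


hooked = make_umount_hook(driver_handler, original_umount)
result = hooked("/dev/sdb1")
```

The point of the sketch is only the ordering: the handler has already stopped tracing by the time the original umount starts issuing its own writes.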

It has been verified, however, that additional bio requests are generated while the device is being unmounted, due to sync and journaling operations. The dormant snapshot state assumes that no operations reach the block device, so this creates a conflict: some of the bio requests cannot be traced and backed up by the driver. Moreover, some of them cannot be processed by elastio-snap at all, because the COW thread is stopped. Normally these bios wait for the device to be mounted again, but if the system is shut down in that state, some data is lost irreversibly.
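The failure mode can be illustrated with a small simulation. This is a deliberately simplified model, assuming that bios arriving while the COW thread is stopped simply queue up until remount; `ToySnapshot` and its methods are hypothetical and do not correspond to the driver's actual data structures.

```python
from dataclasses import dataclass, field


@dataclass
class ToySnapshot:
    """Toy model of a snapshot device; not the elastio-snap structures."""
    cow_thread_running: bool = True
    processed: list = field(default_factory=list)  # bios that were serviced
    pending: list = field(default_factory=list)    # bios waiting for a remount

    def submit_bio(self, sector):
        # A bio is only serviced while the COW thread is alive.
        if self.cow_thread_running:
            self.processed.append(sector)
        else:
            self.pending.append(sector)

    def go_dormant(self):
        # The umount hook stops the COW thread before the real umount runs.
        self.cow_thread_running = False

    def remount(self):
        # Remounting restarts the thread and drains whatever was waiting.
        self.cow_thread_running = True
        self.processed.extend(self.pending)
        self.pending.clear()

    def shutdown(self):
        # Shutting down while dormant drops everything still pending.
        lost = list(self.pending)
        self.pending.clear()
        return lost


def simulate(end_with):
    snap = ToySnapshot()
    snap.submit_bio(100)          # normal write while mounted
    snap.go_dormant()             # driver handler runs first
    for sector in (200, 201):     # sync/journal writes issued by the umount
        snap.submit_bio(sector)
    if end_with == "remount":
        snap.remount()
        return snap.processed, []
    return snap.processed, snap.shutdown()


remount_result = simulate("remount")
shutdown_result = simulate("shutdown")
```

In this model the remount path eventually services all three writes, while the shutdown path services only the first one and loses the two issued during the umount.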

Several approaches to this problem have been tried, but none has proven effective, because the underlying problem remains the same: we kill the COW thread before ALL bio requests related to the device unmount have been processed. If we instead stop the COW thread after the umount is handled, the COW file is no longer accessible (the filesystem holding it is simply not mounted), so no tracing is possible anyway.

The system shutdown use case is critically important for the elastio-snap driver, which must ensure file system stability and data integrity. Hence, a deep investigation has been performed in order to propose a solution.

Based on that investigation, the proposed design includes the following:

This sounds like a hell of a job and a major rework of some pieces of the driver, which is why this epic exists. But it needs to be done to ensure data integrity, as that is the most important aspect we must focus on.

e-kov commented 1 year ago

Some work remains on reading the CoW file from userspace when getting the list of changed blocks.

skypodolsky commented 1 year ago

Closed as fix #252 has been merged.