elastio / elastio-snap

kernel module for taking block-level snapshots and incremental backups of Linux block devices
GNU General Public License v2.0
21 stars 6 forks source link

An issue with CoW file lock on PopOS 22.04 with kernel 6.0.12-76060006-generic #212

Closed e-kov closed 1 year ago

e-kov commented 1 year ago

There is an issue with the CoW file truncation after the call transition to incremental:


[  693.318921] elastio-snap: ioctl command received: 1074020613
[  693.318927] elastio-snap: received transition inc ioctl - 7
[  693.318929] elastio-snap: allocating device struct
[  693.318931] elastio-snap: initializing tracer
[  693.318933] elastio-snap: creating kernel cow thread
[  693.319053] elastio-snap: getting the base block device's make_request_fn
[  693.319055] elastio-snap: original mrf is already replaced with the tracing_mrf = 000000000f1649ea
[  693.319057] elastio-snap: found already traced device dm-5 with the same original bd_ops. orig mrf = 0000000007ac3a93; orig ops = 000000005947874e; tracing ops = 00000000f799ecd7
[  693.319059] elastio-snap: using already existing tracing ops for device with minor 7
[  693.319073] elastio-snap: freezing 'dm-5'
[  693.608738] elastio-snap: starting tracing
[  693.608751] elastio-snap: thawing 'dm-5'
[  693.644130] elastio-snap: stopping cow thread
[  693.644877] elastio-snap: error performing truncation: -1
[  693.644892] elastio-snap: error truncating file: -1
[  693.644906] elastio-snap: warning: failed to truncate '/tmp/elastio-snap_007/cow.snap', incremental will use more disk space than needed`
[  693.644911] elastio-snap: stopping mrf thread
[  693.644965] elastio-snap: freeing gendisk
[  693.681131] elastio-snap: freeing request queue
[  693.681591] elastio-snap: minor range = 7 - 7

It was reproduced by the tests execution sudo ./elio-test.sh in the PopOS VM.

Tests output doesn't contain any error:

elastio-snap: 7612806
kernel: 6.0.12-76060006-generic
filesystem: ext4
gcc: 11.3.0-1ubuntu1~22.04)
bash: 5.1.16(1)-release
python: Python 3.10.6

test_destroy_active_incremental (test_destroy.TestDestroy) ... ok
test_destroy_active_snapshot (test_destroy.TestDestroy) ... ok
test_destroy_dormant_incremental (test_destroy.TestDestroy) ... ok
test_destroy_dormant_snapshot (test_destroy.TestDestroy) ... ok
test_destroy_nonexistent_device (test_destroy.TestDestroy) ... ok
test_destroy_unverified_incremental (test_destroy.TestDestroy) ... ok
test_destroy_unverified_snapshot (test_destroy.TestDestroy) ... ok
test_multipart_modify_origins (test_multipart.TestMultipart) ... ok
test_multipart_setup_volumes_same_disk (test_multipart.TestMultipart) ... ok
test_multipart_setup_volumes_write_last_after_destroy (test_multipart.TestMultipart) ... ok
test_reload_inc_device_no_exists (test_reload.TestReload) ... ok
test_reload_inc_invalid_minor (test_reload.TestReload) ... ok
test_reload_inc_volume_path_is_dir (test_reload.TestReload) ... ok
test_reload_snap_device_no_exists (test_reload.TestReload) ... ok
test_reload_snap_invalid_minor (test_reload.TestReload) ... ok
test_reload_snap_volume_path_is_dir (test_reload.TestReload) ... ok
test_reload_unverified_incremental (test_reload.TestReload) ... ok
test_reload_unverified_snapshot (test_reload.TestReload) ... ok
test_reload_verified_inc (test_reload.TestReload) ... ok
test_reload_verified_snapshot (test_reload.TestReload) ... ok
test_setup_2_volumes (test_setup.TestSetup) ... ok
test_setup_already_tracked_volume (test_setup.TestSetup) ... ok
test_setup_cow_file_path_is_dir (test_setup.TestSetup) ... ok
test_setup_invalid_minor (test_setup.TestSetup) ... ok
test_setup_readonly_volume (test_setup.TestSetup) ... ok
test_setup_unmounted_volume (test_setup.TestSetup) ... ok
test_setup_volume (test_setup.TestSetup) ... ok
test_setup_volume_path_is_dir (test_setup.TestSetup) ... ok
test_modify_origin (test_snapshot.TestSnapshot) ... ok
test_next_available_minor (test_snapshot.TestSnapshot) ... ok
test_track_writes (test_snapshot.TestSnapshot) ... ok
test_transition_active_incremental (test_transition_incremental.TestTransitionToIncremental) ... ok
test_transition_active_snapshot (test_transition_incremental.TestTransitionToIncremental) ... ok
test_transition_fs_sync_cow_full (test_transition_incremental.TestTransitionToIncremental) ... ok
test_transition_mod_sync_cow_full (test_transition_incremental.TestTransitionToIncremental) ... ok
test_transition_nonexistent_snapshot (test_transition_incremental.TestTransitionToIncremental) ... ok

----------------------------------------------------------------------
Ran 36 tests in 32.618s

OK

dmesg.log

And there are no such errors in dmesg if all occurrences of file_lock(filp);, file_unlock(filp);, file_unlock_mark_dirty(filp); are commented in src/elastio-snap.c.

The issue have seen just on PopOS and I'm not sure, will we see it on other distros. Thus it's not a high priority until it's seen at least on one other distro.

kgermanov commented 1 year ago

Looks like as race condition between update inode and write to inode. Tests are not failed, because this is as warning Also sometimes there are error on writes (if exists writes at transitional moment):

root@user-vm:/home/kgermanov/work# git log -n 1
commit b42b83074fc79a1ab15f2c725192b4581a7edc7e (HEAD -> elastio_master, elastio/master)

root@user-vm:/home/kgermanov/work# uname -a
Linux user-vm 5.15.0-56-generic #62~20.04.1-Ubuntu SMP Tue Nov 22 21:24:20 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

root@user-vm:/home/kgermanov/work# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.1 LTS
Release:        20.04
Codename:       focal

root@user-vm:/home/kgermanov/work# mount -l | grep sda5
/dev/sda5 on / type ext4 (rw,relatime,errors=remount-ro)

root@user-vm:/home/kgermanov/work# elioctl setup-snapshot  -c 300 -f 204 /dev/sda5 /.eliosnap/test_snap1 3
root@user-vm:/home/kgermanov/work# ls -la /.eliosnap/
----------  1 root root  213909504 Jan 12 14:09 test_snap1

root@user-vm:/home/kgermanov/work/agent/bdsnap# elioctl transition-to-incremental 3
root@user-vm:/home/kgermanov/work/agent/bdsnap# ls -la /.eliosnap/
----------  1 root root  145756160 Jan 12 14:09 test_snap1

root@user-vm:/home/kgermanov/work# cat /proc/elastio-snap-info
{
        "version": "0.12.0",
        "devices": [
                {
                        "minor": 3,
                        "cow_file": "/.eliosnap/test_snap1",
                        "block_device": "/dev/sda5",
                        "max_cache": 300,
                        "fallocate": 213909504,
                        "seq_id": 1,
                        "uuid": "7ecc46bd131f4982a8185d330efe5e67",
                        "version": 1,
                        "nr_changed_blocks": 40,
                        "error": -1, //!!!!!
                        "state": 2,
                        "ignore_errors": 0
                }
        ]
}
root@user-vm:/home/kgermanov/work# dmesg -c

[2514916.618240] elastio-snap: ioctl command received: 1074020613
[2514916.618244] elastio-snap: received transition inc ioctl - 3
[2514916.618247] elastio-snap: allocating device struct
[2514916.618248] elastio-snap: initializing tracer
[2514916.618249] elastio-snap: creating kernel cow thread
[2514916.618390] elastio-snap: getting the base block device's make_request_fn
[2514916.618393] elastio-snap: original mrf is already replaced with the tracing_mrf = 00000000a7dc8bde
[2514916.618397] elastio-snap: found already traced device sda5 with the same original bd_ops. orig mrf = 00000000d8fb8b26; orig ops = 00000000a8d9c0f5; tracing ops = 00000000c19dbeaa
[2514916.618401] elastio-snap: using already existing tracing ops for device with minor 3
[2514916.618407] elastio-snap: freezing 'sda5'
[2514916.874849] elastio-snap: starting tracing
[2514916.874853] elastio-snap: thawing 'sda5'
[2514916.876026] elastio-snap: stopping cow thread
[2514916.878064] elastio-snap: error performing truncation: -1
[2514916.878063] elastio-snap: error performing file 'write': 4096, 32768: -1
[2514916.878071] elastio-snap: error writing cow manager section to file: -1
[2514916.878073] elastio-snap: error writing cow manager section 0 to file: -1
[2514916.878076] elastio-snap: error cleaning cow manager mappings: -1
[2514916.878078] elastio-snap: error writing cow mapping: -1
[2514916.878080] elastio-snap: error handling sset: -1
[2514916.878082] elastio-snap: error handling sector set in kernel thread: -1
[2514916.878094] elastio-snap: error truncating file: -1
[2514916.878105] elastio-snap: warning: failed to truncate '/.eliosnap/test_snap1', incremental will use more disk space than needed
[2514916.878109] elastio-snap: stopping mrf thread
[2514916.878164] elastio-snap: freeing gendisk
[2514916.897196] elastio-snap: freeing request queue
[2514916.897358] elastio-snap: minor range = 3 - 3

Interesting, that truncate realy executed, but error return in notify_change I tried wrap file_unlock/file_lock this place: https://github.com/elastio/elastio-snap/blob/master/src/elastio-snap.c#L1827 But without success