elastio / elastio-snap

kernel module for taking block-level snapshots and incremental backups of Linux block devices
GNU General Public License v2.0
21 stars 6 forks source link

Errors in `dmesg` like `Objects remaining in bio-456 on __kmem_cache_shutdown()` on kernels 5.17+ #213

Closed e-kov closed 1 year ago

e-kov commented 1 year ago

The issue has been observed on PopOS and Fedora 35-37 with the kernels starting from 5.17. And it's not reproducible on Ubuntu 22.04 with kernel 5.15. It's reproducible by our tests in case if the tracked device being dismounted. It can be reproduced in our tests by sudo ./elio-test.sh -t test_destroy or sudo ./elio-test.sh -t test_reload.

Dmesg output looks like this:


[1251750.335749] EXT4-fs (loop0): unmounting filesystem.
[1251750.338381] elastio-snap: post umount check succeeded
[1251750.338383] elastio-snap: umount returned: 0
[1251750.338756] elastio-snap: ioctl command received: 1074020612
[1251750.338760] elastio-snap: received destroy ioctl - 20
[1251750.343009] elastio-snap: umount returned: -2
[1251750.842908] elastio-snap: failed wait for all submitted BIOs to be processed after 500 ms. bio submitted = 18, bio processed = 1
[1251750.842914] elastio-snap: replacing make_request_fn
[1251750.842921] elastio-snap: warning: no super found for device 'loop0', unable to freeze it
[1251750.842923] elastio-snap: ending tracing
[1251750.842925] elastio-snap: stopping mrf thread
[1251750.842965] elastio-snap: freeing gendisk
[1251750.856863] elastio-snap: freeing request queue
[1251750.857054] =============================================================================
[1251750.857057] BUG bio-456 (Tainted: G    B   W  OE     ): Objects remaining in bio-456 on __kmem_cache_shutdown()
[1251750.857059] -----------------------------------------------------------------------------

[1251750.857060] Slab 0x00000000828e301c objects=16 used=3 fp=0x00000000507e0ca5 flags=0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
[1251750.857069] CPU: 3 PID: 70477 Comm: python3 Tainted: G    B   W  OE      6.0.14-300.fc37.x86_64 #1
[1251750.857073] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[1251750.857079] Call Trace:
[1251750.857083]  
[1251750.857086]  dump_stack_lvl+0x44/0x5c
[1251750.857095]  slab_err.cold+0x4f/0x63
[1251750.857104]  __kmem_cache_shutdown+0x16f/0x3f0
[1251750.857110]  kmem_cache_destroy+0x51/0x160
[1251750.857115]  bioset_exit+0x143/0x190
[1251750.857122]  tracer_destroy+0x41/0xc0 [elastio_snap]
[1251750.857134]  ctrl_ioctl+0x73c/0x11a0 [elastio_snap]
[1251750.857144]  ? ioctl_has_perm.constprop.0.isra.0+0xca/0x110
[1251750.857396]  ? dequeue_task_stop+0x70/0x70
[1251750.857402]  __x64_sys_ioctl+0x90/0xd0
[1251750.857407]  do_syscall_64+0x5b/0x80
[1251750.857413]  ? __do_sys_wait4+0x8e/0xb0
[1251750.857419]  ? fpregs_restore_userregs+0x53/0xe0
[1251750.857641]  ? exit_to_user_mode_prepare+0x18f/0x1f0
[1251750.857646]  ? syscall_exit_to_user_mode+0x17/0x40
[1251750.857650]  ? do_syscall_64+0x67/0x80
[1251750.857654]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[1251750.857658] RIP: 0033:0x7fe5e0323baf
[1251750.857672] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[1251750.857674] RSP: 002b:00007ffeab578890 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[1251750.857678] RAX: ffffffffffffffda RBX: 00007ffeab578920 RCX: 00007fe5e0323baf
[1251750.857680] RDX: 00007ffeab5788fc RSI: 0000000040044104 RDI: 0000000000000003
[1251750.857682] RBP: 00007ffeab578910 R08: 00007fe500000001 R09: 00007fe5e05d643e
[1251750.857683] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe5dfd25838
[1251750.857685] R13: 0000000000000004 R14: 00007fe5dfb70ffc R15: 00007fe5df5fc330
[1251750.857689]  
[1251750.857693] Object 0x00000000f511777f @offset=2048
[1251750.857695] Object 0x000000004b6927c6 @offset=3072
[1251750.857696] Object 0x0000000010b18b83 @offset=4096

@skiptomy has a strong opinion that it's not a bug because some device has been dismounted before the request queue became empty and our driver is unloaded right after that in tests. And, the main statement, all remaining BIOs are submitted right after the dormant incremental becomes active when device has been mounted back.

I'm not so sure in safety of this behavior. At least we can try implement some logic to wait while queued BIOs are submitted before dismount. Driver controls mount/dismount hooks. And maybe it'll be a proper place to wait there before dismount.

Logs from Fedora 37 with kernel 6.0.14-300.fc37.x86_64 dmesg after sudo ./elio-test.sh -t test_reload: f37_reload.log dmesg after sudo ./elio-test.sh -t test_destroy: f37_destroy.log

e-kov commented 1 year ago

Another dmesg from PopOS 22.04 with kernel 6.0.12 dmesg_popos.log

skypodolsky commented 1 year ago

Closed after the investigation. New epic #219 created.

skypodolsky commented 1 year ago

Reopened as epic #219 has been implemented