akiradeveloper / dm-writeboost

Log-structured Caching for Linux
GNU General Public License v2.0
120 stars 18 forks source link

can't mount a disk with writeboost cache because of kernel crash on writeboost module. #235

Open fbhradec opened 2 years ago

fbhradec commented 2 years ago

So, I'me having trouble to mount a ext4 disk with a writeboost cache after a power loss.

The disk and cache are vfio scsi, inside a KVM virtual machine.

once the machine boots, dmwriteboost starts to "mount" the cache device on top of the disk, and a few minutes after I see this in dmesg:

[  168.932006] rcu: INFO: rcu_sched self-detected stall on CPU                                                                                                                                                        
[  168.932352] rcu:     3-....: (5249 ticks this GP) idle=caa/1/0x4000000000000002 softirq=950/950 fqs=2623                                                                                                           
[  168.932691] rcu:      (t=5250 jiffies g=4717 q=6790)                                                                                                                                                                                            
[  168.933036] NMI backtrace for cpu 3                                                                                                                                                                                
[  168.933042] CPU: 3 PID: 315 Comm: dmsetup Tainted: G           OE     4.19.0-18-amd64 #1 Debian 4.19.208-1
[  168.933042] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[  168.933045] Call Trace:                                                                                                                                                                                                                                                                                         
[  168.933055]  <IRQ>                                                                                                                                                                                                                                                                                              
[  168.933062]  dump_stack+0x66/0x81                                                                                                                                                                                                                                                                               
[  168.933063]  nmi_cpu_backtrace.cold.4+0x13/0x50                                                                                                                                                                                                                                                                 
[  168.933068]  ? lapic_can_unplug_cpu+0x80/0x80                                                                                                                                                                                                                                                                   
[  168.933072]  nmi_trigger_cpumask_backtrace+0xf9/0x100                                                                                                                                                                                                                                                           
[  168.933073]  rcu_dump_cpu_stacks+0x9b/0xcb                                                                                                                                                                                                                                                                      
[  168.933074]  rcu_check_callbacks.cold.81+0x1db/0x335                                                                           
[  168.933076]  ? tick_sched_do_timer+0x60/0x60                                                                                                                                                                       
[  168.933078]  update_process_times+0x28/0x60                                                                                                                                                                                                                                                                    
[  168.933080]  tick_sched_handle+0x22/0x60                                                 
[  168.933081]  tick_sched_timer+0x37/0x70                                                                                                                                                                            
[  168.933083]  __hrtimer_run_queues+0x100/0x280                                                                                                                                                                      
[  168.933084]  hrtimer_interrupt+0x100/0x210                                                                                     
[  168.933086]  ? handle_irq_event+0x47/0x60                                                                                                                                                                          
[  168.933087]  smp_apic_timer_interrupt+0x6a/0x140                                                          
[  168.933090]  apic_timer_interrupt+0xf/0x20                                                                                     
[  168.933090]  </IRQ>                                                                                                                                                                                                
[  168.933093] RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20                                                                                                                                                        
[  168.933094] Code: d8 48 3d 90 d0 03 00 76 cc 80 4d 00 08 eb 98 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 e8 66 31 99 ff 66 90 48 89 f7 57 9d <0f> 1f 44 00 00 c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 8b 07                                                                                             
[  168.933095] RSP: 0018:ffffa67cc3acfaf0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13                                                                                                                                                                                                                              
[  168.933096] RAX: 0000000000000001 RBX: ffff89dfbcdf7868 RCX: 000000000001fffe                                                                                                                                                                                                                                   
[  168.933096] RDX: 0000000000000001 RSI: 0000000000000206 RDI: 0000000000000206                                                                                                                                                                                                                                   
[  168.933097] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff89dfbcdf7978                                                                                                                                                                                                                                   
[  168.933097] R10: ffff89dfb8419000 R11: 0000000000000001 R12: ffffa67cdd2eaf78                                                  
[  168.933097] R13: ffffa67cdd2eae60 R14: ffff89dfb8419000 R15: 0000000000000000                                                                                                                                      
[  168.933102]  mark_clean_mb+0x34/0x40 [dm_writeboost]                                                                                                                                                               
[  168.933104]  mark_clean_seg+0x3d/0x60 [dm_writeboost]                                   
[  168.933105]  resume_cache+0x558/0x910 [dm_writeboost]                                                                                                                                                                                                                                                           
[  168.933107]  writeboost_ctr+0x408/0x6a0 [dm_writeboost]                                                                                                                                                                                                                                                         
[  168.933117]  dm_table_add_target+0x17d/0x360 [dm_mod]                                                                          
[  168.933120]  table_load+0x122/0x2e0 [dm_mod]                                                                                                                                                                                                                                                                    
[  168.933123]  ? dev_status+0x40/0x40 [dm_mod]                                                                                                                                                                                                                                                                    
[  168.933125]  ctl_ioctl+0x1af/0x3f0 [dm_mod]                                                                                                                                                                        
[  168.933128]  dm_ctl_ioctl+0xa/0x10 [dm_mod]
[  168.933130]  do_vfs_ioctl+0xa4/0x630
[  168.933134]  ? ksys_semctl+0x129/0x160
[  168.933135]  ksys_ioctl+0x60/0x90
[  168.933136]  __x64_sys_ioctl+0x16/0x20
[  168.933138]  do_syscall_64+0x53/0x110
[  168.933140]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  168.933142] RIP: 0033:0x7fabe6e0d427
[  168.933144] Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48
[  168.933144] RSP: 002b:00007ffe93cb46b8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[  168.933145] RAX: ffffffffffffffda RBX: 00007fabe6eeba7f RCX: 00007fabe6e0d427
[  168.933145] RDX: 00005622fbcced40 RSI: 00000000c138fd09 RDI: 0000000000000003
[  168.933146] RBP: 00007ffe93cb4770 R08: 00007fabe6f3a310 R09: 00007ffe93cb4520
[  168.933146] R10: 00007fabe6f3965a R11: 0000000000000206 R12: 00007fabe6f3965a
[  168.933147] R13: 00007fabe6f3965a R14: 00007fabe6f3965a R15: 00007fabe6f3965a
[  196.244011] watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [dmsetup:315]
[  196.244369] Modules linked in: kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel qxl ttm drm_kms_helper intel_rapl_perf drm evdev joydev virtio_console serio_raw pcspkr sg virtio_balloon qemu_fw_cfg button nfsd auth_rpcgss dm_writeboost(OE) libcrc32c nfs_acl lockd dm_mod grace su
nrpc ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb sr_mod cdrom sd_mod ata_generic virtio_net net_failover failover virtio_scsi crc32c_intel ata_piix libata uhci_hcd ehci_hcd aesni_intel scsi_mod psmouse usbcore aes_x86_64 crypto_simd cryptd glue_helper i2c_piix4 virtio_pci
 virtio_ring virtio usb_common floppy
[  196.244392] CPU: 3 PID: 315 Comm: dmsetup Tainted: G           OE     4.19.0-18-amd64 #1 Debian 4.19.208-1

this message keeps repeating for a while until the vm freezes.

I've tried to mount the disk without writeboost, but it won't mount. Even fsck can't fix it, probably because there's data on the cache that needs to be flushed to disk.

Is there any way to force the cache data to be dumped to disk (hence fixing the disk) in a situation like this?

Do you think upgrading the kernel (currently 4.19.0-18 debian) would help getting writeboost to finish the initialization without crashing?

thanks for any insights...

fbhradec commented 2 years ago

btw, forgot to mention... the process dmsetup create stays running at 100% CPU after the first rcu: INFO: rcu_sched self-detected stall on CPU message, with no disk activity.

akiradeveloper commented 2 years ago

From your information, there could be a contention in getting spinlock. This is not normal because there is no chance of such situation during the initialization.

Q1. Before the power loss, have you experienced any successful reboot?

Q2. Module version?

Q3. Did you retried reboot?

Btw, Done do this.

I've tried to mount the disk without writeboost, but it won't mount. Even fsck can't fix it

there's data on the cache that needs to be flushed to disk.

Remaining persistent cache on the caching device is normal. You shouldn't say "needs to be" because there is no actual need for this.

akiradeveloper commented 2 years ago

I don't think the cache block is broken because Writeboost writes checksum along with the cache data. This will save us from power loss.

struct segment_header_device {
    /*
     * We assume 1 sector write is atomic.
     * This 1 sector region contains important information such as checksum
     * of the rest of the segment data. We use 32bit checksum to audit if
     * the segment is correctly written to the cache device.
     */
    /* - FROM ------------------------------------ */
    __le64 id;
    __le32 checksum;
    /*
     * The number of metablocks in this segment header to be considered in
     * log replay.
     */
    __u8 length;
    __u8 padding[512 - (8 + 4 + 1)]; /* 512B */
    /* - TO -------------------------------------- */
    struct metablock_device mbarr[0]; /* 16B * N */
} __packed;
akiradeveloper commented 2 years ago

Really due to a power failure? This may be due to your system is broken. See https://github.com/akiradeveloper/dm-writeboost/issues/197

fbhradec commented 2 years ago

From your information, there could be a contention in getting spinlock. This is not normal because there is no chance of such situation during the initialization.

Q1. Before the power loss, have you experienced any successful reboot?

Yes, I could reboot without problem. Actually, it wasn't the first power loss I had on that machine, and on all other power losses it was able to mount the drive with the cache without a problem.

Q2. Module version?

the latest git master branch head.

Q3. Did you retried reboot?

Yep. I retried reboot multiple times, without luck.

I've tried to mount the disk without writeboost, but it won't mount. Even fsck can't fix it

Actually, I kept trying fsck the disk without the cache, and at some point it did "fixed" the disk, but all data ended up in the "lost&found" folder. Although it was a bit of a pain, I was able to locate the files that I needed to save, and was able to "reconstruct" the root manually, by moving files out the the lost&found folder.

So... it ended up ok! :)

there's data on the cache that needs to be flushed to disk. Remaining persistent cache on the caching device is normal. You shouldn't say "needs to be" because there is no actual need for this.

what I meant was, to be able to fsck the disk without the cache (since I couldn't mount with the cache anymore, and because of that, I couldn't remove it from the disk), there was missing data on the physical disk that was "residing" in the cache.

I understand it is normal to have persistent data on the cache, but in my situation where the cache couldn't be attached to the physical disk anymore, that data is essentially lost since now the logical block device can't access it, and the physical disk doesn't have it. Hence, fsck accused tons of errors.

Really due to a power failure? This may be due to your system is broken. See https://github.com/akiradeveloper/dm-writeboost/issues/197

well, it happen after I accidentally powered off the wifi smartplug where the machine is plugged to ac. So, I would say it did happen during an unexpected power failure?!?

It may well be faulty memory, indeed, like #197... It's a bit difficult for me to run a memory check, since that machine is in a different continent than I. But at some point I'll ask someone there to run it for me and update it here!

But the one thing I still am puzzled about is the fact I was running dm-writeboost on a Virtual Machine (KVM), and the "physical" disk was a virtio raw file on an ext4 filesystem running on a normal sata HD, and the ssd cache was also a virtio raw file on a ext4 filesystem running on a NVME disk. There maybe be some unknown interaction from the virtio/vfio KVM subsytem and dm-writeboost that could also be the reason? Or maybe data was lost by virtio/vfio when the power loss happened, before it could be delivered to dm-writeboost...

anyhow, I'll hold on using dm-writeboost over virtio/vfio for now...

thanks for the support, and sorry the delay to reply... after I retrieved the data I was so happy and I ended up forgetting to report it here! Won't happen again!