Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0
630 stars 57 forks source link

BEESD fails to notice disk has disappeared. #231

Open krefson opened 1 year ago

krefson commented 1 year ago

I have been having problems with bees since migrating my deduped backup disk to another server, with the log filling up with alarming messages like 2022-10-04T00:00:38.426117+01:00 toaster beesd[5034]: crawl_transid[5157]: *** EXCEPTION *** 2022-10-04T00:00:38.426148+01:00 toaster beesd[5034]: crawl_transid[5157]: #011exception type std::system_error: BTRFS_IOC_TREE_SEARCH_V2: /run/bees/Keiths_Test_Archive at fs.cc:798: Input/output error 2022-10-04T00:00:38.426176+01:00 toaster beesd[5034]: crawl_transid[5157]: *** 2022-10-04T00:00:38.426896+01:00 toaster kernel: [48208.875653][ T5157] BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 4115, rd 162984, flush 0, corrupt 0, gen 0 2022-10-04T00:00:38.426908+01:00 toaster kernel: [48208.875681][ T5157] BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 4115, rd 162985, flush 0, corrupt 0, gen 0 2022-10-04T00:00:38.426909+01:00 toaster kernel: [48208.875831][ T5157] BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 4115, rd 162986, flush 0, corrupt 0, gen 0 2022-10-04T00:00:38.426910+01:00 toaster kernel: [48208.876788][ T5157] BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 4115, rd 162987, flush 0, corrupt 0, gen 0 2022-10-04T00:00:38.426732+01:00 toaster beesd[5034]: crawl_transid[5157]: Next root from 5 2022-10-04T00:00:38.426779+01:00 toaster beesd[5034]: crawl_transid[5157]: next_root(5) 2022-10-04T00:00:38.426808+01:00 toaster beesd[5034]: crawl_transid[5157]: insert_root(BeesCrawlState 5:0 offset 0x0 transid 43678..44691 started 2022-10-04-00-00-38 (0s ago)) 2022-10-04T00:00:38.426836+01:00 toaster beesd[5034]: crawl_transid[5157]: excess_roots.erase(5) 2022-10-04T00:00:38.427084+01:00 toaster beesd[5034]: crawl_transid[5157]: calling insert_new_crawl

Oh no! However of further inspection I spotted that following a system glitch, the /dev/sdc1 device had disappeared and the disk had been re-detected as /dev/sdd1. I'm still attempting to get to the bottom of the glitch. Here's the log extract FWIW

2022-10-01T22:13:02.272274+01:00 localhost kernel: [26479.051901][T27120] scsi 3:0:0:0: rejecting I/O to dead device 2022-10-01T22:13:02.272275+01:00 localhost kernel: [26479.052700][T27120] BTRFS: error (device sdc1) in btrfs_commit_transaction:2341: errno=-5 IO failure (Error while writing out transaction) 2022-10-01T22:13:02.272276+01:00 localhost kernel: [26479.052705][T27120] BTRFS info (device sdc1): forced readonly 2022-10-01T22:13:02.272277+01:00 localhost kernel: [26479.052707][T27120] btrfs_printk: 268 callbacks suppressed 2022-10-01T22:13:02.272277+01:00 localhost kernel: [26479.052708][T27120] BTRFS warning (device sdc1): Skipping commit of aborted transaction. 2022-10-01T22:13:02.272278+01:00 localhost kernel: [26479.052710][T27120] BTRFS: error (device sdc1) in cleanup_transaction:1945: errno=-5 IO failure 2022-10-01T22:13:02.900832+01:00 localhost kernel: [26479.678053][T28236] scsi 4:0:0:0: Direct-Access ST2000LM 007-1R8174 0 PQ: 0 ANSI: 6 2022-10-01T22:13:02.900854+01:00 localhost kernel: [26479.678962][T28236] sd 4:0:0:0: Attached scsi generic sg2 type 0 2022-10-01T22:13:02.900855+01:00 localhost kernel: [26479.679425][T28280] sd 4:0:0:0: [sdd] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB) 2022-10-01T22:13:02.900858+01:00 localhost kernel: [26479.679428][T28280] sd 4:0:0:0: [sdd] 4096-byte physical blocks

However beesd failed to notice that the disk it had been deduping had disappeared and carried on merrily failing for several hours until I stopped it manually. I humbly suggest that this is a bug, and that beesd really ought to notice when the subject of its attentions has vanished!

The disk passes the SMART tests, and "btrfs check" reveals no errors.

`toaster:~ # beesd -v Usage: beesd [options]


bees version 0.7

cat /etc/os-release NAME="openSUSE Leap" VERSION="15.4" toaster:~ # uname -a

Linux toaster 5.14.21-150400.24.21-default #1 SMP PREEMPT_DYNAMIC Wed Sep 7 06:51:18 UTC 2022 (974d0aa) x86_64 x86_64 x86_64 GNU/Linux

toaster:~ # smartctl -a /dev/sdc smartctl 7.2 2021-09-14 r5237 [x86_64-linux-5.14.21-150400.24.21-default] (SUSE RPM) Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION === Model Family: Seagate Mobile HDD Device Model: ST2000LM007-1R8174 Serial Number: WDZA2HEL LU WWN Device Id: 5 000c50 0ac1c9d50 Firmware Version: SBK2 User Capacity: 2,000,398,934,016 bytes [2.00 TB] Sector Sizes: 512 bytes logical, 4096 bytes physical Rotation Rate: 5400 rpm Form Factor: 2.5 inches Device is: In smartctl database [for details use: -P show] ATA Version is: ACS-3 T13/2161-D revision 3b SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s) Local Time is: Tue Oct 4 11:05:15 2022 BST SMART support is: Available - device has SMART capability. SMART support is: Enabled `

Zygo commented 1 year ago

It's a non-trivial problem. Normally udev or similar umounts the filesystem when a device is removed, but bees prevents the filesystem from ever being umounted by holding open file descriptors. Even with lazy umount, the umount is deferred until bees terminates. The service wrapper would have to notice the device removal (and account for all the multi-device cases), terminate bees, then bees releases its open file descriptors, then the filesystem can be umounted.

The subvol searches do fail, but they can also fail during normal operation during read error recovery on raid1[1]. The check for the root subvol 5 doesn't fail during normal operation, but because all the pages are cached in RAM, it also won't fail when a device is dropped. This leaves no reliable way to distinguish between a transient error and total filesystem collapse.

[1] Yes that is a btrfs kernel bug, confirmed on 5.15.68.