digint / btrbk

Tool for creating snapshots and remote backups of btrfs subvolumes
https://digint.ch/btrbk/
GNU General Public License v3.0

Question: Checksum error on backup drive, single file corrupted, good file still exists on primary drive #350

Open b0o opened 3 years ago

b0o commented 3 years ago

Hello! Thank you for your work in creating btrbk. It has been instrumental for my team!

I have a question about a scenario I recently ran into:

A routine btrfs scrub revealed a single uncorrectable checksum error on my send-receive target drive. Investigating further, I found the affected file and verified that it is indeed corrupt by running cat $file > /dev/null, which fails with "Input/output error".
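The read test described above can be wrapped in a small helper. This is only a sketch: the function names file_reads_cleanly and list_unreadable_files are made up for illustration, and it assumes that a failed checksum shows up as a read error, as it did here.

```shell
# Return 0 if every byte of the file can be read, non-zero otherwise.
# A full read is the same check as `cat $file > /dev/null` above.
file_reads_cleanly() {
    cat -- "$1" > /dev/null 2>&1
}

# Print every regular file below a directory that cannot be read in full.
# -xdev keeps find on one filesystem.
list_unreadable_files() {
    find "$1" -xdev -type f -exec sh -c '
        for f; do
            cat -- "$f" > /dev/null 2>&1 || printf "%s\n" "$f"
        done
    ' sh {} +
}
```

For example, list_unreadable_files /mnt/backup-01/some-backup (a placeholder path) would print the paths that fail to read inside one backup subvolume.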

The original version of the file still exists, intact and unmodified, on my primary filesystem. The snapshots on my primary filesystem which contain the file are also unaffected. Presumably due to incremental backups, the file remains corrupt even in the latest backups that contain it on the send-receive target.

Would it be possible for btrbk to recognize this scenario and use the good copy to repair/replace the corrupted one automatically? From my understanding, this is what btrfs scrub would do to repair the file if it knew there was a good copy available.

If I run a single non-incremental backup, that should force the good copy to be transferred to the target, correct? I assume that would fix the file going forward, but the older backups would still contain the corrupt copy of the file.

I finally want to say that the file in question is not vital to me in any way, but I would like to figure this out in case it happens to an important file in the future.

My btrbk.conf:

snapshot_preserve_min   1d
snapshot_preserve       10d

target_preserve_min     latest
target_preserve         24h 30d 10w 1m

timestamp_format long-iso

volume /mnt/btrfs-pool
  snapshot_dir @snapshots
  target send-receive /mnt/backup-01

  subvolume @
    snapshot_name @

  subvolume @home
    snapshot_name @home

  # ... more subvolumes

Thank you!

digint commented 3 years ago

It is possible to fix the file by changing the backup subvolume to read/write using btrfs property set, fixing the broken file, and then switching the subvolume back to read-only. While this may work (I know of people successfully doing this), I would not recommend it, as it can completely break subsequent incremental send/receive without notice.
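For reference, the read/write round trip could look roughly like the sketch below. The helper names run and repair_in_backup and all paths are hypothetical, and the same warning applies: this can break later incremental send/receive. It only prints the commands unless DRY_RUN=0 is set. Recent btrfs-progs releases may also refuse to clear the ro flag on a received subvolume unless -f is given; that is not handled here.

```shell
# Run a command, or only print it when DRY_RUN is 1 (the default).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# repair_in_backup BACKUP_SUBVOL GOOD_FILE REL_PATH
# Make the backup writable, overwrite the corrupt file, make it read-only again.
repair_in_backup() {
    rb_backup=$1
    rb_good=$2
    rb_rel=$3
    run btrfs property set -ts "$rb_backup" ro false || return 1
    run cp -- "$rb_good" "$rb_backup/$rb_rel"
    rb_status=$?
    # Restore the read-only flag even if the copy failed.
    run btrfs property set -ts "$rb_backup" ro true || return 1
    return "$rb_status"
}
```

Calling repair_in_backup with the backup subvolume, the good file on the source, and the file's path relative to the subvolume prints the three commands for review first.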

> If I run a single non-incremental backup, that should force the good copy to be transferred to the target, correct?

Correct. This is the safest way, but of course it will also use the most space on your target disk.
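One way to get a single non-incremental run without editing btrbk.conf is to override the incremental option on the command line. A sketch, assuming the installed btrbk supports --override; the wrapper name btrbk_full_run is made up, and it only prints the command unless DRY_RUN=0 is set:

```shell
# btrbk_full_run [BTRBK_ARGS...]
# Builds a btrbk call with incremental send disabled for this run only.
btrbk_full_run() {
    set -- btrbk --override incremental=no "$@"
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}
```

For example, btrbk_full_run run /mnt/btrfs-pool/@home would limit the full transfer to the @home subvolume from the configuration above, which keeps the extra space used on the target down to one subvolume.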

A "middle way" would be to pinpoint the last backup containing no errors. If the corresponding snapshot (the one whose uuid matches the backup's received-uuid) still exists on the source, you can send-receive using it as the parent.
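Matching a source snapshot to a backup by UUID can be scripted around btrfs subvolume show, which prints "UUID:" and "Received UUID:" lines. The helper names subvol_field and is_source_of below are made up, and the sketch assumes that output layout:

```shell
# subvol_field FIELD
# Reads `btrfs subvolume show` output on stdin and prints the value of FIELD.
subvol_field() {
    awk -v key="$1" '
    {
        i = index($0, ":")
        if (i == 0) next
        name = substr($0, 1, i - 1)
        value = substr($0, i + 1)
        gsub(/^[ \t]+|[ \t]+$/, "", name)
        gsub(/^[ \t]+|[ \t]+$/, "", value)
        if (name == key) { print value; exit }
    }'
}

# is_source_of SOURCE_SNAPSHOT BACKUP
# True if the backup's Received UUID equals the source snapshot's UUID.
is_source_of() {
    so_uuid=$(btrfs subvolume show "$1" | subvol_field 'UUID')
    so_received=$(btrfs subvolume show "$2" | subvol_field 'Received UUID')
    [ -n "$so_uuid" ] && [ "$so_uuid" = "$so_received" ]
}
```

Once a matching pair is found, the manual transfer would be along the lines of btrfs send -p <good-parent-snapshot> <new-snapshot> | btrfs receive <target-dir>, with the placeholders filled in for your setup.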

b0o commented 3 years ago

Thanks for that info. Do you think it would make sense to integrate an automatic repair feature into btrbk itself, or is that out of scope of this project?

luxagen commented 2 years ago

IMO this is out of scope as it's a general problem with BTRFS. I've come across this on subvolumes way too large to bother deleting and sending from scratch, and I tend to just rewrite the file on one side and resend.

e.g. in your case, you could:

  1. move the good copy somewhere else on the source disk: mv file good-file
  2. copy it (without reflinking or hardlinking) back to the original location: cp good-file file
  3. delete the moved original: rm good-file
  4. make a fresh snapshot and send it incrementally: btrbk run

The effect of this will be to waste space, but only on the order of the size of the affected file. More importantly, you don't waste hours resending non-incremental snapshots. You could also delay the deletion of good-file above until you're sure it both copied and sent correctly.
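Steps 1 to 3 can be sketched as one function. The name rewrite_file and the .good-copy suffix are made up for illustration. It asks cp for a real data copy with --reflink=never and falls back to plain cp on the assumption that a cp which rejects that option is old enough never to reflink by default.

```shell
# rewrite_file FILE
# Replace FILE with a freshly written copy of itself so that the next
# incremental send carries its data again.
rewrite_file() {
    rf_file=$1
    rf_tmp=$rf_file.good-copy    # hypothetical temporary name
    if [ -e "$rf_tmp" ]; then
        echo "refusing to overwrite $rf_tmp" >&2
        return 1
    fi
    mv -- "$rf_file" "$rf_tmp" || return 1                     # step 1
    if { cp -p --reflink=never -- "$rf_tmp" "$rf_file" 2>/dev/null ||
         cp -p -- "$rf_tmp" "$rf_file"; } &&                   # step 2
       cmp -s -- "$rf_tmp" "$rf_file"; then
        rm -- "$rf_tmp"                                        # step 3
    else
        # Copy failed or differs: put the original back.
        mv -f -- "$rf_tmp" "$rf_file"
        return 1
    fi
}
```

After rewrite_file /path/to/file, step 4 is a normal btrbk run. To keep the moved original until the backup has been verified, as suggested above, drop the rm line and delete the .good-copy file by hand later.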