digint / btrbk

Tool for creating snapshots and remote backups of btrfs subvolumes
https://digint.ch/btrbk/
GNU General Public License v3.0

Failed send/receive with "inode_cache" mount option #253

Open jpbrown-15 opened 6 years ago

jpbrown-15 commented 6 years ago

I hope this helps others. I don't see this as a bug in btrbk, but rather a btrfs problem with inode_cache and send/receive. I had been using btrbk for months without issue to send/receive backups. Then a power blip and a set of other issues corrupted a RAID0 pair of drives. I went back to the backups on the USB drive and discovered a failure in my backups that I hadn't detected. I was able to recover much of the drive content after setting up a new btrfs file system.

However, when I began backing up the new file system, the first full send to populate the backup device succeeded, but all subsequent incremental sends failed:

    2018-10-07T19:35:45-0500 send-receive starting /media/jpb/btr-backup/@.20181006 /mnt/root-home/btrbk_snapshots/@.20181006 /mnt/root-home/btrbk_snapshots/@.20181005 -
    2018-10-07T19:35:47-0500 send-receive ERROR /media/jpb/btr-backup/@.20181006 /mnt/root-home/btrbk_snapshots/@.20181006 /mnt/root-home/btrbk_snapshots/@.20181005 -
    2018-10-07T19:35:47-0500 delete_garbled starting /media/jpb/btr-backup/@.20181006 - - -
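Since the failure above went unnoticed for a while, here is a minimal Python sketch of how such log lines could be scanned for failed transfers. The field names (`timestamp`, `action`, `status`, `target`, `source`, `parent`, `message`) are assumptions inferred from the three lines shown, not taken from the btrbk documentation, and the helper names are hypothetical.

```python
# Assumed column layout, inferred from the log lines quoted in this issue.
FIELDS = ("timestamp", "action", "status", "target", "source", "parent", "message")


def parse_line(line):
    """Split one whitespace-separated log line into a dict, or return None."""
    parts = line.strip().split(None, len(FIELDS) - 1)
    if len(parts) < len(FIELDS):
        return None  # blank or unexpected line
    return dict(zip(FIELDS, parts))


def failed_transfers(log_text):
    """Return all send-receive entries whose status column is ERROR."""
    entries = (parse_line(line) for line in log_text.splitlines())
    return [
        e for e in entries
        if e and e["action"] == "send-receive" and e["status"] == "ERROR"
    ]


# The three lines from this report, used as sample input.
SAMPLE = """\
2018-10-07T19:35:45-0500 send-receive starting /media/jpb/btr-backup/@.20181006 /mnt/root-home/btrbk_snapshots/@.20181006 /mnt/root-home/btrbk_snapshots/@.20181005 -
2018-10-07T19:35:47-0500 send-receive ERROR /media/jpb/btr-backup/@.20181006 /mnt/root-home/btrbk_snapshots/@.20181006 /mnt/root-home/btrbk_snapshots/@.20181005 -
2018-10-07T19:35:47-0500 delete_garbled starting /media/jpb/btr-backup/@.20181006 - - -
"""

for entry in failed_transfers(SAMPLE):
    print(entry["timestamp"], "failed:", entry["source"], "->", entry["target"])
```

Running something like this against the log after each backup run (from cron, for example) would surface an ERROR entry right away instead of at restore time.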

Snapshots of my newly created file system transferred successfully, but snapshots of one that had survived the power blip did not. So I was scratching my head over why btrbk was working for some, but not all, on the same system.

When I ran the send/receive manually, the error was a "file not found" for a name that began with an o followed by some numbers. I don't have a good example. Essentially, an orphan file was being presented during the transfer, and send/receive had trouble with these files: it would find the file on one side of the transfer but not the other.

Then I remembered I had turned on inode_cache as a mount option. I had not reactivated it on my newly created repaired drives.

Changing /etc/fstab from this, which caused the send/receive to fail:

    UUID=f7dca671-def7-49bf-a147-9f3db364bfdd / btrfs defaults,subvol=@,autodefrag,space_cache,noatime,inode_cache 0 1

To this, which fixed the send/receive:

    UUID=f7dca671-def7-49bf-a147-9f3db364bfdd / btrfs defaults,subvol=@,autodefrag,space_cache,noatime 0 1
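For anyone with several btrfs entries to check, the same edit can be sketched in Python as a pure string transformation. The function name `strip_mount_option` is hypothetical, and the sketch only operates on text passed to it; it does not read or write /etc/fstab itself.

```python
def strip_mount_option(fstab_line, option="inode_cache"):
    """Return the line with `option` removed from the options field.

    Only btrfs entries are touched; comments, blank lines and other
    file system types are returned unchanged. Note that columns in a
    modified line are re-joined with single spaces.
    """
    stripped = fstab_line.strip()
    if not stripped or stripped.startswith("#"):
        return fstab_line
    fields = stripped.split()
    # fstab columns: device, mount point, type, options, dump, pass
    if len(fields) < 4 or fields[2] != "btrfs":
        return fstab_line
    options = [o for o in fields[3].split(",") if o != option]
    if len(options) == len(fields[3].split(",")):
        return fstab_line  # option not present, leave the line alone
    fields[3] = ",".join(options) or "defaults"
    return " ".join(fields)


# The failing entry from this report.
BEFORE = (
    "UUID=f7dca671-def7-49bf-a147-9f3db364bfdd / btrfs "
    "defaults,subvol=@,autodefrag,space_cache,noatime,inode_cache 0 1"
)

print(strip_mount_option(BEFORE))
```

Reviewing the output by hand before copying it into /etc/fstab is the safer route, since a malformed entry for / can prevent the system from booting.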

I then rebooted, deleted all the btrbk snapshots for this subvolume, manually created the initial snapshot just as btrbk would and then issued a btrbk resume. The system has been running perfectly since.

I had added inode_cache to attempt to improve the performance against a pair of 4TB drives configured in btrfs RAID0 and another pair of 2TB drives configured in btrfs RAID1. I didn't notice a performance improvement, but did notice the send/receive failure -- just too late.

ghost commented 6 years ago

Thanks for the report. You should report it to the btrfs devs too. It probably should be considered a bug.

digint commented 6 years ago

Thanks!

I'm curious: did you notice any performance effect when enabling inode_cache? I kept my hands off it, as btrfs(5) clearly warns:

inode_cache
    [...]
    Defaults to off due to a potential overflow problem when the
    free space checksums don’t fit inside a single page.
    Don’t use this option unless you really need it. [...]

Sidenote: I also suggest using space_cache=v2. As I read from the btrfs mailing list, this is more robust and the recommended setting.

jpbrown-15 commented 6 years ago

Gatak - I'll file a report with btrfs devs. Working on a simple, repeatable use case. I'll update the original post with a better example of the orphan filenames as seen from the command line (and hopefully logs).

digint - I saw no noticeable performance improvement from inode_cache and have since deactivated it from all my mounts. This was probably due to my systems not experiencing the right conditions to make inode_cache beneficial. And given the trouble it posed with send/receive, I decided it wasn't worth the risk you point out.

Thanks for reminding me about space_cache=v2. The default is still v1 according to its manpage. I'll migrate to v2.

jpbrown-15 commented 6 years ago

All -- I've been working most of the day to create a simplified version of the system that had the trouble and replicate the problem. So far, I have not been able to reproduce it with just the inode_cache mount option. While removing that option resolved my send/receive problem, it must be more complicated than simply having that mount option set. I am continuing to add the complexity one layer at a time to see where it breaks. This is going to take more time to isolate the combination that is reproducible and reportable to the btrfs dev team.