digint / btrbk

Tool for creating snapshots and remote backups of btrfs subvolumes
https://digint.ch/btrbk/
GNU General Public License v3.0

Transfer size stats #280

Open mzealey opened 5 years ago

mzealey commented 5 years ago

It would be nice (perhaps when running with -v) to include details about how many bytes were sent, which is presumably roughly equivalent to how much space the snapshot will take up relative to the previous one. Perhaps this could also be saved somewhere and shown in the stats, so you can see roughly what the deltas between snapshots are?

digint commented 5 years ago

I see two ways of implementing this:

  1. Add a command to the pipe (between btrfs send and btrfs receive), measuring the "transferred size of btrfs-send". This might not really reflect the size used on the target, but at least gives some magnitude. In order not to add too much to the pipe, I tried using mbuffer -v 2 (already in the pipe when using stream_buffer) which prints a summary. Sadly this does not work as mbuffer prints the status to the controlling terminal instead of file descriptor 2, making it impossible to catch from btrbk. Another approach would be to add dd (or any other command capable of printing a summary) to the pipe: this would introduce some more context switches and slow down things, but should work.

  2. A better approach would be to directly scan the target "received" subvolume. I've come up with a little script for this:

received-length.sh:

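SUBVOL=/path/to/subvolume
# Generation at which the subvolume was created
CGEN=$(btrfs subvolume show "$SUBVOL" | sed -n 's/\s*Gen at creation:\s*//p')
# List everything modified after creation, take the "len" value (7th field),
# join the values with "+" and let bc compute the sum
btrfs subvolume find-new "$SUBVOL" $((CGEN+1)) \
  | cut -d' ' -f7 \
  | tr '\n' '+' \
  | sed 's/\+\+$/\n/' \
  | bc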

This simply sums up the "len" field of all files modified since the creation of the subvolume. It works because btrfs receive first makes a snapshot of the parent subvolume, then adds the files according to the send-stream.
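For illustration, the summing step of the script above can also be written as a single awk filter. This is only a sketch: sum_len is a made-up helper name, and it assumes the same find-new line layout that the cut -f7 above relies on (the word "len" in field 6, the byte count in field 7). Lines that do not match, such as the trailing "transid marker" line, are skipped.

```shell
#!/bin/sh
# sum_len (hypothetical helper): read `btrfs subvolume find-new` output on
# stdin and print the sum of all "len" values in bytes.
# Assumption: data lines start with "inode" and carry "len <bytes>" in
# fields 6 and 7, as the cut -f7 in the script above implies.
sum_len() {
    awk '$1 == "inode" && $6 == "len" { sum += $7 }
         END { printf "%.0f\n", sum }'
}

# Intended use (path and CGEN as in the script above):
#   btrfs subvolume find-new "$SUBVOL" "$((CGEN+1))" | sum_len
```

Like the original pipeline, this only adds up the reported lengths, so it is an estimate rather than the real size used on disk.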

Issues:

I'm planning to implement this as a new btrbk command, something like btrbk list backup-size.

This needs some more investigation, maybe there's a nicer way to get the "real size used on disk".

mzealey commented 5 years ago

I would think option 1 would be a reasonable estimate and not need much in the way of overhead. I seem to recall there is another way if qgroups are enabled but in my case that would not help

edit: remove quoted text

digint commented 5 years ago

> I would think option 1 would be a reasonable estimate and not need much in the way of overhead.

Yes, this is also valuable information. Especially when you want to also have an estimate of the ssh traffic generated by btrbk.
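A rough sketch of what option 1 could look like outside of btrbk, using dd as the extra command in the pipe. The helper name count_stream and the stats-file handling are made up for this example; it is not how btrbk does (or will do) it. It relies on dd printing its transfer summary to stderr, with the byte count at the start of the last line.

```shell
#!/bin/sh
# count_stream STATS_FILE (hypothetical helper, not part of btrbk):
# copy stdin to stdout unchanged and write the number of bytes that
# passed through to STATS_FILE.
count_stream() {
    stats_file=$1
    # dd writes its summary to stderr; keep it in a temporary file.
    # LC_ALL=C avoids localized messages.
    LC_ALL=C dd bs=64k 2>"$stats_file.raw"
    # The last summary line begins with the total byte count.
    awk 'END { print $1 }' "$stats_file.raw" >"$stats_file"
    rm -f "$stats_file.raw"
}

# Intended use (all paths are placeholders):
#   btrfs send -p /snap/parent /snap/current \
#     | count_stream /tmp/send-size \
#     | btrfs receive /mnt/backup
```

The number measures the send-stream (and thus roughly the traffic), not the space used on the target.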

Having a command for listing (option 2) has the advantage that it is reproducible, and also works for manually generated backups.

The btrfs subvolume find-new approach above is not very accurate, and gives only a rough estimate of what is really added on disk (it ignores deleted files, shared extents, e.g. from clone sources, etc.).

For more accurate results, we need to do more extensive analysis on the block level, which unfortunately is very time consuming. I did some promising tests with extents-list, and implemented a very experimental btrbk extents-diff command on the extents-diff branch for testing.

yarikoptic commented 4 years ago

I was about to file a new issue begging for a new diffstat or diff -stat, but it sounds like the desire is similar to the one discussed here: to see a summary of the differences between two snapshots (not only the total size of new/modified files, which is what diff reports, I guess). Even if the reported sizes (deleted, added or modified) do not account for possible operations on CoW'ed files, that would already be useful information.