andrewchambers / bupstash

Easy and efficient encrypted backups.
https://bupstash.io
MIT License

My /data/ dir doubled in size one night #362

Open d5ve opened 1 year ago

d5ve commented 1 year ago

Longtime and happy bupstash user here!

I use bupstash to take hourly backups during the working day from my macOS laptop to a repo on a Linux PC.

laptop$ BUPSTASH_REPOSITORY=ssh:... BUPSTASH_KEY=... bupstash put --exclude "a few dirs" /Users/d5ve

Then twice each night I rsync the whole bupstash repository from the PC to rsync.net.

pc$ /usr/bin/rsync -avH /backups/laptop ab-1234.rsync.net:bupstash

For some reason, the /data/ dir in the PC's bupstash repository recently doubled in size. I only noticed this due to my usage graph on rsync.net jumping from 380GB to 650GB on November 14th.

On the laptop, bupstash list shows a list of backups slowly growing from 240GB to 260GB over the past year.

On the PC, du -sh shows that the /data/ dir in the bupstash repo is now 610GB.

I ran a bupstash gc on the PC, and it deleted only 4GB of chunks.

The cronjob on the PC that performs the rsync runs at 2AM and 3AM. According to the logs from around the night in question, the rsyncs on the previous night showed 389GB of remote data on rsync.net, and from Nov 14th onward they show 658GB.

Nov 14th doesn't seem to match any daylight savings changeover or anything like that.

Is there a way to interrogate the repository to find out what changed, and what the extra 300GB of files are?
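One rough way to see when the extra data arrived is to total the file sizes under the repo's data dir by modification date. The sketch below is not a bupstash feature. It assumes the data dir holds chunks as ordinary files whose mtimes reflect when they were written, which may not hold for every repository layout; `bytes_by_day` is a hypothetical helper name, and the path is passed on the command line.

```python
import os
import sys
from collections import defaultdict
from datetime import datetime, timezone


def bytes_by_day(data_dir):
    """Sum file sizes under data_dir, grouped by modification date (UTC)."""
    totals = defaultdict(int)
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            st = os.stat(os.path.join(root, name))
            # Assumption: the mtime is the time the chunk file was written.
            day = datetime.fromtimestamp(st.st_mtime, tz=timezone.utc).date()
            totals[day.isoformat()] += st.st_size
    return dict(totals)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Print one line per day, oldest first.
    for day, size in sorted(bytes_by_day(sys.argv[1]).items()):
        print(f"{day}  {size / 1e9:8.2f} GB")
```

Run it against the data dir of the repo on the PC. A single day that accounts for a few hundred GB would point at one backup having rewritten most of the chunks.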

d5ve commented 1 year ago

The laptop is running bupstash-0.12.0 from homebrew (though I may have updated it since Nov 14th).

The PC is running bupstash-0.12.0 (though I may have updated it since Nov 14th).

d5ve commented 1 year ago

bupstash diff id=some-id-on-the-10th-november :: id=some-id-from-today shows pretty much what I'd expect - some new photos and other documents. Maybe a couple of GBs of differences.

andrewchambers commented 1 year ago

I think the root cause may be that the most recent bupstash update has tweaked the deduplication algorithm (to enable higher performance) - this is not likely to happen automatically again in the future, and my apologies for the inconvenience.

d5ve commented 1 year ago

Is there any way that I can "clean up" some of the extra data in the repo on the PC?

The total size of the data being backed up from the laptop is about 300GB, which zips down to about the 260GB reported by bupstash for each recent backup.

Most of the data is a photo library, so I wouldn't expect there to be another 300GB of "diffs" in the bupstash repo data dir on the PC.

andrewchambers commented 1 year ago

@d5ve You would need to remove the snapshots taken before the version upgrade and run bupstash gc to prune away the old data.

To further explain the repository growth - bupstash splits your photo library into pseudo-random sized chunks and only ever stores a single copy of each chunk, no matter how many backups it is present in. An update I made to the chunking algorithm means you now have two similar, but not identical, sets of chunks, which has disrupted deduplication.

Currently the easiest way to clean up the repository is just to cycle out the old snapshots, though in the future I could try to think of a better solution if I ever need to change the repository format again.
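The effect can be illustrated with a toy model. This is only a sketch, not bupstash's actual chunker: `chunk`, `store`, and the `mask` parameter are made up for illustration, the repository is just a dict keyed by chunk hash, and the "algorithm change" is modelled as a different cut mask.

```python
import hashlib
import random


def chunk(data, mask, window=16):
    """Toy content-defined chunker: cut wherever a hash of the previous
    `window` bytes has all of the bits selected by `mask` set to zero."""
    chunks, start = [], 0
    for i in range(window, len(data)):
        digest = hashlib.blake2b(data[i - window:i], digest_size=4).digest()
        if int.from_bytes(digest, "big") & mask == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks


def store(repo, chunks):
    """Keep a single copy of each chunk, keyed by its content hash."""
    for c in chunks:
        repo[hashlib.sha256(c).hexdigest()] = c


def repo_size(repo):
    return sum(len(c) for c in repo.values())


rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(100_000))

repo = {}
store(repo, chunk(data, mask=0x3FF))  # first backup
size_one = repo_size(repo)
store(repo, chunk(data, mask=0x3FF))  # same data, same chunker
size_same = repo_size(repo)
store(repo, chunk(data, mask=0x7FF))  # same data, changed chunker
size_changed = repo_size(repo)

print(size_one, size_same, size_changed)
```

Backing up the same data twice with the same chunker adds nothing, because every chunk is already stored. Backing it up again after the cut rule changes produces mostly different chunks, so the repository grows even though the data itself did not.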

ptman commented 1 year ago

There should probably be a way to rechunk old data, especially if chunking is something that can be tweaked by the user.