borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/

Repo size increases with no changes; deduplication issues? #4594

Closed shadowrylander closed 5 years ago

shadowrylander commented 5 years ago

Have you checked borgbackup docs, FAQ, and open Github issues?

Yes

Is this a BUG / ISSUE report or a QUESTION?

Question

System information. For client/server mode post info for both machines.

Your borg version (borg -V).

Operating system (distribution) and version.

Hardware / network configuration, and filesystems used.

How much data is handled by borg?

20 Megabytes; file sizes are in bytes and kilobytes.

Full borg command line that led to the problem (leave out excludes and passwords)

borg create /s/borg/test/repo::test.2019.05.29.00.20.28 /s/borg/test/source --comment test.2019.05.29.00.20.28 --stats --progress --compression auto,zstd,22 --chunker-params 10,23,16,10

Describe the problem you're observing.

Every time I run the command, the repo increases in size by a megabyte or so, despite nothing having changed; I am keeping the same chunker parameters, as well as compression, and using the same cache directory. Is there anything I'm missing regarding the deduplication mechanism?

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

Yes; the same command repeatedly.

Include any warning/errors/backtraces from the system logs

N/A

ThomasWaldmann commented 5 years ago

Every borg archive is a FULL backup that has all files in the input data set.

So, even if nothing changed, borg will create a stream of file metadata for all the files (and from within this metadata, it will reference all the content chunks [which are already in the repo, if nothing changed]).

Ideally, this whole stream of metadata will fully dedup against the previous stream of metadata (from the previous backup). You'll see a dedup size of only a few kB or so in that ideal case.

But it can easily happen that some metadata changes and spoils the deduplication.

E.g. if you access the files and the atime changes, then you'll get a slightly different metadata stream that dedups badly. You can try the --noatime option if you do not need to save the atime. The 2nd backup after changing to --noatime might dedup better (if atime really was the culprit).
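
To illustrate the point about atime, here is a minimal toy model in Python. It is a sketch under stated assumptions, not borg's actual format: the metadata is serialized as JSON lines and cut into fixed-size chunks, whereas borg uses its own serialization and a content-defined chunker. The names `metadata_stream`, `store`, and `CHUNK_SIZE` are hypothetical. It shows that an unchanged metadata stream adds nothing new, a changed atime makes the stream's chunks new, and leaving atime out (which is what --noatime does) restores the deduplication.

```python
import hashlib
import json

CHUNK_SIZE = 256  # toy fixed-size chunks; borg itself uses a content-defined chunker


def metadata_stream(files, keep_atime=True):
    """Serialize per-file metadata into one byte stream (toy format, not borg's)."""
    items = []
    for f in files:
        item = dict(f)
        if not keep_atime:
            item.pop("atime", None)  # comparable in spirit to --noatime
        items.append(json.dumps(item, sort_keys=True))
    return "\n".join(items).encode()


def store(stream, repo):
    """Add the stream's chunks to repo (a set of hashes); return the bytes newly stored."""
    new_bytes = 0
    for i in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in repo:
            repo.add(digest)
            new_bytes += len(chunk)
    return new_bytes


# Hypothetical input data set: 50 files with identical ownership and timestamps.
files = [
    {"path": f"source/file{i:03d}", "uid": 1000, "mtime": 1559089228, "atime": 1559089228}
    for i in range(50)
]
# The same files after something read them: only atime differs.
touched = [dict(f, atime=f["atime"] + 60) for f in files]

repo = set()
first = store(metadata_stream(files), repo)        # everything is new
unchanged = store(metadata_stream(files), repo)    # identical stream, nothing new
after_read = store(metadata_stream(touched), repo) # atime changed, chunks are new again

repo_noatime = set()
store(metadata_stream(files, keep_atime=False), repo_noatime)
noatime_after_read = store(metadata_stream(touched, keep_atime=False), repo_noatime)

print(first, unchanged, after_read, noatime_after_read)
```

In this toy model every file's atime changes, so every chunk of the metadata stream changes too. In practice only the chunks that cover touched files would be affected.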

shadowrylander commented 5 years ago

Ah; so the repo increasing from 11 to 29 megabytes after 83 or so runs, with no changes or access to the source, seems about right? I may have miscalculated how many times I ran the backup! 😅😅😅😅

ThomasWaldmann commented 5 years ago

I'd expect less growth if the metadata dedup works.

You can also use borg info repo::archive to check the deduplicated size; it should be rather tiny if there were no changes.
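
For checking several archives at once, a small Python sketch can pull the deduplicated sizes out of borg's JSON output. This assumes a borg version whose info command accepts --json, and that each entry under "archives" carries a "stats" object with a "deduplicated_size" field in bytes; verify both against the documentation of your borg version. The function name `deduplicated_sizes` and the sample data are hypothetical and are not output from the repository discussed here.

```python
import json


def deduplicated_sizes(info_json):
    """Return (archive name, deduplicated size in bytes) pairs from `borg info --json` text."""
    info = json.loads(info_json)
    return [
        (archive["name"], archive["stats"]["deduplicated_size"])
        for archive in info.get("archives", [])
    ]


# Hypothetical sample in the assumed layout, not real output.
sample = json.dumps({
    "archives": [
        {"name": "example-1", "stats": {"deduplicated_size": 1024}},
        {"name": "example-2", "stats": {"deduplicated_size": 307200}},
    ]
})

for name, size in deduplicated_sizes(sample):
    print(f"{name}: {size / 1000:.2f} kB")
```

To use it on a real repository, the JSON text could come from something like `subprocess.run(["borg", "info", "--json", "--last", "5", repo], capture_output=True, text=True).stdout`, assuming the installed borg supports the --last archive filter for info.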

shadowrylander commented 5 years ago

Apparently not; every time I make a backup, the deduplicated size increases by 3 kilobytes, so the last three backups had "This archive" deduplicated sizes of 358.24, 361.56, and 363.84. Again, no changes or accesses. However, the "All archives" deduplicated size stays at a steady 28.97 megabytes across the three.

ThomasWaldmann commented 5 years ago

3kB or 300kB?

shadowrylander commented 5 years ago

You know, at this point I'm not entirely sure... I checked the "This archive" deduplicated sizes for the last 5 backups: the first two were around 1 kB, and the last three were around 300 kB! I legitimately don't know what's going on. Can I send you a pastebin of all the size info? That is, is there a way to check all the "This archive" sizes at once, short of JSON manipulation?

ThomasWaldmann commented 5 years ago

1 kB sounds good, 300 kB not so much.

Try borg diff, maybe?

shadowrylander commented 5 years ago

Ah, I believe the ownership of the files is changing, as I am using a Docker container as well; the owner switches to the root user in the container whenever I back up via Docker, and to my user whenever I back up via WSL. Is there a way to ignore the ownership?

ThomasWaldmann commented 5 years ago

No.
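
Since ownership cannot be ignored, it can help to confirm that both environments really see different owners for the same tree. The following Python sketch is an illustration under assumptions, not a borg feature: `ownership` and `ownership_changes` are hypothetical helper names, and the second snapshot is simulated here rather than taken inside a container.

```python
import os
import tempfile


def ownership(root):
    """Map each path under root (relative to root) to its (uid, gid)."""
    owners = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)
            owners[os.path.relpath(full, root)] = (st.st_uid, st.st_gid)
    return owners


def ownership_changes(before, after):
    """Return the paths whose (uid, gid) differs between two snapshots."""
    return {
        path: (before[path], after[path])
        for path in before.keys() & after.keys()
        if before[path] != after[path]
    }


with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "a.txt"), "w").close()
    seen_from_wsl = ownership(root)

# Simulated second view of the same tree with a different owner,
# standing in for what another environment might report.
seen_from_docker = {path: (uid + 1, gid) for path, (uid, gid) in seen_from_wsl.items()}

changes = ownership_changes(seen_from_wsl, seen_from_docker)
print(changes)
```

In real use, `ownership` would be run once in each environment against the same source directory, and the two results compared. Any path reported by `ownership_changes` will produce different archived metadata in the two backups.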

shadowrylander commented 5 years ago

Fair enough! Are there any other factors I should keep in mind regarding deduplication, aside from the cache and file modification?

ThomasWaldmann commented 5 years ago

The file metadata ends up in the metadata stream, so it only dedups nicely if the archived metadata does not change. That is not possible while the ownership keeps changing; atime can be ignored, and bsdflags can also be ignored.

The file content data ends up in the content chunks; they'll dedup nicely if the content does not change, or changes only a little. Widespread "sprinkling" of little changes over a huge file can spoil that dedup process.
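
The effect of "sprinkled" changes can be sketched with a toy content-defined chunker in Python. This is a simplified stand-in, not borg's chunker: it decides cut points by hashing a small sliding window with SHA-256, and `WINDOW`, `MASK`, `chunks`, and `new_chunks` are hypothetical names chosen for the illustration. It compares how many chunks become new after one local edit versus many small edits spread over the same data.

```python
import hashlib
import random

WINDOW = 16   # bytes examined to decide on a cut point
MASK = 0x3F   # cut when the low 6 bits are zero, so chunks average roughly 64 bytes


def chunks(data):
    """Split data at content-defined boundaries (toy version, not borg's chunker)."""
    out, start = [], 0
    for i in range(WINDOW, len(data) + 1):
        digest = hashlib.sha256(data[i - WINDOW:i]).digest()
        if int.from_bytes(digest[:4], "big") & MASK == 0:
            out.append(data[start:i])
            start = i
    if start < len(data):
        out.append(data[start:])
    return out


def new_chunks(old, new):
    """Count the chunks of `new` whose hash is not already known from `old`."""
    known = {hashlib.sha256(c).digest() for c in chunks(old)}
    return sum(hashlib.sha256(c).digest() not in known for c in chunks(new))


rng = random.Random(0)  # fixed seed so the run is repeatable
base = bytes(rng.getrandbits(8) for _ in range(20000))

# One local change in the middle of the data.
one_edit = bytearray(base)
one_edit[10000] ^= 0xFF

# Forty one-byte changes sprinkled evenly over the data.
sprinkled = bytearray(base)
for pos in range(250, len(base), 500):
    sprinkled[pos] ^= 0xFF

single = new_chunks(base, bytes(one_edit))
spread = new_chunks(base, bytes(sprinkled))
print(f"one edit: {single} new chunks; 40 sprinkled edits: {spread} new chunks")
```

Because cut points depend only on nearby content, a single edit disturbs just the chunk or chunks around it, while each sprinkled edit invalidates at least the chunk it falls in. With real chunk sizes far larger than in this toy, a handful of tiny edits spread across a huge file can force a large share of it to be stored again.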

That's about it I guess.

shadowrylander commented 5 years ago

Perfect! Then I've got all bases covered. Thank you kindly for the help!