borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/

"Analyze" function to find (and remove) missed non-dedupable temp/cache hotspots #71

Closed: jumper444 closed this 1 month ago

jumper444 commented 9 years ago

My previous issue asked whether the 'delete' command could be extended to remove individual files or directories from within one or more archives (or the entire repository). The feature discussed below is a method of finding non-dedupable 'hotspots' in backups (typically missed/hidden cache or temp files) and then deleting them to reclaim space.

I suggest considering a command such as "analyze" that works at the repository level (or across multiple archives; the more the better). This command would look for two things:

1) Files (of fixed name and directory location) which, over multiple backups, have an extremely high ratio of non-dedupable data to their size.

2) Directories (of fixed name and location) which, over multiple backups, have a very high ratio of non-dedupable data to their size.

You can see that such a scan would immediately reveal accidentally missed swap files, temp files, and temp directories. An administrator could use this command to locate that data and, after further analysis, delete it.

In the first case (1), if the file name and location stay the same between archives yet the file keeps changing, so that every backup adds a massive amount of new data chunks, then almost certainly you've found some sort of temp file whose deletion from the backup would reclaim a large amount of space. For example, on backups of Windows machines this check would immediately flag "pagefile.sys" (the Windows swap file) as a huge red flag. Note that it isn't in a 'cache' directory and doesn't have a .TMP extension, yet this file does not need to be backed up, and its exclusion (or deletion post-backup with a 'delete' command) would allow massive space savings.

Case (2) is where you have temp files whose names keep changing randomly (so case (1) won't catch them) but whose location doesn't change. This would find hotspots like "C:\Windows\Temp", again something that could be deleted and reclaimed from a backup repository. (In this example the directory is clearly labeled 'temp', but it was just the first one I could think of. There are plenty of temp directories that use random file names and don't stand out when you look at their names.)

The analyze command's specific parameters would need some experimentation to determine what to display and how to calculate it. Any results would obviously require further manual inspection before deleting anything. But such a feature would do a good job of highlighting missed hotspots in large or complex backups.
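For illustration, here is a minimal Python sketch of one way the two ratios could be computed. It assumes the analyzer can obtain, for each archive, a mapping from file path to the set of chunk IDs making up that file; `new_data_ratios` is a made-up name, not an existing borg API, and a real implementation would presumably weight by chunk size rather than chunk count:

```python
from collections import defaultdict
from os.path import dirname

def new_data_ratios(archives):
    """archives: one dict per archive, oldest first, mapping
    file path -> set of chunk ids making up that file."""
    seen = set()                      # chunk ids observed in earlier archives
    new_by_path = defaultdict(int)    # chunks never seen before, per path
    total_by_path = defaultdict(int)  # total chunks, per path
    for i, archive in enumerate(archives):
        if i > 0:  # only 2nd+ backups say anything about churn
            for path, chunks in archive.items():
                total_by_path[path] += len(chunks)
                new_by_path[path] += len(chunks - seen)
        for chunks in archive.values():
            seen |= chunks
    # case 1: per-file ratio (same name/location, keeps producing new chunks)
    file_ratio = {p: new_by_path[p] / total_by_path[p] for p in total_by_path}
    # case 2: per-directory ratio (file names change, the location does not)
    dir_new, dir_total = defaultdict(int), defaultdict(int)
    for p, total in total_by_path.items():
        d = dirname(p)
        dir_new[d] += new_by_path[p]
        dir_total[d] += total
    dir_ratio = {d: dir_new[d] / dir_total[d] for d in dir_total}
    return file_ratio, dir_ratio
```

Sorting either result by ratio (ideally weighted by total size) should surface pagefile.sys-style files and churn-heavy temp directories near the top.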

Thoughts?

ThomasWaldmann commented 9 years ago

Interesting idea, but quite some effort to implement. So the question is whether you can't simply find these files/directories by looking at the --verbose log output of the 2nd+ backup: there borg prints U for unchanged files and A for added files.
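For example, a tiny script along these lines (a sketch, assuming the per-file status lines have the usual `A /path` / `M /path` / `U /path` shape and that the logs of several backup runs have been saved to files) could count how often each path is re-added or modified:

```python
import sys
from collections import Counter

# Count how often each path shows up as added/modified across several
# saved `borg create` logs that include per-file status lines.
added = Counter()
for logfile in sys.argv[1:]:
    with open(logfile, errors="replace") as f:
        for line in f:
            status, _, path = line.rstrip("\n").partition(" ")
            if status in ("A", "M"):
                added[path] += 1

# Paths re-added/changed in (nearly) every backup are hotspot candidates.
for path, count in added.most_common(30):
    print(f"{count:4d}  {path}")
```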

RonnyPfannschmidt commented 8 years ago

I am interested in helping with this (as it is what I currently want to do). In combination with recreate --exclude, it can be used to clean up backups iteratively.
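A rough sketch of how that iterative cleanup step could be scripted, once hotspots are known (the repository path and exclude patterns below are placeholders; `borg recreate --exclude` is the real option, everything else is illustrative):

```python
import subprocess

repo = "/backups/borg-repo"                 # placeholder repository path
excludes = ["*/pagefile.sys", "*/Temp/*"]   # placeholder hotspot patterns

# Rewrite the archives in the repository without the excluded paths.
cmd = ["borg", "recreate"]
for pattern in excludes:
    cmd += ["--exclude", pattern]
cmd.append(repo)
subprocess.run(cmd, check=True)
```

Repeating this after each round of analysis is what would make the iterative cleanup workable.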

ThomasWaldmann commented 1 month ago

This idea only works when analysing archives that contain basically the same data set at different points in time.

For borg 1.x that would mean some pattern matching on the archive name (like -a), for borg 2 it could also use archive series (identical archive names).
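For instance (a sketch; the repository path and the naming convention assumed in the regex describe a local setup, not something borg prescribes), archives could be grouped into comparable series by stripping the timestamp suffix from the names returned by `borg list --short`:

```python
import re
import subprocess
from collections import defaultdict

repo = "/backups/borg-repo"  # placeholder repository path
out = subprocess.run(["borg", "list", "--short", repo],
                     capture_output=True, text=True, check=True).stdout

# Treat "myhost-2024-05-01" and "myhost-2024-05-02" as one series.
series = defaultdict(list)
for name in out.splitlines():
    key = re.sub(r"[-_]\d.*$", "", name)  # assumed <prefix>-<timestamp> naming
    series[key].append(name)

for key, members in sorted(series.items()):
    print(f"{key}: {len(members)} archives")
```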

ThomasWaldmann commented 1 month ago

#8436 is a start.

ThomasWaldmann commented 1 month ago

@jumper444 @RonnyPfannschmidt can you review the PR / give feedback?