borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/

Request: analyze which files/directories are using the most storage #8074

Closed MatthewL246 closed 1 month ago

MatthewL246 commented 9 months ago

Have you checked borgbackup docs, FAQ, and open GitHub issues?

Yes.

Is this a BUG / ISSUE report or a QUESTION?

Feature request.

System information

N/A

Feature request

I think it would be useful if Borg could generate a list showing which files and directories have used the most storage space (after compression and deduplication) in a repo within a certain time period, such as the last month. This would help find directories that are wasting space in the repo and that the user might have accidentally forgotten to exclude.

My inspiration for this is the `git-filter-repo --analyze` option, which creates a report of which files in a Git repo have used the most space throughout the repo's history. A `borg analyze` command could look something like that.

Example `git-filter-repo` analysis for the Borg repo

```
=== All directories by reverse size ===
Format: unpacked size, packed size, date deleted, directory name
744008017 26787700 <present>  <toplevel>
517286397 12653532 <present>  src/borg
517286397 12653532 <present>  src
104691374 10788204 <present>  docs
 11934188  5279538 <present>  docs/internals
 16803244  2387797 <present>  src/borg/algorithms
115776607  2116171 <present>  src/borg/testsuite
 13993693  1921949 <present>  src/borg/algorithms/zstd/lib
 13993693  1921949 <present>  src/borg/algorithms/zstd
 80398937  1552724 <present>  borg
  2621210  1193969 <present>  docs/misc
 11735486   893532 <present>  docs/man
  9106300   720133 <present>  docs/usage
  3752583   686089 <present>  src/borg/algorithms/zstd/lib/compress
  5965792   479344 <present>  src/borg/algorithms/zstd/lib/legacy
 15161225   435674 <present>  attic
  1631753   358329 <present>  docs/misc/asciinema
 17886036   333528 <present>  borg/testsuite
  7417804   322251 <present>  src/borg/helpers
  4493375   267651 2013-07-09 darc
  1256006   246006 <present>  src/borg/algorithms/zstd/lib/common
  1535426   240287 <present>  src/borg/algorithms/xxh64
  9980617   239345 <present>  src/borg/archiver
 10038085   212145 <present>  src/borg/testsuite/archiver
  1234524   204819 <present>  src/borg/algorithms/zstd/lib/decompress
  7281292   200862 <present>  src/borg/crypto
  2799285   163539 <present>  scripts
   747157   155625 <present>  src/borg/algorithms/lz4/lib
   747157   155625 <present>  src/borg/algorithms/lz4
  2624369   141717 <present>  scripts/shell_completions
   862668   125015 <present>  src/borg/algorithms/zstd/lib/dictBuilder
   967385   109191 2019-05-13 src/borg/_msgpack
   987519   103475 2010-10-27 dedupestore
  1502405    93505 <present>  src/borg/platform
  1524670    74299 <present>  scripts/shell_completions/zsh
  2475851    59442 <present>  attic/testsuite
   596814    46152 <present>  docs/deployment
   407120    39587 <present>  docs/usage/general
   874383    39206 <present>  scripts/shell_completions/fish
   225316    28212 <present>  scripts/shell_completions/bash
    92775    27211 <present>  docs/_static
   189831    26596 2010-03-01 dedupstore
   414837    25466 <present>  .github
   410788    24095 <present>  .github/workflows
   142992    23397 2017-05-02 src/borg/_crc32
    69599    19679 2021-01-28 src/borg/algorithms/blake2
    90056    19502 2016-01-24 borg/support
   354626    18382 <present>  src/borg/cache_sync
    77575    16404 <present>  src/borg/algorithms/zstd/lib/deprecated
    84075    15074 2020-12-21 .travis
   222060    12566 <present>  src/borg/algorithms/msgpack
    41389    11046 2021-01-28 src/borg/algorithms/blake2/ref
    28210     8642 <present>  src/borg/blake2
    17281     8199 <present>  requirements.d
    70103     8080 <present>  docs/borg_theme/css
    70103     8080 <present>  docs/borg_theme
    65232     6249 2015-10-12 docs/_themes
    40066     5064 <present>  deployment/windows
    40066     5064 <present>  deployment
   104163     4258 2013-07-09 darc/testsuite
    11638     3982 <present>  docs/3rd_party
    53338     3683 2015-10-12 docs/_themes/local
     7968     3133 2022-02-27 docs/3rd_party/blake2
    11894     2566 2015-05-13 docs/_themes/attic
    45939     2393 2015-10-12 docs/_themes/local/static
     9171     1553 2015-05-13 docs/_themes/attic/static
     2012      765 <present>  scripts/fuzz-cache-sync
     1608      735 <present>  scripts/make-testdata
     3235      661 2010-10-31 doc
     1032      608 <present>  docs/_templates
     1530      451 2022-02-26 docs/3rd_party/zstd
      328      269 2013-06-24 fake_pyrex
      231      177 2013-06-24 fake_pyrex/Pyrex
      204      142 2013-06-24 fake_pyrex/Pyrex/Distutils
      266      124 <present>  scripts/fuzz-cache-sync/testcase_dir
      614      117 <present>  docs/3rd_party/msgpack
     1311      110 2022-02-26 docs/3rd_party/lz4
```
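
For illustration, the per-directory rollup in a report like this can be sketched in a few lines of Python. This is only a sketch of the aggregation step: the `(path, size)` input and the name `rollup_directory_sizes` are hypothetical, nothing here uses Borg's internals, and plain file sizes say nothing about what remains after deduplication and compression.

```python
from collections import defaultdict


def rollup_directory_sizes(file_sizes):
    """Sum file sizes into every ancestor directory.

    file_sizes: iterable of (path, size_in_bytes) pairs. This is a
    hypothetical input format, not something Borg emits in this form.
    Returns a list of (directory, total_size), largest first.
    """
    totals = defaultdict(int)
    for path, size in file_sizes:
        components = path.strip("/").split("/")
        # Credit the size to each parent directory, not to the file itself.
        for depth in range(1, len(components)):
            totals["/".join(components[:depth])] += size
    # Sort by size descending, then by name so ties are deterministic.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    sample = [
        ("src/borg/archive.py", 300),
        ("src/borg/helpers/fs.py", 100),
        ("docs/usage.rst", 50),
    ]
    for directory, total in rollup_directory_sizes(sample):
        print(f"{total:>8} {directory}")
```

A real analysis would need sizes that reflect what each path newly contributed to the repo, which is the hard part of the request.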

It would also be interesting to see a feature that does something similar for "time spent backing up" instead of storage used, although I don't know if that would be feasible.

ThomasWaldmann commented 9 months ago

Borg does not yet have such a feature, but I guess it would be possible to implement the space-usage analysis.

It is not possible to analyse the time spent backing up an individual file or directory: we only have the overall backup time for a backup archive, with no finer-grained timing data.

Implementation notes:

MatthewL246 commented 9 months ago

Since it sounds like per-file timing isn't implemented, I made a quick Python script that ranks directories by their backup times, in case anyone else finds it useful. It requires a timestamped backup log, which can be generated with `borg create --list ... | ts -s "%.s" | tee borg_log.txt`.

from collections import defaultdict

path_backup_times = defaultdict(float)

with open("borg_log.txt", "r") as file:
    previous_timestamp = 0
    for line in file:
        # Split into at most three fields so spaces inside the path are preserved.
        parts = line.rstrip("\n").split(maxsplit=2)
        if len(parts) == 3:
            timestamp = float(parts[0])
            file_flag = parts[1]
            file_path = parts[2]

            # See https://borgbackup.readthedocs.io/en/latest/usage/create.html#item-flags
            if file_flag in ["A", "M", "U", "C", "E"]:
                backup_time = timestamp - previous_timestamp
                path_components = file_path.split("/")
                for i in range(1, len(path_components) + 1):
                    component = "/".join(path_components[:i])
                    path_backup_times[component] += backup_time

            previous_timestamp = timestamp

sorted_paths = sorted(path_backup_times.items(), key=lambda x: x[1], reverse=True)[:20]

for rank, (path, backup_time) in enumerate(sorted_paths, start=1):
    print(f"{rank}. {path} ({round(backup_time)}s)")
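
To make the attribution rule concrete, here is a small self-contained variant of the same idea run on an in-memory log. The sample lines are made up for illustration and assume the `timestamp flag path` layout produced by the `ts` pipeline above; `rank_paths_by_time` is a hypothetical helper name.

```python
from collections import defaultdict


def rank_paths_by_time(log_lines, flags=("A", "M", "U", "C", "E")):
    """Attribute the time elapsed since the previous log line to each path prefix."""
    totals = defaultdict(float)
    previous = 0.0
    for line in log_lines:
        parts = line.rstrip("\n").split(maxsplit=2)
        if len(parts) != 3:
            continue  # not a "timestamp flag path" line
        timestamp, flag, path = float(parts[0]), parts[1], parts[2]
        if flag in flags:
            components = path.split("/")
            for i in range(1, len(components) + 1):
                totals["/".join(components[:i])] += timestamp - previous
        # Lines with other flags still advance the clock.
        previous = timestamp
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up sample log: elapsed seconds, item flag, path.
sample_log = [
    "2.0 A home/user/big.iso",
    "2.5 d home/user",
    "3.0 M home/user/notes.txt",
]
ranking = rank_paths_by_time(sample_log)
for path, seconds in ranking:
    print(f"{path} ({seconds:.1f}s)")
```

The gap before each listed file is charged to that file and to all of its parent directories, so a slow file deep in a tree raises the total of every directory above it.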
ThomasWaldmann commented 1 month ago

Idea: borg2 compact needs to read all archives anyway, so it could compute some stats as a side effect.

ThomasWaldmann commented 1 month ago

Related: #71

ThomasWaldmann commented 1 month ago

borg2 beta 12 now has "borg analyze"; see the docs for what precisely it does.