borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/

borg2: build_chunkindex_from_repo is slow #8397

Closed ThomasWaldmann closed 1 month ago

ThomasWaldmann commented 1 month ago

Problem:

That function does a repository.list(), listing all the object IDs in the repo to build an in-memory chunkindex.

Because all objects are stored separately in a two-level-deep directory structure, that amounts to 1 + 256 + 65536 listdir() calls in the worst case. Depending on store speed, connection latency, etc., that can take quite a while.
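For illustration, here is a minimal sketch (not borg's actual code) of what such a full listing looks like over SFTP, assuming a connected paramiko SFTPClient and the data/<xx>/<yy>/<object-id> layout; every directory costs one listdir() round trip:

def list_all_object_ids(sftp, base="repo/data"):
    # Walk the two-level directory tree; each listdir() is one round trip.
    calls = 0
    object_ids = []
    level1 = sftp.listdir(base)                       # 1 call for data/
    calls += 1
    for d1 in level1:                                 # up to 256 dirs, e.g. "a3"
        level2 = sftp.listdir(f"{base}/{d1}")
        calls += 1
        for d2 in level2:                             # up to 256 dirs, e.g. "9e"
            names = sftp.listdir(f"{base}/{d1}/{d2}")
            calls += 1
            object_ids.extend(names)                  # file names are the object IDs
    return object_ids, calls                          # calls <= 1 + 256 + 65536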

The in-memory chunkindex is currently not persisted to local cache.

ThomasWaldmann commented 1 month ago

Analysis:

There are only a few borg2 commands that remove objects from data/ in the store: borg compact, borg check, and the special cases borg debug and borg repo-delete.

Notably, borg delete and borg prune do NOT delete objects from data/: they only remove the archive, not the chunks it references.

So, the set of objects in data/ is always increasing until compact/check is run (we can ignore borg debug and borg repo-delete).

borg create must not assume a chunk is in the repo when it in fact isn't anymore; that would create a corrupt archive referencing a non-existing object.

OTOH, storing a chunk into the repo that already exists there (but we did not know that) is only a performance issue, not a correctness problem.
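A rough sketch of that rule (illustrative only; repo.put() and the helper name are placeholders, not borg's real API):

def ensure_chunk(repo, chunk_index, chunk_id, data):
    # chunk_index is the in-memory set of object IDs built from the repo listing.
    if chunk_id in chunk_index:
        # Only safe because objects never disappear from data/ between
        # compact/check runs: an ID in an up-to-date index is really there.
        return
    # Not in the index: the chunk might still exist in the repo, but storing it
    # again only costs some space/time until the next borg compact; it can
    # never corrupt an archive.
    repo.put(chunk_id, data)        # placeholder for the real store call
    chunk_index.add(chunk_id)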

ThomasWaldmann commented 1 month ago

Implementation idea:

An up-to-date check and lockless operation (even when multiple borg processes of the same user on the same machine use the same repository) need more thought.
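One possible shape for persisting the chunkindex locally (purely hypothetical names, not what borg implements): store the ID set together with a token describing the repository state it was built from, and reuse it only while that token still matches. Lockless updates would additionally need something like write-to-temp-then-rename, which is left out here.

import pickle
from pathlib import Path

def save_chunkindex(cache_path, repo_state_token, chunk_ids):
    # Persist the IDs together with the repo state they were derived from.
    Path(cache_path).write_bytes(pickle.dumps((repo_state_token, set(chunk_ids))))

def load_chunkindex(cache_path, current_repo_state_token):
    # Return the cached IDs only if they are still up to date, else None.
    try:
        token, chunk_ids = pickle.loads(Path(cache_path).read_bytes())
    except (OSError, pickle.UnpicklingError):
        return None
    if token != current_repo_state_token:
        # The repo changed (e.g. compact ran) since the cache was written:
        # fall back to a full repository.list().
        return None
    return chunk_ids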

ThomasWaldmann commented 1 month ago

Another idea:

SpiritInAShell commented 1 month ago

(I do not have any deep understanding of the internal structures, so with the assumptions below I am only poking around.)

(borg2 beta10) (Ok, I see that this is closed. I will try the next beta as soon as possible.)

This was tested on a Hetzner Storage Box, accessed over a 100/50 MBit Telekom fibre connection.

If I understand correctly, the process discussed in this issue is logged as follows?

{
  "type": "log_message",
  "time": 1727224246.2930982,
  "message": "[chan 0] listdir(b'/repo.borg2beta10/data/a3/9e')",
  "levelname": "DEBUG",
  "name": "paramiko.transport.sftp"
}

As far as I know (I asked ChatGPT), the SFTP protocol does not have a "get me all subdirs" operation. This process is painfully slow.

The repo size is about 11GiB.

I mounted the remote directory with sshfs and did a time find, which I aborted after 10 minutes. A second run (done right now) took 451 seconds (I do not know whether sshfs used a cache on that second run).

I did a time rsync -avPi [user@server]:/[remote path] (same connection; the output is similar to a find -ls listing) and that took about 8 seconds.

As every directory read has significant overhead: what if the data chunks were put into numbered directories, each filled up to a defined number of files before a new one is started? Would reading a few directories with many files speed up the process?

I see that the chunks are sorted into subdirs derived from the first characters of their file names, and that will be for a reason. But is it worth the read overhead of the SFTP protocol? In the end, you create a local client-side index anyway, so you know which chunk is in which directory.
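For reference, the layout seen in the log above maps an object's hex ID to a nested path roughly like this (illustrative sketch, not borg's code):

def object_path(object_id_hex, base="data"):
    # e.g. an ID starting with "a39e..." ends up under data/a3/9e/
    return f"{base}/{object_id_hex[:2]}/{object_id_hex[2:4]}/{object_id_hex}"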

Maybe other access protocols like rclone (which I do not know at all) will not have these limitations, and then the subdir structure will be an advantage.

ThomasWaldmann commented 1 month ago

@SpiritInAShell I just merged some improvements, so please re-test with current master branch (best is to create a fresh repo).

ThomasWaldmann commented 1 month ago

The problem with sftp is described here: https://github.com/borgbackup/borgstore/issues/44

The only way to speed this up is to do fewer requests, which is what the current master branch / next beta will do.