Closed: ThomasWaldmann closed this issue 1 month ago.
Analysis:

There are only a few borg2 commands that remove objects from data/ in the store: borg compact, borg check, and the low-level borg debug and borg repo-delete.

Notably, borg delete and borg prune do NOT delete objects from data/; they only remove the corresponding entries from archives/.

So, the set of objects in data/ is always increasing until compact/check is run (we can ignore borg debug and borg repo-delete).
borg create must not assume a chunk is in the repo when in fact it isn't anymore; that would create a corrupt archive referencing a non-existing object.

OTOH, storing a chunk into the repo that already exists there (but that we did not know about) is only a performance issue, not otherwise a problem.
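To make that asymmetry concrete, here is a minimal sketch of the decision rule. This is not borg code; ChunkEntry / process_chunk are made-up names, and plain dicts stand in for the real index and store:

```python
# Minimal sketch of the safe deduplication rule described above.
# NOT actual borg code: ChunkEntry and the dict-based index/store are stand-ins.
from dataclasses import dataclass


@dataclass
class ChunkEntry:
    refcount: int
    size: int


def process_chunk(chunk_id: bytes, data: bytes,
                  chunk_index: dict[bytes, ChunkEntry],
                  store: dict[bytes, bytes]) -> None:
    """Decide whether a chunk must be uploaded before referencing it."""
    if chunk_id in chunk_index:
        # Safe only because objects are never removed from data/ behind
        # our back (deletion happens solely via compact/check).
        chunk_index[chunk_id].refcount += 1
        return
    # Unknown chunk: upload it. If it already existed in the repo, the
    # extra put is redundant (a performance cost), never a corruption risk.
    store[chunk_id] = data
    chunk_index[chunk_id] = ChunkEntry(refcount=1, size=len(data))
```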
Implementation idea:
An up-to-date check and lockless operation (even if multiple borg processes of the same user on the same machine use the same repository) need more thought.
Another idea:

borg compact does a repository.list() anyway (and holds an exclusive lock), so it could build the list of object ids (after it has deleted objects due to garbage collection) and store it into the repository (cache/* maybe?) and also into the local cache.
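A rough sketch of that idea, with hypothetical names throughout (persist_object_ids, a store_write callable, a cache/object_ids key); the real key layout and serialization in borg/borgstore would certainly differ:

```python
# Rough sketch: after garbage collection, compact persists the surviving
# object ids so later commands can skip the expensive repository.list().
# All names here are hypothetical.
import os


def persist_object_ids(object_ids: list[bytes],
                       store_write,        # callable(key: str, value: bytes)
                       local_cache_dir: str) -> None:
    """Store the post-GC object id list in the repo and in a local cache."""
    blob = b"".join(sorted(object_ids))          # assumes fixed-size ids
    store_write("cache/object_ids", blob)        # e.g. under cache/* in the repo
    os.makedirs(local_cache_dir, exist_ok=True)
    with open(os.path.join(local_cache_dir, "object_ids"), "wb") as f:
        f.write(blob)
```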
(I do not have any deep understanding of the internal structures, so when assuming the following, I am only poking around.)

(borg2 beta10) (Ok, I see that this is closed. I will try the next beta as soon as possible.)
This was tested on a Hetzner storagebox, accessed from a 100/50 MBit Telekom fibre connection.

If I understand correctly, the process described in this issue is logged as follows?
```json
{
    "type": "log_message",
    "time": 1727224246.2930982,
    "message": "[chan 0] listdir(b'/repo.borg2beta10/data/a3/9e')",
    "levelname": "DEBUG",
    "name": "paramiko.transport.sftp"
}
```
As far as I know (I asked ChatGPT), the SFTP protocol does not have a "get me all subdirs" operation. This process is painfully slow.

The repo size is about 11 GiB.
I mounted the remote directory with sshfs and ran a `time find`; I aborted it after 10 minutes. I did another run right after, which took 451 seconds (not knowing whether sshfs used a cache on that second run).

I ran a `time rsync -avPi [user@server]:/[remote path]` over the same connection (its output is similar to a `find -ls` listing) and that took about 8 seconds.
As every directory read has significant overhead: what would happen if the data chunks were put into numbered directories, filling a directory until it is "full" or "saturated" with a defined number of files? Would reading a few directories with many files speed up the process?

I see that the chunks are sorted into subdirs based on the first 2 chars of their filename, and that will be for a reason. But is it worth the read overhead of the SFTP protocol? In the end, you create a local client-side index anyway, so you know which chunk is in which directory.
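For illustration only (the layout is inferred from the listdir log above, e.g. data/a3/9e, and from the 256+65536 figure later in this thread, not from reading borgstore's code): if the nesting uses the first two and the next two hex characters of the object id, the client-side index can indeed compute the directory of any known chunk directly:

```python
# Illustration only: assumed 2-level layout data/<id[0:2]>/<id[2:4]>/<id>,
# inferred from log lines like listdir(b'.../data/a3/9e').
def object_path(chunk_id_hex: str) -> str:
    """Return the store-relative path of an object for this assumed layout."""
    return f"data/{chunk_id_hex[:2]}/{chunk_id_hex[2:4]}/{chunk_id_hex}"


print(object_path("a39e" + "00" * 30))  # data/a3/9e/a39e0000...
```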
Maybe other access protocols like rclone (which I do not know at all) will not have these limitations, and then the subdir structure will be an advantage.
@SpiritInAShell I just merged some improvements, so please re-test with current master branch (best is to create a fresh repo).
The problem with sftp is described there: https://github.com/borgbackup/borgstore/issues/44
The only way to speed this up is to make fewer requests, which is what the current master branch / next beta will do.
Problem:
That function does a repository.list(), listing all the object IDs in the repo to build an in-memory chunkindex.

Because all objects are stored separately in a 2-levels-deep directory structure, that means (1+)256+65536 listdir() calls in the worst case. Depending on store speed, connection latency, etc., that can take quite a while.
The in-memory chunkindex is currently not persisted to local cache.
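To put a number on that worst case, a small back-of-the-envelope calculation; the 20 ms round-trip time is an assumed example value, not a measurement from this thread:

```python
# Worst case for a full repository.list() over a store with a 2-level
# directory layout: 1 listdir of data/, 256 of data/xx/, 65536 of data/xx/yy/.
top = 1
level1 = 256            # data/00 .. data/ff
level2 = 256 * 256      # data/xx/00 .. data/xx/ff
calls = top + level1 + level2
print(calls)            # 65793 listdir() calls

# If every call costs one network round trip (no pipelining), at an
# ASSUMED 20 ms RTT this alone already takes:
rtt = 0.020
print(f"{calls * rtt / 60:.1f} minutes")   # about 21.9 minutes
```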