Frostregen opened this issue 9 months ago
Hmm, IIRC borg extract always goes sequentially over the full archive, matching the pattern against each encountered item path and extracting the item if it matches.
So it is very likely (considering the usual borg recursion behaviour at borg create time) that initially the path pattern will not match (close to 0 transfer rate), then it will match for a while (high transfer rate, actually extracting stuff), then it will stop matching for the rest of the archive (close to 0 transfer rate again).
The "close to zero" rate is when borg only reads metadata from the archive metadata stream, not file content data.
So I guess we could "short circuit" that and stop borg extract once it leaves the desired folder. That would only work for the most simple use case (as you had) and only for the current single-threaded borg-internal recursor, though.
Multi-threading (in the future) and also an external recursor that behaves differently would be problematic.
So, could it be that borg has already finished extracting the data you wanted?
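Roughly, that behaviour boils down to something like this hypothetical sketch (all names are made up for illustration; this is not borg's actual code):

# Sequential pass over the whole item metadata stream; only matching items
# cause file content data to be transferred.
def extract_with_pattern(archive_items, matches, extract_item):
    for item in archive_items:           # always walks ALL items in the archive
        if matches(item.path):           # pattern is tested against every path
            extract_item(item)           # only here content data is fetched
        # non-matching items: only metadata was read (near-zero transfer rate)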
An alternative explanation for a widely varying transfer rate can be file size: lots of small files mean much more per-file overhead than a few big files, so the effective MB/s drops.
Yet another explanation can of course be outside factors: other load on the network or the disks, e.g. a RAID check/resync running in parallel.
Hmm, it seems strange that you see MAC authentication errors after running borg check --repair - I would expect that the check finds and "fixes" (uses all-zero replacement chunks) all such errors, so you should not see MAC errors afterwards.
Maybe check if your borg client hardware (which runs the crypto) works OK, especially CPU and RAM, but also do a SMART test on the disks.
Update: I was slightly wrong there: a plain borg check does not cryptographically verify the file content data; one needs --verify-data if one wants that (and has to wait quite a bit longer).
Another issue might be that you specified the extraction pattern with a leading slash. borg archives store paths without a leading slash (like tar), so such a pattern might not match at all.
Some borg versions had workarounds for this, but we later removed them because they only worked for the most simple patterns, not for more complex ones; see the change log.
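To illustrate the leading-slash point with a plain string comparison (made-up paths, not borg code):

stored_path = "path/into/subfolder/file.bin"              # as recorded in the archive (no leading slash)
typed_pattern = "/path/into/subfolder"                    # as given on the command line
print(stored_path.startswith(typed_pattern))              # False - would never match
print(stored_path.startswith(typed_pattern.lstrip("/")))  # True  - matches once the slash is stripped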
Hi Thomas,
first of all thanks a lot for your quick replies and suggestions!
Thank you for confirming! Yes, this was my suspicion as well, but it still "feels" slower compared to directly extracting everything. (The ~20% / ~14 TB until it failed were done in ~2 days.)
I also checked the "DISK READ" rate on the server. That also shows no read activity when it is close to zero. If it still scans metadata, I would have expected more activity on the server side (or even client side, if something is cached), but both seem to be more or less idle. The only thing I notice in those "close to zero" periods is one core on the client being at 100% - but I cannot see this reflected in any process's CPU% (using htop).
Probably not. The used disk space on the client is still very much below what I expect, and I can still see those occasional 100 MB/s peaks for some minutes. Also, I can still see the target folder growing (using du).
I agree, this is a very common pattern seen when copying files too. The smallest files are in the 100 KB to 500 KB range, but most are much bigger. Even with such small files, I would expect to see more than the 20 KB/s I'm seeing in the low-activity phases.
I made sure of that. I even SCP'd a large file between the very same client and server, from and to the very same RAID volume -> constant 100 MB/s, while borg sits at 20 KB/s. I'm using sudo htop to look at disk usage, and also checked cat /proc/mdstat to make sure no RAID check is running.
The error happened twice at the same % value when extracting, so I'm pretty confident it is something reproducible and not a client-side CPU/RAM issue.
Is there actually a difference between borg check --repair and borg check --verify-data --repair?
Other remarks on the MAC errors / repair: borg check --repair finished. Could it be that it silently exited prematurely? At least from searching the GitHub issues, I think there should have been some output about what it actually fixed.
I think I must be using a version with that workaround, as the extraction seems to work and files are populating where they should.
Could that leading slash be a cause of this slowdown? Or is it more like it just gets stripped from the pattern initially and that's it?
Thank you for your help so far!
I'm currently torn between running borg check --repair and extracting the full archive again.
If borg unexpectedly terminates/crashes, you should usually see a traceback.
But as long as nothing bad happens, there might not be much output, because the default log level is warning (-v aka --info sets it to info, --debug to debug).
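As a rough analogy with Python's standard logging module (just an illustration of the log levels, not borg's actual output or code):

import logging

logging.basicConfig(level=logging.WARNING)     # comparable to borg's default verbosity
logging.info("repairing chunk ...")            # info-level progress: suppressed by default
logging.warning("MAC Authentication failed")   # warnings/errors: still printed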
Not sure if you are dealing with some bug here - I can't remember fixing anything related since 1.2.0 - but you could install the fat 1.2.7 binary from the GitHub releases and give it a try.
Maybe it could also be some network issue, like e.g. a hanging connection?
Maybe we can send the running borg a signal to find out what it does.
About borg check: usually I let it verify all archives (so I only give the repo spec), but as you have a quite big repo (and maybe many archives), I can see why you don't want to do that. :-)
If you give --verify-data, it will cryptographically check the integrity of content data chunks also (otherwise it just checks if the referred content data chunks are present).
You could try sending SIGUSR1 and SIGUSR2 to the borg process.
That outputs some info and a traceback (but, IIRC, just continues after outputting that).
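For reference, a way to send those signals from Python (equivalent to kill -USR1 / kill -USR2 in a shell; the PID below is a placeholder):

import os
import signal

borg_pid = 12345                      # replace with the PID of the running borg client
os.kill(borg_pid, signal.SIGUSR2)     # dumps a Python traceback, then borg continues
os.kill(borg_pid, signal.SIGUSR1)     # reports file path/offset if borg is in extract_item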
SIGUSR1 did not seem to print anything, but SIGUSR2 dumped this:
Received SIGUSR2 at 2024-02-07 08:59:29, dumping trace...
Current thread 0x00007f481c115000 (most recent call first):
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 5049 in sig_trace_handler
File "/usr/lib/python3/dist-packages/borg/archive.py", line 812 in extract_item
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 907 in do_extract
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 198 in wrapper
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 183 in wrapper
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 5020 in run
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 5089 in main
File "/usr/bin/borg", line 33 in <module>
I will try the --repair on the repo spec, and hope it will take care of the integrity error.
Line 812 is borg writing content data while extracting a file.
Maybe try sending SIGUSR2 multiple times to see whether you get different locations in the code.
SIGUSR1 / SIGINFO should also print something IF borg is currently in extract_item.
Maybe this is also helpful:
https://github.com/borgbackup/borg/issues/2419#issuecomment-345118768
I already cancelled the slow borg extract /sub/folder now.
Currently I have moved the client-side .config/borg to the server and am running borg check --verify-data --repair --progress <repo spec> locally. Let's see if this fixes the IntegrityErrors during a full extract.
The full repair seems to have done something this time:
This is a potentially dangerous function.
check --repair might lead to data loss (for kinds of corruption it is not
capable of dealing with). BE VERY CAREFUL!
Type 'YES' if you understand this and want to continue: YES
chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273, integrity error: MAC Authentication failed
Found defect chunks. They will be deleted now, so affected files can get repaired now and maybe healed later.
Finished cryptographic data integrity verification, verified 32418621 chunks with 1 integrity errors.
2023-06-30-1688126422: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2023-07-31-1690804824: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2023-08-31-1693515612: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2023-09-30-1696107615: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2023-10-31-1698786013: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2023-11-30-1701378002: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2023-12-24-1703451615: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2023-12-31-1704056415: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2024-01-07-1704661215: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2024-01-14-1705266028: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2024-01-18-1705611601: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2024-01-19-1705698018: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2024-01-20-1705784428: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2024-01-21-1705870807: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2024-01-22-1705957248: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2024-01-23-1706043602: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2024-01-24-1706130008: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
57 orphaned objects found!
Archive consistency check complete, problems found.
I'm trying a full extract now, hopefully not stopping at the previously corrupt file anymore.
@Frostregen yeah, extract should "work" after that (but that one file will have that internal damage, where borg replaced the corrupt chunk with an all-zero chunk).
If you could still reproduce the now-missing chunk (by doing a borg create that produces that chunk) and then run another borg check --repair, borg would even heal that file in the archives.
Luckily we can restore this specific file from another source, so this is fine. The critical point/file in the borg extract has been passed, so I'm positive the rest will work fine now.
Thanks again for your support!
(Ticket can be closed from my point of view)
Ah, good. Then restore the file, do another backup with that file, then run borg check --repair again to heal the archives.
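A hedged sketch of that heal sequence (the repository URL, archive name, and data path below are placeholders, not values from this issue):

import subprocess

repo = "ssh://user@host:port/path/to/backup"
# 1. put the restored copy of the file back in its original place, then back it
#    up so the repository regains the chunk that was replaced with zeros:
subprocess.run(["borg", "create", f"{repo}::heal-source", "/path/to/data"], check=True)
# 2. another repair pass can then find the chunk again and heal the damaged
#    archives (check --repair asks interactively for "YES", as in the transcript above):
subprocess.run(["borg", "check", "--repair", repo], check=True)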
Reopening to review later whether the experience around this issue can be improved.
As reported, the specific corrupted file has been passed (around 21% of the full extract).
My current issue is that it is getting extremely slow again, meaning it has been sitting at 33% for more than 24 hours now.
Disk write speed is at 0.00 B/s when watching it in sudo htop.
I already tried sending SIGUSR1 and SIGUSR2, but so far I do not see any output.
Any advice would be welcome, how to investigate or proceed here.
@Frostregen you have a lot of RAM in these machines, but you also have a rather big repo, so maybe check whether you do run out of RAM on client or server.
SIGUSR2 should always print a traceback.
SIGUSR1 only prints information if it finds the extract_item function on the stack (then it will tell the file path and the seek position in that file).
Does your ssh/sshd setup use the ServerAlive / ClientAlive feature (so that it would automatically tear down a non-functioning connection)?
I checked the RAM usage (and the journalctl logs for any OOM-killer invocations), but it seems completely fine. As you can see from the previous image, RAM usage is somewhere around 3 GB (it looks the same server-side) - very far from the available RAM.
Yesterday (2 days later) the stack trace was actually printed out, and extraction continued (33% to 38%), so the SSH connection seems to be working.
Hmm, I'm a bit out of ideas here.
Maybe one could try to reproduce this issue on a developer machine and check whether anything unusual is going on, like it getting stuck somehow. Finding the place where it is stuck would already be useful (that's what I had hoped to get from signalling borg for a traceback).
I suspect there is a bug somewhere (could be in borg or elsewhere).
Just as an update: it is currently at 77%. The finish line is coming into sight :crossed_fingers:
I still do not have a good explanation for that behavior. Inspecting what the kernel is currently busy with points to ext4 block allocation. While I found some mailing list posts about this potentially being an issue, I cannot fully relate it to my current speed issue.
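For anyone hitting the same thing: one generic way to see what the kernel is doing on behalf of a process is to read its kernel stack via procfs (plain Linux procfs usage, nothing borg-specific; needs root, and the PID is a placeholder):

# print the kernel-side stack of the borg process; ext4 allocation/writeback
# frames showing up here would match the observation above
pid = 12345
with open(f"/proc/{pid}/stack") as f:
    print(f.read())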
I will most likely create a new borg repo from scratch with the same data, and then test another full extract.
For the record:
During that time I followed the kernel stacktrace / ext4 allocation issue track:
Hope it helps someone else :smile:
@Frostregen Interesting. Now the question is whether this is just random or whether it always fails with kernel 6.5 and always succeeds with 5.15. The latter would indicate a regression bug related to kernel 6.5 (if everything else is the same).
I tested twice with kernel 5.15 and multiple times with 6.5, always with the same respective behavior, so I'm pretty positive it is related to this. According to the second link, the issue might have been fixed in kernel 6.8. I will re-test once Ubuntu 24.04 is released in a few weeks.
Have you checked borgbackup docs, FAQ, and open GitHub issues?
Yes
Is this a BUG / ISSUE report or a QUESTION?
ISSUE
System information. For client/server mode post info for both machines.
Your borg version (borg -V).
Both client and server machine run: borg 1.2.0
Operating system (distribution) and version.
Client: Ubuntu 22.04.2
Server: Ubuntu 22.04.3
Hardware / network configuration, and filesystems used.
Client: 8C/16T CPU, 64 GB RAM, 12x 20 TB HDD in RAID6
Server: 6C/12T CPU, 128 GB RAM, 10x 10 TB HDD in RAID6
How much data is handled by borg?
Full borg commandline that lead to the problem (leave away excludes and passwords)
borg extract --progress ssh://<USER>@<IP>:<PORT>/path/to/backup/::2024-01-24-1706130008 /path/into/subfolder
Describe the problem you're observing.
Initially (on the first day) when extracting, the speed is as expected around 100MB/s (on a 1 Gbit/s line).
After some time the speed crawls down to 20KB/s, or even 0. Rarely it also goes back up to around 100MB/s.
This makes it look like it will take an estimated ~50 days instead of the anticipated ~2 days (the subfolder should be around 10 TB).
I suspect this has to do with specifying "/path/into/subfolder": when running a plain borg extract of the full archive (no subfolder specified), I could not observe that drop in speed.
The "fun" part is, that the repository also has some Integrity errors, so after ~21% of running a full extract (
borg extract --progress ssh://<repopath>::2024-01-24-1706130008
), I get:This is why I tried to extract subfolders at all.
I also ran borg check --verify-data beforehand, which reported errors. I followed that with a borg check --repair --progress ssh://<repopath>::2024-01-24-1706130008, but I still got the same IntegrityError when trying to extract everything afterwards. Would a borg check --repair --verify-data ... actually fix that issue?
I am a bit lost currently and would be happy for any pointers on how to proceed.
Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.
Yes. It always happens in a similar fashion (cancel and retry). I also tried borg mount + rsync, but that was similarly slow.
Include any warning/errors/backtraces from the system logs
Already above