Frostregen opened this issue 9 months ago
Hmm, IIRC borg extract always goes sequentially over the full archive, matching the pattern against each encountered item path and extracting the item if it matches.
So it is very likely (considering the usual borg recursion behaviour at borg create time) that initially the path pattern will not match (close to 0 transfer rate), then it will match for a while (high transfer rate, actually extracting stuff), then it will stop matching for the rest of the archive (close to 0 transfer rate again).
The "close to zero" rate is when borg only reads metadata from the archive metadata stream, not file content data.
So I guess we could "short circuit" that and stop borg extract once it leaves the desired folder. That would only work for the most simple use case (as you had) and only for the current single-threaded borg-internal recursor, though.
Multi-threading (in the future) and also an external recursor that behaves differently would be problematic.
So, could it be that borg has already finished extracting the data you wanted?
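Roughly, that behaviour boils down to something like this hypothetical sketch (all names are made up for illustration; this is not borg's actual code):

# Sequential pass over the whole item metadata stream; only matching items
# cause file content data to be transferred.
def extract_with_pattern(archive_items, matches, extract_item):
    for item in archive_items:           # always walks ALL items in the archive
        if matches(item.path):           # pattern is tested against every path
            extract_item(item)           # only here content data is fetched
        # non-matching items: only metadata was read (near-zero transfer rate)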
An alternative explanation for a widely varying transfer rate can be file size: lots of small files mean much more per-file overhead than a few big files, so the effective MB/s drops.
Yet another explanation can of course be outside factors: other load on the network or the disks, e.g. a RAID check/resync running in parallel.
Hmm, it seems strange that you see MAC authentication errors after running borg check --repair - I would expect that the check finds and "fixes" (uses all-zero replacement chunks) all such errors, so you should not see MAC errors afterwards.
Maybe check if your borg client hardware (which runs the crypto) works OK, especially CPU and RAM, but also do a SMART test on the disks.
Update: I was slightly wrong there: a plain borg check does not cryptographically verify the file content data; one needs --verify-data if one wants that (and has to wait quite a bit longer).
Another issue might be that you specified the extraction pattern with a leading slash. borg archives store paths without a leading slash (like tar), so such a pattern might not match at all.
Some borg versions had workarounds for this, but we later removed them because they only worked for the most simple patterns, not for more complex ones; see the change log.
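To illustrate the leading-slash point with a plain string comparison (made-up paths, not borg code):

stored_path = "path/into/subfolder/file.bin"              # as recorded in the archive (no leading slash)
typed_pattern = "/path/into/subfolder"                    # as given on the command line
print(stored_path.startswith(typed_pattern))              # False - would never match
print(stored_path.startswith(typed_pattern.lstrip("/")))  # True  - matches once the slash is stripped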
Hi Thomas,
first of all thanks a lot for your quick replies and suggestions!
Thank you for confirming! Yes, this was my suspicion as well, but it still "feels" slower compared to directly extracting everything. (The ~20% / ~14 TB until it failed were done in ~2 days.)
I also checked the "DISK READ" rate on the server. That also shows no read activity when it is close to zero. If it still scans metadata, I would have expected more activity on the server side (or even client side, if something is cached), but both seem to be more or less idle. The only thing I notice in those "close to zero" periods is one core on the client being at 100% - but I cannot see this reflected in any process's CPU% (using htop).
Probably not. The used disk space on the client is still very much below what I expect, and I can still see those occasional 100 MB/s peaks for some minutes. Also, I can still see the target folder growing (using du).
I agree, this is a very common pattern seen when copying files too. The smallest files are in the 100 KB to 500 KB range, but most are much bigger. Even with such small files, I would expect to see more than the 20 KB/s I'm seeing in the low-activity phases.
I made sure of that. I even SCP'd a large file between the very same client and server, from and to the very same RAID volume -> constant 100 MB/s, while borg sits at 20 KB/s. I'm using sudo htop to look at disk usage, and also checked cat /proc/mdstat to make sure no RAID check is running.
The error happened twice at the same % value when extracting, so I'm pretty confident it is something reproducible and not a client-side CPU/RAM issue.
Is there actually a difference between borg check --repair and borg check --verify-data --repair?
Other remarks on the MAC errors / repair: borg check --repair finished. Could it be that it silently exited prematurely? At least from searching the GitHub issues, I think there should have been some output about what it actually fixed.
I think I must be using a version with that workaround, as the extraction seems to work and files are populating where they should.
Could that leading slash be a cause of this slowdown? Or is it more like it just gets stripped from the pattern initially and that's it?
Thank you for your help so far!
I'm currently torn between running borg check --repair and extracting the full archive again.
If borg unexpectedly terminates/crashes, you should usually see a traceback.
But as long as nothing bad happens, there might not be much output, because the default log level is warning (-v aka --info sets it to info, --debug to debug).
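As a rough analogy with Python's standard logging module (just an illustration of the log levels, not borg's actual output or code):

import logging

logging.basicConfig(level=logging.WARNING)     # comparable to borg's default verbosity
logging.info("repairing chunk ...")            # info-level progress: suppressed by default
logging.warning("MAC Authentication failed")   # warnings/errors: still printed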
Not sure if you are dealing with some bug here - I can't remember fixing anything related since 1.2.0 - but you could install the fat 1.2.7 binary from the GitHub releases and give it a try.
Maybe it could also be some network issue, like e.g. a hanging connection?
Maybe we can send the running borg a signal to find out what it does.
About borg check: usually I let it verify all archives (so I only give the repo spec), but as you have a quite big repo (and maybe many archives), I can see why you don't want to do that. :-)
If you give --verify-data, it will cryptographically check the integrity of content data chunks also (otherwise it just checks if the referred content data chunks are present).
You could try sending SIGUSR1 and SIGUSR2 to the borg process.
That outputs some info and a traceback (but, IIRC, just continues after outputting that).
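For reference, a way to send those signals from Python (equivalent to kill -USR1 / kill -USR2 in a shell; the PID below is a placeholder):

import os
import signal

borg_pid = 12345                      # replace with the PID of the running borg client
os.kill(borg_pid, signal.SIGUSR2)     # dumps a Python traceback, then borg continues
os.kill(borg_pid, signal.SIGUSR1)     # reports file path/offset if borg is in extract_item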
SIGUSR1 did not seem to print anything, but SIGUSR2 dumped this:
Received SIGUSR2 at 2024-02-07 08:59:29, dumping trace...
Current thread 0x00007f481c115000 (most recent call first):
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 5049 in sig_trace_handler
File "/usr/lib/python3/dist-packages/borg/archive.py", line 812 in extract_item
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 907 in do_extract
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 198 in wrapper
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 183 in wrapper
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 5020 in run
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 5089 in main
File "/usr/bin/borg", line 33 in <module>
I will try the --repair on the repo spec, and hope it will take care of the integrity error.
Line 812 is borg writing content data while extracting a file.
Maybe try sending SIGUSR2 multiple times to see whether you get different locations in the code.
SIGUSR1 / SIGINFO should also print something IF borg is currently in extract_item.
Maybe this is also helpful:
https://github.com/borgbackup/borg/issues/2419#issuecomment-345118768
I already cancelled the slow borg extract /sub/folder now.
Currently I have moved the client-side .config/borg to the server and am running borg check --verify-data --repair --progress <repo spec> locally. Let's see if this fixes the IntegrityErrors during a full extract.
The full repair seems to have done something this time:
This is a potentially dangerous function.
check --repair might lead to data loss (for kinds of corruption it is not
capable of dealing with). BE VERY CAREFUL!
Type 'YES' if you understand this and want to continue: YES
chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273, integrity error: MAC Authentication failed
Found defect chunks. They will be deleted now, so affected files can get repaired now and maybe healed later.
Finished cryptographic data integrity verification, verified 32418621 chunks with 1 integrity errors.
2023-06-30-1688126422: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2023-07-31-1690804824: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2023-08-31-1693515612: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2023-09-30-1696107615: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2023-10-31-1698786013: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2023-11-30-1701378002: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2023-12-24-1703451615: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2023-12-31-1704056415: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2024-01-07-1704661215: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2024-01-14-1705266028: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2024-01-18-1705611601: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2024-01-19-1705698018: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2024-01-20-1705784428: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2024-01-21-1705870807: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2024-01-22-1705957248: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2024-01-23-1706043602: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
2024-01-24-1706130008: data/xyz.zip: New missing file chunk detected (Byte 447908978693-447910088265, Chunk 92b1f06092a33feb486aff54d075de18ed072ff2f4e265826b3e7df9deaf6273). Replacing with all-zero chunk.
57 orphaned objects found!
Archive consistency check complete, problems found.
I'm trying a full extract now, hopefully not stopping at the previously corrupt file anymore.
@Frostregen yeah, extract should "work" after that (but that one file will have that internal damage, where borg replaced the corrupt chunk with an all-zero chunk).
If you could still reproduce the now-missing chunk (by doing a borg create that produces that chunk) and then run another borg check --repair, borg would even heal that file in the archives.
Luckily we can restore this specific file from another source, so this is fine. The critical point/file in the borg extract has been passed, so I'm positive the rest will work fine now.
Thanks again for your support!
(Ticket can be closed from my point of view)
Ah, good. Then restore the file, do another backup with that file, then run borg check --repair again to heal the archives.
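A hedged sketch of that heal sequence (the repository URL, archive name, and data path below are placeholders, not values from this issue):

import subprocess

repo = "ssh://user@host:port/path/to/backup"
# 1. put the restored copy of the file back in its original place, then back it
#    up so the repository regains the chunk that was replaced with zeros:
subprocess.run(["borg", "create", f"{repo}::heal-source", "/path/to/data"], check=True)
# 2. another repair pass can then find the chunk again and heal the damaged
#    archives (check --repair asks interactively for "YES", as in the transcript above):
subprocess.run(["borg", "check", "--repair", repo], check=True)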
Reopening to review later whether the experience around this issue can be improved.
As reported, the specific corrupted file has been passed (around 21% of the full extract).
My current issue is that it is getting extremely slow again, meaning it has been sitting at 33% for more than 24 hours now.
Disk write speed is at 0.00 B/s when watching it in sudo htop.
I already tried sending SIGUSR1 and SIGUSR2, but so far I do not see any output.
Any advice would be welcome, how to investigate or proceed here.
@Frostregen you have a lot of RAM in these machines, but you also have a rather big repo, so maybe check whether you do run out of RAM on client or server.
SIGUSR2 should always print a traceback.
SIGUSR1 only prints information if it finds the extract_item function on the stack (then it will tell the file path and the seek position in that file).
Does your ssh/sshd setup use the ServerAlive / ClientAlive feature (so that it would automatically tear down a non-functioning connection)?
I checked the RAM usage (and the journalctl logs for any OOM-killer invocations), but it seems completely fine. As you can see from the previous image, RAM usage is somewhere around 3 GB (it looks the same server-side) - very far from the available RAM.
Yesterday (2 days later) the stack trace was actually printed out, and extraction continued (33% to 38%), so the SSH connection seems to be working.
Hmm, I'm a bit out of ideas here.
Maybe one could try to reproduce this issue on a developer machine and check whether anything unusual is going on, like it getting stuck somehow. Finding the place where it is stuck would already be useful (that's what I had hoped to get from signalling borg for a traceback).
I suspect there is a bug somewhere (could be in borg or elsewhere).
Just as an update: it is currently at 77%. The finish line is coming into sight :crossed_fingers:
I still do not have a good explanation for that behavior. Inspecting what the kernel is currently busy with points to ext4 block allocation. While I found some mailing list posts about this potentially being an issue, I cannot fully relate it to my current speed issue.
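For anyone hitting the same thing: one generic way to see what the kernel is doing on behalf of a process is to read its kernel stack via procfs (plain Linux procfs usage, nothing borg-specific; needs root, and the PID is a placeholder):

# print the kernel-side stack of the borg process; ext4 allocation/writeback
# frames showing up here would match the observation above
pid = 12345
with open(f"/proc/{pid}/stack") as f:
    print(f.read())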
I will most likely create a new borg repo from scratch with the same data, and then test another full extract.
For the record:
During that time I followed the kernel stacktrace / ext4 allocation issue track:
Hope it helps someone else :smile:
@Frostregen Interesting. Now the question is whether this is just random or whether it always fails with kernel 6.5 and always succeeds with 5.15. The latter would indicate a regression bug related to kernel 6.5 (if everything else is the same).
I tested twice with kernel 5.15 and multiple times with 6.5, always with the same respective behavior, so I'm pretty positive it is related to this. According to the second link, the issue might have been fixed in kernel 6.8. I will re-test once Ubuntu 24.04 is released in a few weeks.
Have you checked borgbackup docs, FAQ, and open GitHub issues?
Yes
Is this a BUG / ISSUE report or a QUESTION?
ISSUE
System information. For client/server mode post info for both machines.
Your borg version (borg -V).
Both client and server machine run: borg 1.2.0
Operating system (distribution) and version.
Client: Ubuntu 22.04.2
Server: Ubuntu 22.04.3
Hardware / network configuration, and filesystems used.
Client: 8C/16T CPU, 64 GB RAM, 12x 20 TB HDD in RAID6
Server: 6C/12T CPU, 128 GB RAM, 10x 10 TB HDD in RAID6
How much data is handled by borg?
Full borg commandline that lead to the problem (leave away excludes and passwords)
borg extract --progress ssh://<USER>@<IP>:<PORT>/path/to/backup/::2024-01-24-1706130008 /path/into/subfolder
Describe the problem you're observing.
Initially (on the first day) when extracting, the speed is as expected around 100MB/s (on a 1 Gbit/s line).
After some time the speed crawls down to 20KB/s, or even 0. Rarely it also goes back up to around 100MB/s.
This makes it look like it will take an estimated ~50 days instead of the anticipated ~2 days (the subfolder should be around 10 TB).
I suspect this has to do with specifying "/path/into/subfolder": when running a plain borg extract of the full archive (no subfolder specified), I could not observe that drop in speed.
The "fun" part is, that the repository also has some Integrity errors, so after ~21% of running a full extract (
borg extract --progress ssh://<repopath>::2024-01-24-1706130008
), I get:This is why I tried to extract subfolders at all.
I also ran borg check --verify-data beforehand, which reported errors. I followed that with a borg check --repair --progress ssh://<repopath>::2024-01-24-1706130008, but I still got the same IntegrityError when trying to extract everything afterwards. Would a borg check --repair --verify-data ... actually fix that issue?
I am a bit lost currently and would be happy for any pointers on how to proceed.
Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.
Yes. It always happens in a similar fashion (cancel and retry). I also tried borg mount + rsync, but that was similarly slow.
Include any warning/errors/backtraces from the system logs
Already above