Probably not of much help to you, but I am seeing these errors as well, quite often. I back up to two hosts (on-site, off-site); both servers use borg 1.2.2.
Tonight a weekly check of two of the repositories failed (same source, different destinations). When I ran a --repair, I found these, as usual:
16 orphaned objects found!
Deleting 16 orphaned and 266 superseded objects...
and
6 orphaned objects found!
Deleting 6 orphaned and 321 superseded objects...
The client uses borg 1.1.18. The backup script doesn't run any compact, only create, prune and check.
From what I know, the borg processes haven't received any kill signals or been interrupted in other ways since the last repair, but I could be wrong there. I also quite often run another check after a repair, which has always been successful. In other words, it's not the same error persisting; it's new orphaned objects that show up.
I don't think I ever saw this issue before upgrading the server to 1.2.x.
Note: borg 1.1.x clients implicitly run the same segment compaction (as triggered by borg compact
with 1.2 clients) automatically when writing to the repo.
Same kind of problems here:
The server being backed up is using 1.1.15, the server receiving backups is using 1.2.0.
I just started to hit this problem with the local backups. The repo was created with 1.1.17; the issues started to happen after the upgrade to 1.2.0.
Repairing seems to have fixed it for now[...]
Some follow-up from me: Whenever I hit the error in the past, I repaired the corresponding repo, and since the end of October last year I haven't encountered the issue at all anymore. Before that it hit me every few days, so it seems Borg is able to fix things once and for all at some point. It might be an additional hint that I introduced some error into the repo when I reverted my ZFS snapshots, and that something new in Borg 1.2 either got triggered by that or recognized the problem first.
Not true for me. I have two servers creating multiple backups in the same repository, one running on Tuesdays, the other on Wednesdays. After every 2nd-4th run (it might also happen after a single run, I am not 100% certain) the check fails. I always repair it as soon as I see the errors, and I also run a check afterwards to double-check that it's clean, which it always has been.
Example from yesterday:
1 orphaned objects found!
Deleting 1 orphaned and 142 superseded objects...
Finished deleting orphaned/superseded objects.
Writing Manifest.
Committing repo.
Archive consistency check complete, problems found.
These two servers actually have two destinations. For some reason I am getting "corruption" on one of them more often than the other; I can't explain why. The servers are using the same borg version. These are heavily used servers with a lot of activity and no other issues, ECC RAM of course. I run create, prune and compact towards both destinations just as often.
I have also hit the same problem (or very similar):
1. I was using borgbackup 1.1.15
2. Stopped borgbackup, and installed 1.2.3
3. Ran one create and prune, followed by a check. No errors.
4. Did "compact --cleanup-commits", followed by a check. Now I got this error:
Index object count mismatch.
committed index: 1413156 objects
rebuilt index: 1413157 objects
ID: 28c65c17ab1969f61b740e477e560594c2971cb0457301410f49469b81596c99 rebuilt index: (6478, 272196528) committed index: <not found>
Finished full repository check, errors found.
Trying to do a check --repair now to see if it fixes this (takes some time due to a large repo and slooooow disks ☹️).
Update:
I ran check --repair, which gave the following output:
This is a potentially dangerous function.
check --repair might lead to data loss (for kinds of corruption it is not
capable of dealing with). BE VERY CAREFUL!
Type 'YES' if you understand this and want to continue: YES
1 orphaned objects found!
Archive consistency check complete, problems found.
Then I ran check again, and it did not find any problems. So it seems like check --repair did resolve my problem for now.
Alright, chiming in. This repository was running smoothly for several months. The access pattern is roughly:
Today, the weekly check triggered and found an object with a PUT, but without an entry in the on-disk index.
borg check --debug
I dug deeper and can confirm that there is a PUT in that segment file for that object, that the object is valid (CRC check passes), but that the object is not contained in the on-disk index.
Note that I run three prune commands (with different -P/-a) in series before the compact command is run. I wonder if this is a corner case which can only be triggered if compact is called after multiple prune operations, instead of right after a prune?
The repository is small enough for me to copy for later investigations, but unfortunately I cannot share it as it is unencrypted.
More information:
Borg version 1.2.3 (Debian backports 1.2.3-1~bpo11+1)
Storage is a USB-attached SATA disk (hosting a bunch of other borg repos so far without issues)
No ECC RAM
You can ping me in IRC for more details (jssfr).
@horazont check if the object is actually used or orphaned.
Also, did you have index entries with no corresponding segment file entry? Maybe due to a bit flip in the chunk id?
check if the object is actually used or orphaned.
Using custom tooling, I scanned all chunk lists of all items from all archives in that repository, and no entry referenced this particular chunk ID.
I'm now running borg check --repair -v (after having taken a copy of the repository for later investigation) to confirm this.
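For readers without such custom tooling, a rough equivalent can be scripted around stock borg commands. The sketch below is a hypothetical helper, not the tooling mentioned above; it assumes borg list --short and borg debug dump-archive behave as in borg 1.2.x and that the dump contains the items' chunk ids as hex strings, so a plain text search is sufficient:

#!/usr/bin/env python3
# Hypothetical helper: check whether any archive item references a chunk id.
# Assumes `borg debug dump-archive` writes chunk ids as hex strings (1.2.x).
import os
import subprocess
import sys
import tempfile

def archive_names(repo):
    out = subprocess.run(["borg", "list", "--short", repo],
                         check=True, capture_output=True, text=True)
    return out.stdout.split()

def referencing_archive(repo, chunk_id_hex):
    for name in archive_names(repo):
        fd, path = tempfile.mkstemp()
        os.close(fd)
        try:
            subprocess.run(["borg", "debug", "dump-archive",
                            f"{repo}::{name}", path], check=True)
            with open(path, errors="replace") as f:
                if chunk_id_hex in f.read():
                    return name                 # referenced by this archive
        finally:
            os.unlink(path)
    return None                                 # no reference -> likely orphan

if __name__ == "__main__":
    repo, chunk_id = sys.argv[1], sys.argv[2]
    hit = referencing_archive(repo, chunk_id)
    print(f"referenced by {hit}" if hit else "no reference found (likely orphan)")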
Also, did you have index entries with no corresponding segment file entry?
No, there are no superfluous index entries (again, checked with custom tooling; borg check --repair is still pending).
I also assume that this would've been part of the borg check --debug output above?
Would it be useful if I added the --debug switch to my nightly borg compact runs to gather more data in case this happens again?
borg check -v --repair returned [...], so it seems the chunk was, indeed, orphaned.
Note that I run three prune commands (with different -P/-a) in series before the compact command is run. I wonder if this is a corner case which can only be triggered if compact is called after multiple prune operations, instead of right after a prune?
My stress testing script does many prunes between each call to compact, and it doesn't trigger the orphans even when run for a long time. Incidentally, I had another orphan yesterday, and also about a month ago, so this keeps happening to my repos. I have --debug output from compacting that I will post as soon as the repair is done.
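For context, such a stress loop can be as simple as repeatedly creating, pruning and compacting against a throwaway repo and checking after each round. The following is only a sketch of the idea (the actual script is not shown in this thread); REPO and SRC are hypothetical paths, and only standard borg commands are used:

#!/usr/bin/env python3
# Sketch of a prune/compact stress loop (not the script referenced above).
import subprocess
import time

REPO = "/tmp/stress.borg"    # hypothetical throwaway repo
SRC = "/tmp/stress-src"      # hypothetical source data, changed between rounds

def borg(*args):
    subprocess.run(["borg", *args], check=True)

borg("init", "--encryption=none", REPO)
for round_no in range(100):
    # several creates/prunes between each compact, as described above
    for i in range(3):
        borg("create", f"{REPO}::round-{round_no}-{i}-{int(time.time())}", SRC)
        borg("prune", "--keep-last", "5", REPO)
    borg("compact", REPO)
    borg("check", REPO)      # a failure here would reproduce the orphan issue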
The output of borg compact --debug is large, so it is in this attached file: borg-compact-output.txt
The output of borg check is:
Starting: borg check /Backups/borg/home.borg
Starting repository check
finished segment check at segment 41655
Starting repository index check
Index object count mismatch.
committed index: 1401940 objects
rebuilt index: 1401941 objects
ID: b8e7f18ca5571c6141261e8a7771bac9b40f0d1435c5b2c6f131582d1752be3e
rebuilt index: (14263, 199156908) committed index: <not found>
Finished full repository check, errors found.
The output of borg check --repair is:
# borg check -v --repair ; chown -R bu:bu $BORG_REPO
This is a potentially dangerous function.
check --repair might lead to data loss (for kinds of corruption it is not
capable of dealing with). BE VERY CAREFUL!
Type 'YES' if you understand this and want to continue: YES
Starting repository check
finished segment check at segment 41695
Starting repository index check
Finished full repository check, no problems found.
Starting archive consistency check...
Analyzing archive host1-home-20170329-10:51:14 (1/415)
...
Analyzing archive hostN-home-20230206-12:11:01 (415/415)
1 orphaned objects found!
Deleting 1 orphaned and 415 superseded objects...
Finished deleting orphaned/superseded objects.
Writing Manifest.
Committing repo.
Archive consistency check complete, problems found.
As expected, it was an orphaned object, and the repair was successful.
@jdchristensen the segment with the orphaned object looked like "nothing special" AFAICS in the logs.
@ThomasWaldmann So the compaction logs don't give enough information to figure out what is going wrong there? (Or maybe the bug isn't in borg compact? But people have given examples where the repo passes borg check but then fails borg check after compaction.)
@jdchristensen It didn't compact the segment file that contained the orphaned chunk.
@horazont if you still have the orphaned chunk, you could have a look inside it to check what it is: file content data? item metadata stream chunk? archive item data?
also, if you still have the repo index, you could examine ALL buckets if you can find the entry for the orphan. theoretically it could be a bug in the hashtable code so that it does not find stuff under certain circumstances although it is there. we had such a bug once (before 1.1.11 iirc), see the advisory in the changelog. the latter would explain the repo index discrepancy.
why it is orphaned is another issue though, but borg might produce orphans e.g. for input files with OSErrors (like IOError).
I see this issue quite frequently in two of my repositories. Would it be helpful if I stopped pruning/compacting to see if that stops reproducing the issue?
@horazont btw, if you write custom tooling in python, we could add some more borg debug ... commands.
@magma1447 yeah, I guess that should limit the amount of code we have to look at. It maybe still leaves a bit of uncertainty in case it does not happen (then the question is how long to proceed without pruning). Do you monitor free space?
I guess I have some fun now writing a hashtable stress tester. 🤣
The stress tester (see #7324) didn't show anything remarkable in my short test runs.
% BORG_TESTS_SLOW=1 pytest -k test_hashindex_stress
@ThomasWaldmann I have plenty of space so I can live without pruning for a few weeks.
I'll try to edit the relevant scripts tomorrow. If I don't see any issues in two weeks I am 99% certain that those commands are relevant.
@ThomasWaldmann
also, if you still have the repo index, you could examine ALL buckets if you can find the entry for the orphan. theoretically it could be a bug in the hashtable code so that it does not find stuff under certain circumstances although it is there. we had such a bug once (before 1.1.11 iirc), see the advisory in the changelog. the latter would explain the repo index discrepancy.
I went low-tech here and fed the object ID to hexedit to search for it. The 32-bit number following it was 0xfffffffe, which marks a deleted entry. This does not technically rule out a bug in the hashindex code w.r.t. bucketing, but it seems to make that rather unlikely, right?
00E467E0 E8 BB 79 6D B7 D1 C3 B6 A0 52 71 1B 00 00 09 67 84 0A C0 69 F0 BF C6 79 ..ym.....Rq....g...i...y
00E467F8 39 4F F8 10 7E 56 57 8B 44 6E CA C8 40 CF AC 96 33 50 28 21 9B 63 BB F4 9O..~VW.Dn..@...3P(!.c..
00E46810 E9 5D FE FF FF FF 56 D9 62 1A 6C C6 07 74 D6 95 23 93 B8 62 B2 47 AF 54 .]....V.b.l..t..#..b.G.T
(The chunk ID is the sequence starting with C0 69 F0 BF at the end of the first row.)
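The same low-tech search can be scripted. A minimal sketch, assuming the layout implied above: a 32-byte key followed by a little-endian 32-bit value, where 0xfffffffe marks a deleted and 0xffffffff an empty bucket:

#!/usr/bin/env python3
# Find a chunk id in a repo index file and print the 32-bit value after it.
# Sketch only; assumes the bucket layout described in the comment above.
import struct
import sys

DELETED = 0xfffffffe
EMPTY = 0xffffffff

def scan(index_path, chunk_id_hex):
    key = bytes.fromhex(chunk_id_hex)
    data = open(index_path, "rb").read()
    pos = data.find(key)
    while pos != -1:
        value = struct.unpack_from("<I", data, pos + len(key))[0]
        state = {DELETED: "deleted", EMPTY: "empty"}.get(value, "in use")
        print(f"offset {pos:#x}: value {value:#010x} ({state})")
        pos = data.find(key, pos + 1)

if __name__ == "__main__":
    scan(sys.argv[1], sys.argv[2])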
if you still have the orphaned chunk, you could have a look inside it to check what it is: file content data? item metadata stream chunk? archive item data?
This looks like an item metadata stream chunk (if I got the term right): it starts (roughly) with a file path, followed by strings like "uid", "user", "chunks" (followed by some noise), "ctime", "hardlink", "group" etc. I'm happy to share the first few hundred bytes with you for confirmation in private, but I'd rather not post this publicly, to be safe.
@horazont ok, so you had:
- the entry for that chunk deleted from the repo index (and likely also a committed DEL operation for that chunkid in the segment files)
- check complained about the same chunk being orphaned (that means it is not considered deleted, but not referenced by anything)
So, did we lose the DEL in the segment file somehow? (There is some rather complicated code about this in compact_segments.)
That sounds about right.
I gave the code in compact_segments a quick glance yesterday. The only way I could figure out that this could happen is the shadow_index being wrong, but I'm not familiar enough with how that is built to understand how that could happen.
Is there documentation on the shadow index format somewhere? I could look at the hints file from before the repair (but after the compact), though I suspect that might be too late (I think I spotted a del self.shadow_index[key] in the same branch which also deletes the DEL during compaction). (Currently on a break at work, so I cannot look deeply into the docs or code.)
I'll set a reminder to make a copy of the repository before the next compact run in the hopes that this occurs again to track it down.
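In case it helps anyone poking at the same question: the hints.<transaction_id> file in the repo is msgpack-serialized, and in 1.2 it appears to carry the persisted shadow index. A hedged sketch for looking up one chunk id in it follows; the key names (in particular 'shadow_index', mapping chunk id bytes to a list of segment numbers with shadowed PUTs) are assumptions from reading the 1.2 repository code, and recent 1.2.x versions may also offer borg debug dump-hints (check borg debug --help):

#!/usr/bin/env python3
# Sketch: look up a chunk id in the shadow index persisted in a hints file.
# Assumption: the hints file is a msgpack-packed dict with a 'shadow_index'
# key mapping chunk id bytes -> list of segment numbers (verify per version).
import sys
import msgpack  # pip install msgpack

def lookup(hints_path, chunk_id_hex):
    with open(hints_path, "rb") as f:
        hints = msgpack.unpack(f, raw=True)   # raw=True keeps bytes keys
    shadow = hints.get(b"shadow_index", {})
    entry = shadow.get(bytes.fromhex(chunk_id_hex))
    if entry is None:
        print("no shadow_index entry for this chunk id")
    else:
        print(f"shadowed PUTs recorded in segments: {list(entry)}")

if __name__ == "__main__":
    lookup(sys.argv[1], sys.argv[2])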
Alright, instead of a reminder, I reconfigured the nightly runs to do the following:
ExecStart=/usr/bin/rsync -ra --delete /mnt/quadup/borg/adrastea/ /mnt/quadup/borg/adrastea.pre-prune-copy/
ExecStart=/usr/bin/borg prune '-a--*' --stats --list -d7 /mnt/quadup/borg/adrastea
ExecStart=/usr/bin/borg prune -Pvar-lib- --stats --list -d7 /mnt/quadup/borg/adrastea
ExecStart=/usr/bin/borg prune -Pvar-www- --stats --list -d7 /mnt/quadup/borg/adrastea
ExecStart=/usr/bin/rsync -ra --delete /mnt/quadup/borg/adrastea/ /mnt/quadup/borg/adrastea.pre-compact-copy/
ExecStart=/usr/bin/borg compact /mnt/quadup/borg/adrastea
ExecStart=/usr/bin/borg check --repository-only /mnt/quadup/borg/adrastea
That means I have a pre-prune and a pre-compact copy of the repository and then run a check, meaning that if anything goes wrong, I have three states of the repository: before prune, after prune+before compact, and after compact.
Edit: I found an email from yesterday which had already pointed out the failed borg check. This was a false alarm in the sense that the conclusion below, that borg create must be at fault, is not correct. I'll repair the repo and will have to wait for the next occurrence :(.
@horazont there is one known issue in borg create which I only fixed very recently in the master branch:
If an input file has e.g. an I/O error in the middle of its content, some chunks before that point are already written to the repo. Due to the error, it then skips that file and logs E path/filename, but it continues to back up the remaining files and then commits all changes (including the chunks of the skipped file). As these chunks are not referenced by an archive item, they are orphans. Such orphan content chunks are expected for borg < 2 (and are harmless).
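Roughly, the behaviour described above amounts to the following toy model (a simplified, self-contained illustration, not actual borg code; all names here are made up):

# Toy model of "orphans from files that error out mid-read" (not borg code).
class ToyRepo:
    def __init__(self):
        self.pending = []        # chunks written but not yet committed
        self.objects = {}        # committed: chunk id -> data
    def put(self, data):
        cid = hash(data)
        self.pending.append((cid, data))
        return cid
    def commit(self):
        self.objects.update(self.pending)
        self.pending.clear()

def read_chunks(content, fail_after=None):
    for i, chunk in enumerate(content):
        if fail_after is not None and i >= fail_after:
            raise OSError("read error in the middle of the file")
        yield chunk

def backup(repo, files):
    items = []                   # each item references the chunks it uses
    for name, content, fail_after in files:
        stored = []
        try:
            for chunk in read_chunks(content, fail_after):
                stored.append(repo.put(chunk))
            items.append((name, stored))
        except OSError:
            print(f"E {name}")   # file skipped, but its chunks stay pending
    repo.commit()                # commits good AND half-written chunks
    referenced = {cid for _, chunks in items for cid in chunks}
    print("orphan chunks:", len(set(repo.objects) - referenced))

backup(ToyRepo(), [("ok.txt", ["a", "b"], None), ("bad.txt", ["c", "d"], 1)])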
that should limit the amount of code we have to look at. maybe it still leaves a bit of uncertainty in case it does not happen (then the question is how long to proceed without pruning)
I have not seen the error since I turned off prune and compact 3-4 weeks ago. Before that, the chance of getting the error every week was close to 100%. I base it on weekly runs since I have two servers running backups once per week, to two different locations. Server 1 runs on Tuesdays to locations A and B. Server 2 runs on Wednesdays to the same two locations.
I will now turn on prune again, but leave compact off for another month.
I just realized that I have 8 clients using borg 1.2.x. All 8 are running a prune, but only 2 of them have been running compact. The two that have been running compact share a repository; the rest have their own. I have only had issues with the two that actually run compact.
Also, I still haven't had issues with the repository where I removed both prune and compact, and then added prune again (3 weeks with prune and without compact have now passed).
So my conclusion is that it's either compact causing the issue, or two clients pushing into the same repository combined with compact. Regardless, it seems to me that the bug is in the compact code (or related).
All my tests have been with borg version 1.2.2. I recently upgraded to 1.2.4, but haven't enabled compact yet. I bet you would know if it was fixed though, so it's probably safe to assume that the issue is still there.
Hello,
Any news about this issue? We have a lot of borg checks failing over time ^^'
There is also a bounty on it :)
@the-glu no news yet - you could read them here, if there were any.
I looked at this multiple times, but did not find any bug yet nor any suspicious change relative to borg 1.1.
But thanks for the hint about the bounty, I updated the top post and the labels accordingly.
I'm experiencing the same thing on two repos that are backing up the same source data. Example:
summary:
/etc/borgmatic.d/common/borgmatic.yml: An error occurred
ssh://borg@borgvm.iyer.lan/mnt/arambkup/borg: Error running actions for repository
Remote: Starting repository check
Remote: finished segment check at segment 20505
Remote: Starting repository index check
Remote: Index object count mismatch.
Remote: committed index: 1569644 objects
Remote: rebuilt index: 1569645 objects
Remote: ID: fe6756851f9c526ac6c9a8db98f648c68e1e95e8bd16b606568fd52cab262f6e rebuilt index: (2009, 324386425) committed index: <not found>
Remote: Finished full repository check, errors found.
Not sure if it's relevant, but borg does exit 1. Happy to provide any additional info if it would help.
Hi, I'm also being affected by this, apparently. The error message I get is almost identical to the one above. I also have a different number of rebuilt index objects and committed index objects, more precisely:
committed index: N objects
rebuilt index: N+1 objects
I'm on 1.2.4.
Alright, I have a 109 GB repository directory which reproduces the error after running borg compact. I.e. I have a directory I can:
I am not familiar with the compact_segments code so I'm not sure where to add useful debugging here, but I'm happy to test any patches on this. I've got enough space to keep an un-compacted copy around for test runs.
Adding stuff from IRC channel:
20:07 jssfr$ so on my reproducer repo, compact --debug prints this: https://dpaste.com/B8AF82MDT
20:07 jssfr$ and check --debug prints this: https://dpaste.com/2P3LLP3EX
20:08 jssfr$ the corresponding debug log entry is: dropping DEL for id a57c1ad2b06c41e6a157dcf9fc0beda53d1f386e218d55ef55d584706c2285ef - seg 16205, iti 17022, knisi False, spe False, dins False, si []
So:
- compact drops the DEL for x (see the debug log line above)
- the PUT x remains in the segment files, but not in the index
- check then adds it back to the index
To quote from the check --repair:
1 orphaned objects found! Deleting 1 orphaned and 39 superseded objects...
Any further investigation?
Not sure if it gives additional insights, but you could try to extract that "problematic" chunk and look inside it; see the borg debug commands in some comments above.
But as it looks, borg compact is dropping a DEL that it should not drop. It's a bug, but a harmless one, as an unused/orphan chunk re-appears and then gets deleted again.
get-obj fails to retrieve it, and I don't have enough scratch space for dump-repo-objs --ghost. I did a manual search for the object in the segments and it looks like compressed data. Decompressing it yields medium-entropy data; I cannot ascertain what it belongs to. It does not look like archive stream data or similar (in contrast to https://github.com/borgbackup/borg/issues/6687#issuecomment-1422745298).
Deleting the hints file before running compact seems to avoid the issue.
That indicates to me that the shadow_index is wrongly generated by the deletes/prunes and the compact logic itself is sound.
@horazont if there is no entry for some chunkid in the shadow index, that triggers a special case in the compact code (like "we do not know about shadowed PUTs, thus we better keep the DEL"), so this is what you achieve by deleting the hints file.
The problem in your hints file seems to be that the list of segment numbers with shadowed PUTs is empty for that chunkid, so the compact code decides to drop the seemingly not needed DEL. But the list should not be empty as there actually still is a PUT for that chunkid in the segment files.
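In other words, the decision roughly boils down to something like this (a simplified sketch of the logic described above, not the actual compact_segments code):

def keep_del_during_compaction(chunk_id, shadow_index):
    # Simplified sketch of the DEL-handling decision described above.
    if chunk_id not in shadow_index:
        # Nothing known about shadowed PUTs for this id (e.g. hints file
        # deleted): play it safe and keep the DEL.
        return True
    shadowed_put_segments = shadow_index[chunk_id]
    # If this list is (wrongly) empty although an old PUT still exists in
    # some segment file, the DEL gets dropped and the old PUT can later
    # resurrect as an orphan - the suspected bug here.
    return len(shadowed_put_segments) > 0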
Maybe the shadow_index being wrong is related to these changes: #5636 #5670
Could it be that the issues seen here were seen after borg did "double-puts" (writing a chunk x to the repo which already has a chunk x), like when using borg recreate to change the compression algorithm?
The manifest also always gets (over)written to the same id 0, but might behave differently because it is always written to its own segment file (causing the segment with the superseded manifest to always be compacted) - otherwise we would have seen this issue all the time, not just rarely.
not 100% sure if this also fixes #6687 (this issue), but it could be.
if somebody is seeing this frequently, locally applying the fix and observing the behaviour in practice would be welcome! the fix is for the repo code, so it needs to be applied on the repo side when running borg in client/server mode.
See #5661 for some conclusion.
More conclusions about shadow_index (short: si):
- not persisting si in borg < 1.2 and then persisting it in borg 1.2 might cause an issue: if there was an old PUT x ... DEL x in the repo (made with borg < 1.2), we do not have an si entry about that. if borg >= 1.2 does a PUT x ... DEL x, we will have an si entry about x, but it will only list the 2nd PUT (and borg will assume that an existing entry for x lists all the PUT segment numbers).
- if then borg compacts the segment file with the first DEL, it will drop the DEL as it does not know about the 1st PUT. it will still be correct, because there is that last DEL, keeping x deleted.
- if then borg compacts the segment files with the last PUT and also the one with the last DEL, then these both will be removed. that will resurrect the first PUT when borg check --repair rebuilds the index (repository part) and delete it again after the archives part, because it is orphan.
- after that, it should be stable, as there is an si entry now for x listing the segment number of the first PUT.
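A step-by-step toy walk-through of that sequence (a simplified model of the si bookkeeping, not borg code; segment numbers are made up):

# si only knows about PUTs made since the upgrade to 1.2, so it misses the
# 1.1-era PUT.
x = b"chunkid-x"
si = {}                              # shadow index, persisted since 1.2 only

# borg < 1.2:  PUT x in segment 10, DEL x in segment 20 -> no si entry.
# borg >= 1.2: PUT x in segment 30, DEL x in segment 40:
si.setdefault(x, []).append(30)      # only the 2nd PUT gets recorded

# compacting segment 20 drops the first DEL - still fine, segment 40's DEL
# keeps x deleted.

# compacting segment 30 drops the superseded PUT and forgets about it:
si[x].remove(30)
# compacting segment 40: si[x] == [], so the DEL looks unnecessary and is
# dropped - but segment 10 still holds the first PUT x.  A later
# check --repair resurrects x in the index (repository part) and then
# deletes it again as an orphan (archives part).
assert si[x] == []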
TODO: think about whether borg check or borg compact --total could create a complete si.
Testing PRs #7896 and #7897:
Practical testing of these is very welcome (be careful!).
For @horazont :
I suspect the users who either have old repos from borg < 1.2 or use borg recreate to change the compression are more likely to experience this issue than with repos that are fresh since borg 1.2 and never see borg recreate to change compression.
After another self-review of these, I just merged the 2 PRs into 1.2-maint.
Guess if some more people besides @horazont could practically test the code (see previous comment), we could have a 1.2.7 release rather soon.
I deleted/changed some ancient data on that host. Ideally, that will trigger relevant ancient (borg 1.1.x-created) chunks to be expired from the repository next week, which would then trigger the issue.
The trap triggered.
With the pre-prune copy and borg 1.2.6 from debian testing, I did:
Then I checked out 1.2-maint and installed it into a venv (borg 1.2.7.dev79+g70eed5e9). I reset the repository to its previous state and deleted any data related to that repository from ~/.config/borg and ~/.cache/borg.
Using the borg from the venv, I did:
🥳
For the record: the --repair is necessary, just using 1.2-maint before pruning is not sufficient. I think that is expected?
@horazont yes, I added rebuilding of the compaction infos when using --repair.
Have you checked borgbackup docs, FAQ, and open Github issues?
Yes.
Is this a BUG / ISSUE report or a QUESTION?
Possible bug.
System information. For client/server mode post info for both machines.
Your borg version (borg -V).
1.2.0 on all clients and servers. Previous version was 1.1.17 (not 1.1.7, as I wrote on the mailing list).
Operating system (distribution) and version.
Debian buster and Ubuntu 20.04 on servers. Debian buster, Ubuntu 20.04 and Ubuntu 21.10 on clients.
Hardware / network configuration, and filesystems used.
Multiple local and remote clients accessing each repository. Repositories are on ext4, on RAID 1 mdadm devices, with spinning disks underlying them. The Debian server also uses lvm.
How much data is handled by borg?
The repos are all around 100GB in size, with up to 400 archives each. The repositories have been in use for many years.
Full borg commandline that lead to the problem (leave away excludes and passwords)
borg check /path/to/repo [more details below]
Describe the problem you're observing.
borg check shows errors on three different repositories on two different machines. See below for details.
Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.
Yes, borg check shows the same errors when run again.
Include any warning/errors/backtraces from the system logs
I upgraded from borg 1.1.17 to 1.2.0 on several different systems on about April 9. On May 9, my monthly "borg check" runs gave errors on three repositories on two systems. Note that I use the setup where several clients do their backups into the same repositories. I don't have any non-shared repositories for comparison.
At the time of the upgrade from 1.1.17 to 1.2.0, I ran borg compact --cleanup-commits ... followed by borg check ... on all repos. There were no errors then. After that, I run borg compact without --cleanup-commits followed by borg check once per month. The errors occurred at the one-month mark.
System 1 runs Ubuntu 20.04. Two of the three repos on this machine now have errors:
System 2 runs Debian buster. One of the three repos on this machine now has errors:
I have used borg on these systems for years, and no hardware has changed recently. System 1 has the repos on a RAID 1 mdadm device with two SATA spinning disks. System 2 also has the repos on RAID 1 mdadm devices with two SATA disks, with lvm as a middle layer. In both cases, smartctl shows no issues for any of the drives, and memtester also shows no errors.
Since the errors have happened on different machines within a month of upgrading to 1.2.0, I am concerned that this is a borg issue rather than a hardware issue. It is also suspicious to me that the error is the same in all cases, with a committed index not found. Hardware errors tend to produce garbage.
I have not run repair yet. Is there anything I should do before running repair to try to figure out the issue?
Update: there is a bounty for finding/fixing this bug: https://app.bountysource.com/issues/108445140-borg-check-errors-after-upgrade-to-1-2-0