Closed: jdchristensen closed this issue 10 months ago
One thing I need to check (again) is whether --cleanup-commits
is a bit too rude (removing all 17-byte files): should it maybe keep the last one? The following compaction run will add a new commit at the end, but maybe we still need the previous one.
But what @jdchristensen has seen does not point in that direction.
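For context, the 17-byte files mentioned above are commit-only segments left behind by borg 1.1: plausibly an 8-byte segment magic plus a 9-byte COMMIT entry. A minimal sketch (not borg code; the size constant and helper name are assumptions) for locating them:

```python
# Sketch (assumption, not borg code): commit-only segment files from a
# borg 1.1 repo are exactly 17 bytes: an 8-byte segment magic plus a
# 9-byte COMMIT entry. "borg compact --cleanup-commits" removes them all;
# the question above is whether the newest one should be kept.
import os

COMMIT_ONLY_SIZE = 17  # assumed: 8-byte header + 9-byte COMMIT entry

def find_commit_only_segments(data_dir):
    """Return 17-byte segment files under data_dir, sorted by segment number."""
    hits = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            path = os.path.join(root, name)
            if name.isdigit() and os.path.getsize(path) == COMMIT_ONLY_SIZE:
                hits.append(path)
    return sorted(hits, key=lambda p: int(os.path.basename(p)))
```

Keeping the last entry of the returned list instead of deleting everything would be the "keep the newest commit" variant discussed above.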
from irc:
23:21 amiconn$ Hmm, 7 orphaned and 19 superseded objects (following an index rebuild) during borg check within 2 weeks after migrating to borg 1.2. With borg 1.1 there never were any repository errors...
23:22 ThomasWaldmann$ amiconn: was that the first check after switching?
23:22 amiconn$ nope
23:23 amiconn$ The first check found 3 orphaned objects, which were fixed by borg check --repair
23:24 amiconn$ I did the first check directly after migration as recommended by the instructions
23:28 amiconn$ Btw 19 superseded objects coincides with the number of archives in the repo
If useful: I have been running borg 1.2 (always the latest release) compact --cleanup-commits once a week on my 1.1 repo for years, and have never had any check issue...
Adding some more info on my usage of borg 1.2 on my 1.1 repo:
- borg create every day with 1.1
- borg compact --cleanup-commits once a week for years (to clean up the 17-byte files)
- borg prune with 1.2.0
I rarely do checks, and I always use --repository-only; I just did one and it found no problems.
@FabioPedretti the repo-only check can only find some specific errors, but e.g. it can not know about orphaned chunks.
OK, I may try again without it in the future. Do you think it will take a lot more time? The last run took over 26 h with --repository-only on a repo with 2.62 TB deduplicated size, 5429445 unique chunks and 3113 archives.
A full check adds time that is proportional to archive_count * archive_size. So if you have thousands of large archives in there, it'll take long.
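As a toy illustration of that scaling claim (the cost model and numbers are illustrative only, not borg's actual accounting):

```python
# Toy cost model (illustrative only): the archives phase of a full check
# walks every archive's metadata stream chunk by chunk, so its cost grows
# with archive_count * chunks_per_archive, while a --repository-only
# check scales with repo size alone.
def full_check_cost(archive_count, chunks_per_archive, cost_per_chunk=1):
    return archive_count * chunks_per_archive * cost_per_chunk

cost_few = full_check_cost(20, 100_000)     # 20 archives
cost_many = full_check_cost(3113, 100_000)  # 3113 archives: far more work
```

This is why a repo with thousands of archives can spend more than a day in the archives phase even when the deduplicated data size is moderate.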
I was going to suggest that @FabioPedretti could run a check with -v --archives-only --last 10
to save time, but I tested it and I see that it prints a message saying that it skips checking for orphans unless all archives are checked (which makes sense).
I'm also experiencing check errors in a bunch of borg repos. Multiple hosts run backups to a central VM. Each client has its own repo. The backups reside on an ext4 file system (VM) on ZFS (hypervisor) and are automated using borgmatic.
Example error:
I was able to repair the errors using check --repair, but in at least one repo the next check, one week later, showed errors again.
Both clients and server currently run borg 1.2.0.
I just upgraded the server to 1.2.1 to use borg debug dump-repo-objs
and successfully extracted a few files, which all look fine.
@davidmehren running a full check --repair, not just --repository-only, would be interesting.
What the above log shows is that there are objects in the segment files which were not in the committed index. But even more interesting would be whether these objects are actually referenced by archives (then this would be a relatively severe error) or not (orphaned / superseded objects), which would be rather harmless (but still unpretty).
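A toy model of the distinction being drawn here (plain sets of chunk IDs; this is an assumption about the logic, not borg internals):

```python
# Toy model (not borg internals) of the two cases described above:
# objects present in segment files but absent from the committed index
# are harmless orphans/superseded objects, unless an archive still
# references them, which would mean real data is unaccounted for.
def classify_missing(segment_objects, committed_index, referenced_by_archives):
    """Split objects found in segments but absent from the committed index."""
    missing = set(segment_objects) - set(committed_index)
    severe = missing & referenced_by_archives    # archives still need these
    harmless = missing - referenced_by_archives  # orphaned / superseded cruft
    return severe, harmless
```

If `severe` is non-empty, archives reference data the index doesn't know about (the dangerous case); a non-empty `harmless` set just wastes space until the next repair/compaction.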
I ran check --repair
directly on the central server, on the same repo as in the error message above:
# borg --debug check --repair .
using builtin fallback logging configuration
33 self tests completed in 1.13 seconds
This is a potentially dangerous function.
check --repair might lead to data loss (for kinds of corruption it is not
capable of dealing with). BE VERY CAREFUL!
Type 'YES' if you understand this and want to continue: YES
Starting repository check
Verified integrity of /mnt/borgpool/sharelatex2/backup/index.2376
Read committed index of transaction 2376
Cleaned up 0 uncommitted segment files (== everything after segment 2376).
Segment transaction is 2376
Determined transaction is 2376
Found 45 segments
checking segment file /mnt/borgpool/sharelatex2/backup/data/0/6...
checking segment file /mnt/borgpool/sharelatex2/backup/data/0/7...
checking segment file /mnt/borgpool/sharelatex2/backup/data/0/8...
checking segment file /mnt/borgpool/sharelatex2/backup/data/0/40...
checking segment file /mnt/borgpool/sharelatex2/backup/data/1/1573...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2181...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2182...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2183...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2184...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2185...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2186...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2187...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2263...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2275...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2277...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2281...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2285...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2289...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2293...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2297...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2301...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2313...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2314...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2316...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2320...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2324...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2328...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2332...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2336...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2340...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2344...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2345...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2349...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2353...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2357...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2366...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2368...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2369...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2370...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2371...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2372...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2373...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2374...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2375...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2376...
finished segment check at segment 2376
Starting repository index check
Finished full repository check, no problems found.
Starting archive consistency check...
Verified integrity of /mnt/borgpool/sharelatex2/backup/index.2376
Enter passphrase for key /mnt/borgpool/sharelatex2/backup:
TAM-verified manifest
Manifest is TAM verified and says TAM is required, updating security database...
Analyzing archive sharelatex2-2022-05-22T05:19:44.740423 (1/20)
Analyzing archive sharelatex2-2022-05-23T05:34:52.894630 (2/20)
Analyzing archive sharelatex2-2022-05-24T05:51:01.223206 (3/20)
Analyzing archive sharelatex2-2022-05-25T05:39:18.517606 (4/20)
Analyzing archive sharelatex2-2022-05-26T05:40:29.177019 (5/20)
Analyzing archive sharelatex2-2022-05-27T05:39:12.190087 (6/20)
Analyzing archive sharelatex2-2022-05-28T05:37:59.986794 (7/20)
Analyzing archive sharelatex2-2022-05-29T05:27:02.741374 (8/20)
Analyzing archive sharelatex2-2022-05-30T05:19:20.174900 (9/20)
Analyzing archive sharelatex2-2022-05-31T05:26:10.903314 (10/20)
Analyzing archive sharelatex2-2022-06-01T05:17:07.432204 (11/20)
Analyzing archive sharelatex2-2022-06-02T05:32:13.477160 (12/20)
Analyzing archive sharelatex2-2022-06-03T05:48:06.501439 (13/20)
Analyzing archive sharelatex2-2022-06-04T05:52:45.158290 (14/20)
Analyzing archive sharelatex2-2022-06-05T05:47:08.629424 (15/20)
Analyzing archive sharelatex2-2022-06-08T05:19:29.677737 (16/20)
Analyzing archive sharelatex2-2022-06-09T05:06:04.154792 (17/20)
Analyzing archive sharelatex2-2022-06-10T05:03:51.264720 (18/20)
Analyzing archive sharelatex2-2022-06-11T05:01:53.968764 (19/20)
Analyzing archive sharelatex2-2022-06-12T05:52:51.009114 (20/20)
48 orphaned objects found!
Deleting 48 orphaned and 20 superseded objects...
Cleaned up 0 uncommitted segment files (== everything after segment 2376).
Verified integrity of /mnt/borgpool/sharelatex2/backup/hints.2376
Finished deleting orphaned/superseded objects.
Writing Manifest.
Committing repo.
check_free_space: required bytes 566240528, free bytes 1671414366208
Archive consistency check complete, problems found.
[...]or not (orphaned / superseded objects), which would be rather harmless (but still unpretty).
I've run into the issue every day over the last week for ONE of the two former repos. The last two days the issue didn't occur. But when I repaired, the logs ALWAYS mentioned some orphaned and superseded objects; other than that, I can't remember check ever reporting any result besides everything being OK.
My latest scheduled borg check runs just happened, and again there is an error with the system.borg repo on "system 1", which runs Ubuntu 20.04:
Starting repository index check
Index object count mismatch.
committed index: 2489296 objects
rebuilt index: 2489297 objects
ID: ca44bf70d3d28eafbbae6ccf87841d2af6a13fd06617db185fd5ef3669b38b0b
rebuilt index: (26830, 446211143) committed index: <not found>
That segment file has an mtime of Feb 17, 2022. I have run checks since then without errors. The chunk at that offset is some ascii text:
11 updates can be applied immediately.
2 of these updates are standard security updates.
To see these additional updates run: apt list --upgradable
If anyone knows where I would expect to find that in the filesystem, let me know and I'll look for it.
The other repos on system 1 and system 2 are fine.
@jdchristensen That could be output from apt update or something related. Maybe /var/log/apt/...?
Ah, this kind of thing is contained in /var/lib/update-notifier/updates-available. It's different from what is currently in that file on any machine, so it was probably pruned at some point.
I ran a repair on this repo and got:
1 orphaned objects found!
Deleting 1 orphaned and 291 superseded objects...
Before doing the repair, I saved a hardlinked copy of the faulty repo in case it can help with debugging. As happened above, the config
file for the hardlinked copy is now corrupt! I created a new issue for this: #6768
BTW, I still have the hardlinked copy of that repo from the first time I got corruption. In that copy, this chunk would probably have been present, and was not an orphan according to the check run then. Is there some debugging I could do using that old copy?
Well, if that is a more severe error, logs about and content of that chunk would be interesting.
It was the same error (the one that started this issue). To summarize: I got two orphaned chunks around May 9. I saved a copy of the repo as "copy 1". I repaired the repo (but not the copy). On June 12, I got another orphaned chunk in the repo. I saved a copy of the repo as "copy 2". I repaired the repo (but not the copy). Now "copy 1" (from early May) might have some information about the chunk that was detected as orphaned on June 12. For example, can I use "copy 1" to find out which archives referenced the chunk at that time?
I'm also trying to reproduce this issue using the script below, which tries to mimic my usage of borg (no encryption, several "hosts" doing independent pruning, occasional "file changed" errors, etc.). So far no errors after running it for an hour or two on three machines. Any suggestions for tweaks to the script that might make it more likely to reveal a bug?
#!/bin/sh
# Usage:
# cd /tmp [or maybe a tmpfs, for speed]
# /path/to/script | tee script.out
# It will create a subfolder borgtest of the current directory
# and do all of its work in there.
# Abort if any command exits with a nonzero exit code:
set -e
borg -V
TESTDIR=borgtest
rm -rf ${TESTDIR}
CONTENT=${TESTDIR}/content
mkdir -p ${CONTENT}
export BORG_REPO=${TESTDIR}/test.repo
borg init -e none -v
# Create some unchanging files:
for k in `seq 1 100`; do echo ${k} > ${CONTENT}/static.${k}; done
while true; do
# Change a few files that will be the same for all hosts:
for k in `seq 1 4`; do
dd if=/dev/urandom of=${CONTENT}/randomshared.${k} count=1 bs=10000 2> /dev/null
done
# For each "host", change some files, create an archive, and do pruning:
for i in `seq 1 4`; do
for k in `seq 1 5`; do
dd if=/dev/urandom of=${CONTENT}/random.${k} count=1 bs=10000 2> /dev/null
done
# Create a larger file that will change during the backup.
# The size and the number of background writes may need to be adjusted.
dd if=/dev/urandom of=${CONTENT}/changing count=1 bs=1M 2> /dev/null
(for j in `seq 5000`; do date >> ${CONTENT}/changing; done &)
# Create a file that is unreadable:
date > ${CONTENT}/date.out
chmod a-r ${CONTENT}/date.out
# Do the backup; we put "|| true" at the end since we expect two warnings:
borg create -v ::host${i}-{now} ${CONTENT} || true
# Prune independently for each host using -P:
borg prune -v --list --keep-last 2 --keep-minutely 10 --keep-hourly 10 -P host${i}
done
borg compact -v
borg check -v
echo
done
There is borg debug dump-archive and borg debug dump-archive-items. Not sure if they output all the needed chunk references in readable form.
borg debug dump-archive asks for a PATH argument, but I don't know what to supply there. borg debug dump-archive-items produced a bunch of binary files that I couldn't figure out.
For dump-archive, PATH is the to-be-created output file.
Yeah, I've already suspected that dump-archive-items just writes the decrypted, decompressed binary items to files (but does not decode them to readable output).
dump-archive does give data in a usable json format, but it's slow and large, and I'd have to run through all archives in order to try to find that chunk...
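One hedged way to automate that search: recursively scan the JSON written by borg debug dump-archive for the chunk ID given as a hex string, without assuming anything about the dump's exact layout. `archive_references_chunk` is a made-up helper name, not a borg API:

```python
# Hedged sketch: walk the JSON produced by "borg debug dump-archive ...
# out.json" and look for a chunk ID (hex string) anywhere in the nested
# structure. No assumptions about the dump's layout; we just visit every
# value. archive_references_chunk is a hypothetical helper, not borg API.
import json

def json_contains(value, needle):
    """True if the string `needle` appears inside any string in `value`."""
    if isinstance(value, dict):
        return any(json_contains(v, needle) for v in value.values())
    if isinstance(value, list):
        return any(json_contains(v, needle) for v in value)
    return isinstance(value, str) and needle in value

def archive_references_chunk(dump_path, chunk_id_hex):
    with open(dump_path) as f:
        return json_contains(json.load(f), chunk_id_hex)
```

Running this over one dump per archive would at least narrow down which archives (if any) referenced the suspect chunk, at the cost of dumping each archive once.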
I searched for offset 446211143 in the archives made in the 2.5 months leading up to Feb 17, 2022 (the mtime of the segment file), and didn't find it. So I suspect that it had already been pruned by the time the May 9 corruption happened and I saved a copy of the repo. It did not show up as an orphan at that time, so it appears that borg correctly knew that it was deleted. Something between May 9 and June 12 seems to have caused it to become an orphan. Puzzling.
I've also been running the stress testing script, now using client/server mode via ssh to localhost. Still no orphans being produced.
Something which I find strange as well. The following is from yesterday after some days of silence:
Remote: Starting repository index check
Remote: Index object count mismatch.
Remote: committed index: 930105 objects
Remote: rebuilt index: 930107 objects
Remote: ID: 705a9f05d1e7a8ed3fa719de2455e973b4c5401c8ee9da713c91971ac14df9e6 rebuilt index: (38206, 493217629) committed index: <not found>
Remote: ID: a42aef79f2c8ea7d1becf62dc84c7879bf94b4b32801944f15a87d752c7b9bb7 rebuilt index: (5678, 332609646) committed index: <not found>
Remote: Finished full repository check, errors found.
The following is from today:
Remote: Starting repository index check
Remote: Index object count mismatch.
Remote: committed index: 930428 objects
Remote: rebuilt index: 930431 objects
Remote: ID: dfba4c044abdd4fe58135983fea9faa2731f7134c3b9f7753dc517342ab53452 rebuilt index: (85234, 387659010) committed index: <not found>
Remote: ID: 705a9f05d1e7a8ed3fa719de2455e973b4c5401c8ee9da713c91971ac14df9e6 rebuilt index: (38206, 493217629) committed index: <not found>
Remote: ID: bc45831ca445b420bf652246c9e67d2f4406ba726a7c6493253d2fe813b840a4 rebuilt index: (82632, 242255975) committed index: <not found>
Remote: Finished full repository check, errors found.
The first ID from yesterday and the second one from today are the same, while the others are not, even though I run the same checks every day and didn't repair yesterday (otherwise the repeating ID most likely wouldn't be there).
Does that mean anything? I would have expected the list of IDs to simply grow with each found problem, with the same IDs repeating and only newer ones being added.
Maybe another data point: Since upgrading to 1.2 (Arch Linux) one repository (rootfs on btrfs to local repo on zfs) consistently throws Index count errors every few days, while the other (data dir on zfs to local repo on zfs) has never shown any issues.
@ams-tschoening interesting. like that chunk got deleted from repo index but not from segment files (and thus reappeared in the rebuilt index).
@matmo also interesting, but could also be a coincidence.
@ams-tschoening interesting. like that chunk got deleted from repo index but not from segment files (and thus reappeared in the rebuilt index).
Two things I thought about: might this have anything to do with the increased segment file cleanup threshold of 25%? I use ZFS for storing BORG repos, create additional snapshots daily, and the amount of changed data in the snapshots is far less with that value than in the past. Though, the problem only started to happen after Hetzner put BORG 1.2 in place in their SBOX, already using 10% by default.
Second thought came into my mind after trying to debug this myself and looking at some code. I found the following lengthy explanation of things:
# Now we crash. But only segment 2 gets deleted, while segment 1 is still around. Now key 1
# is suddenly undeleted (because the delete in segment 2 is now missing).
# Again, note the requirement here. We delete these in the correct order that this doesn't happen,
# and only if the FS materialization of these deletes is reordered or parts dropped this can happen.
# In this case it doesn't cause outright corruption, 'just' an index count mismatch, which will be
# fixed by borg-check --repair.
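The quoted scenario can be simulated in a few lines (a deliberate simplification, not borg's actual index code): PUT key1 in segment 1, DEL it in segment 2, then lose segment 2 and replay what is left:

```python
# Simplified simulation (not borg's index code) of the quoted scenario:
# key1 is PUT in segment 1 and DELeted in segment 2. If a crash (or a
# snapshot rollback) removes segment 2 but keeps segment 1, replaying the
# remaining segments resurrects key1, so the rebuilt index ends up with
# one object more than the committed one: an index count mismatch.
def rebuild_index(segments):
    index = {}
    for seg_no in sorted(segments):
        for op, key in segments[seg_no]:
            if op == "PUT":
                index[key] = seg_no
            elif op == "DEL":
                index.pop(key, None)
    return index

segments = {1: [("PUT", "key1")], 2: [("DEL", "key1")]}
committed = rebuild_index(segments)  # key1 correctly gone
del segments[2]                      # segment 2 lost in the crash/rollback
rebuilt = rebuild_index(segments)    # key1 is suddenly undeleted
```

This matches the comment's point: the result is "only" an index count mismatch that borg check --repair can fix, not outright corruption, but it requires the deletes to hit disk out of order (or to be rolled back wholesale, as with a snapshot).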
I can't remember BORG ever crashing, BUT some months ago I accidentally deleted some archives of some topic from one of the repos, while I really wanted to keep the oldest one. I couldn't easily re-create that archive and therefore decided to roll back the entire ZFS snapshot for that day, pretty much doing the following:
“this is either an attack or unsafe” warning
https://borgbackup.readthedocs.io/en/latest/faq.html#this-is-either-an-attack-or-unsafe-warning
Afterwards I ran multiple checks and backups etc. successfully for additional weeks/months until the warnings of this issue so didn't expect things to be related. But looking at the above comment, with my rollback I might have simply triggered exactly that state, didn't I? Some chunk might have been properly deleted in a newly created segment file which I removed with my rollback. No actual bug in BORG at all then...
OTOH, what about all the other users seeing this? Did you all do a similar rollback with your repos at some point? :-) Additionally, I would expect these types of problems to have been recognized before version 1.2 already, as the code seems to be older.
The code comment highlighted by @ams-tschoening is from compact_segments()
. I run that locally once a week and have never seen crashes then, so I don't think it is directly relevant to the errors I am seeing.
I also did not do any rollback or anything else unusual. However, I do many backups to these repos, and follow them with prune commands, and very occasionally there is a network error during the backup or prune which causes it to die. When an error occurs during a backup, I would expect the usual rollback procedure to leave things in a consistent state. But what if there is a crash during a prune operation? Again, since compaction is a separate step, I would expect that a crash during a prune shouldn't lead to the above, but maybe I'm wrong?
@ams-tschoening borg 1.2 got some changes on the repo segment file level to have less stuff moving around, likely that's why your zfs snapshots are smaller. I don't think the compaction percentage has anything to do with this issue.
Maybe this helps in narrowing the area where to look: I think I hit the problem yesterday on a CentOS 7 machine using borg 1.1.18 while the target is a Hetzner Storagebox which uses borg 1.2. Never accessed the repo with a borg 1.2 client. Also I did not experience network issues (since the last passing check). At least no issue which would abort a backup process.
I guess looking at the dumped chunk is not really interesting, but if it is, I'd need a way to decrypt the chunk first.
OK, so I guess we can assume it is the 1.2.x repository code, likely the compaction part.
Now a different repository on a different host is showing errors during check. This repository was only accessed by one client, over a LAN. The remote host is running Ubuntu 20.04 and borg 1.2.0. Another data point that I think I didn't mention above is that I run the fat binaries on all machines.
Oh, and one of the other repos that had errors previously has errors again. I've lost count of how many times my monthly checks have found errors, but it is happening quite often for me.
https://github.com/borgbackup/borg/issues/6899#issuecomment-1198406243 noticed something there which would explain that an on-disk repo index can be a bit behind the (intermediate-)committed repo state if compact_segments is interrupted (write_index is only called once at the very end).
but does it explain the effects seen here?
[...]if compact_segments is interrupted[...]
I don't see that happening for me.
[...]in general, users are advised to run borg prune frequently (like daily or weekly)[...]
Am doing exactly that, daily.
I run prune after every backup and do several backups per day into my repositories. Once a month, I do compact followed by check, and a surprising number of those checks show errors. But the logs show that the compact commands complete successfully. The reproducible example in #6781 also shows that the error is produced by compact, even though it completes successfully. Are there places in the compact code where (possibly temporary) assertions could be added to catch a bug? The poster in #6781 could run this code using their reproducible example.
@jdchristensen an idea for your reproducer script: maybe create a file that always has the same contents, but the latest ctime/mtime (== directly before invoking borg create), like the mdstat file you noticed above.
Files with the latest timestamp get special treatment by the files cache; if such a file additionally always has the same contents, that creates a rather special case (but I can't imagine yet how that could lead to problems).
btw, yesterday I took a large repo and ran a lot of del / prune / compact / check and also some create ops on it, but failed to find any issue (using 1.2-maint code). I used zfs snapshots to keep the state before/after compact, so that in case a problem shows up, I have both states for further analysis.
@jdchristensen can you use the code from #6918 and run borg compact with --debug?
@ThomasWaldmann I have updated my stress testing script to sometimes touch unchanged files at the end. I also changed the repo and chunk parameters in a way that should cause much more churn during compaction. I'm currently running the new scripts in tmpfs directories on two machines, but haven't found any borg check errors yet. New script attached. borg-stress.txt
I will try the new debug code as soon as I get a chance.
I ran my borg-stress script for 1.5 days on a tmpfs. It did 70000 create+prune commands and 1765 compact+check commands without any errors. There are over 28000 numbered folders in the data/ folder (many empty). The main difference between this script and my repos that have had problems is that the repos that have had problems were created with older versions of borg, probably 1.1.0b3 in 2017. Could that somehow affect things?
Hard to say before knowing the actual root cause. The compaction code has 1 relatively complex condition (where I added that debug logging recently), the rest of it is rather simple.
The entries in the segment files did not change before borg2, it's just PUT, DEL and COMMIT.
borg 1.1 did always compact at the end of repo writing commands (and did not need to persist some compaction related infos to disk), while 1.2 compacts separately (and needs to persist these infos into the hints file).
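A sketch of how such a PUT/DEL/COMMIT log replays (an assumption about the semantics, not borg's implementation): everything after the last COMMIT is discarded, which appears to be what the "Cleaned up ... uncommitted segment files" line in the check output refers to:

```python
# Sketch (assumed semantics, not borg's implementation): replay a
# PUT/DEL/COMMIT entry log. Only state as of the last COMMIT is durable;
# any uncommitted tail is rolled back on the next open/check.
def replay(log):
    committed, pending = {}, {}
    for entry in log:
        tag = entry[0]
        if tag == "PUT":
            pending[entry[1]] = True
        elif tag == "DEL":
            pending.pop(entry[1], None)
        elif tag == "COMMIT":
            committed = dict(pending)  # snapshot becomes the durable state
    return committed
```

For example, `replay([("PUT", "a"), ("COMMIT",), ("PUT", "b")])` keeps only "a": "b" was written but never committed, so it is discarded.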
I ran borg compact --debug on two repos and got several of the "dropping DEL for id" messages. But the following borg check runs didn't show any errors this time. I'll try again in the future.
Moving this to 1.2.3 milestone to get 1.2.2 out quicker with other important fixes. I'll merge the debug log PR so that all 1.2.2 users can easily use that.
I adapted my stress testing script to use borg2, and I can now consistently reproduce the orphans problem. A version of the script that excludes some unnecessary parts is attached; instructions are at the top. I have run it 15 times: 14 times it showed orphans during the first run of borg check (after four creates and one compact, but no prunes), and 1 time during the second run of borg check. (Different random data and chunking seeds can affect it, I guess.) It only reproduces if some of those creates are killed, resulting in checkpoint archives. Sample output is attached.
Hopefully this helps find the bug! I have no idea why my script successfully reproduces the problem for borg2, even though it never showed a problem with borg1, even after many hours of running.
Edit: the errors my script found are fixed by #7305, and were a borg2-specific bug. So currently my stress testing scripts can't reproduce any of the issues people are noticing with borg compact.
And I just noticed that the problem happens even without the borg compact
line. So with no pruning and no compacting, some orphans arise. Is it just because borg is killed during a create operation?
Maybe. IIRC we have a ticket about borg maybe creating orphans when the content chunks for a file have been written, but not the archive item for that file (which is written after all content chunks).
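That hypothesis can be modeled in a toy form (pure simulation, not borg code): content chunks are stored before the archive item that references them, so a kill in between leaves stored-but-unreferenced chunks, i.e. orphans:

```python
# Toy simulation (assumption, not borg code) of that hypothesis: during
# create, a file's content chunks are stored before the archive item
# referencing them is written. Killing the process in between leaves
# chunks that nothing references: exactly the orphans check reports.
def simulate_create(files, kill_after_chunks=None):
    stored, referenced, written = set(), set(), 0
    for _name, chunks in files:
        for chunk_id in chunks:
            if kill_after_chunks is not None and written >= kill_after_chunks:
                return stored - referenced  # orphans visible at kill time
            stored.add(chunk_id)
            written += 1
        referenced.update(chunks)  # archive item written after all chunks
    return stored - referenced      # a clean run leaves no orphans
```

Killing the run after the first chunk of a two-chunk file leaves that chunk stored but unreferenced, whereas an uninterrupted run produces no orphans.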
Thanks for the script, I'll try it!
There is evidence from others, at least with borg 1.2, that a borg repo can pass borg check without errors, but then fail after borg compact is run on it. So my script seems to be revealing a different problem than the one in this issue. When borg is killed with the default signal, does it try to clean things up, e.g. by writing a checkpoint archive? What happens if it is part-way through processing a large file?
Have you checked borgbackup docs, FAQ, and open Github issues?
Yes.
Is this a BUG / ISSUE report or a QUESTION?
Possible bug.
System information. For client/server mode post info for both machines.
Your borg version (borg -V).
1.2.0 on all clients and servers. Previous version was 1.1.17 (not 1.1.7, as I wrote on the mailing list).
Operating system (distribution) and version.
Debian buster and Ubuntu 20.04 on servers. Debian buster, Ubuntu 20.04 and Ubuntu 21.10 on clients.
Hardware / network configuration, and filesystems used.
Multiple local and remote clients accessing each repository. Repositories are on ext4, on RAID 1 mdadm devices, with spinning disks underlying them. The Debian server also uses lvm.
How much data is handled by borg?
The repos are all around 100GB in size, with up to 400 archives each. The repositories have been in use for many years.
Full borg commandline that lead to the problem (leave away excludes and passwords)
borg check /path/to/repo [more details below]
Describe the problem you're observing.
borg check shows errors on three different repositories on two different machines. See below for details.
Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.
Yes, borg check shows the same errors when run again.
Include any warning/errors/backtraces from the system logs
I upgraded from borg 1.1.17 to 1.2.0 on several different systems on about April 9. On May 9, my monthly "borg check" runs gave errors on three repositories on two systems. Note that I use the setup where several clients do their backups into the same repositories. I don't have any non-shared repositories for comparison.
At the time of the upgrade from 1.1.17 to 1.2.0, I ran
borg compact --cleanup-commits ...
followed by borg check ... on all repos. There were no errors then. After that, I run borg compact without --cleanup-commits followed by borg check once per month. The errors occurred at the one-month mark.
System 1 runs Ubuntu 20.04. Two of the three repos on this machine now have errors:
System 2 runs Debian buster. One of the three repos on this machine now has errors:
I have used borg on these systems for years, and no hardware has changed recently. System 1 has the repos on a RAID 1 mdadm device with two SATA spinning disks. System 2 also has the repos on RAID 1 mdadm devices with two SATA disks, with lvm as a middle layer. In both cases, smartctl shows no issues for any of the drives, and memtester also shows no errors.
Since the errors have happened on different machines within a month of upgrading to 1.2.0, I am concerned that this is a borg issue rather than a hardware issue. It is also suspicious to me that the error is the same in all cases, with a committed index not found. Hardware errors tend to produce garbage.
I have not run repair yet. Is there anything I should do before running repair to try to figure out the issue?
Update: there is a bounty for finding/fixing this bug: https://app.bountysource.com/issues/108445140-borg-check-errors-after-upgrade-to-1-2-0