borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/

borg check errors after upgrade to 1.2.0 #6687

Closed: jdchristensen closed this issue 10 months ago

jdchristensen commented 2 years ago

Have you checked borgbackup docs, FAQ, and open Github issues?

Yes.

Is this a BUG / ISSUE report or a QUESTION?

Possible bug.

System information. For client/server mode post info for both machines.

Your borg version (borg -V).

1.2.0 on all clients and servers. Previous version was 1.1.17 (not 1.1.7, as I wrote on the mailing list).

Operating system (distribution) and version.

Debian buster and Ubuntu 20.04 on servers. Debian buster, Ubuntu 20.04 and Ubuntu 21.10 on clients.

Hardware / network configuration, and filesystems used.

Multiple local and remote clients accessing each repository. Repositories are on ext4, on RAID 1 mdadm devices, with spinning disks underlying them. The Debian server also uses lvm.

How much data is handled by borg?

The repos are all around 100GB in size, with up to 400 archives each. The repositories have been in use for many years.

Full borg command line that led to the problem (leaving out excludes and passwords)

borg check /path/to/repo [more details below]

Describe the problem you're observing.

borg check shows errors on three different repositories on two different machines. See below for details.

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

Yes, borg check shows the same errors when run again.

Include any warnings/errors/backtraces from the system logs

I upgraded from borg 1.1.17 to 1.2.0 on several different systems on about April 9. On May 9, my monthly "borg check" runs gave errors on three repositories on two systems. Note that I use the setup where several clients do their backups into the same repositories. I don't have any non-shared repositories for comparison.

At the time of the upgrade from 1.1.17 to 1.2.0, I ran borg compact --cleanup-commits ... followed by borg check ... on all repos. There were no errors then. After that, I run borg compact without --cleanup-commits followed by borg check once per month. The errors occurred at the one month mark.
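
In other words, the current monthly routine amounts to roughly this (the repo path is a placeholder):

borg compact /path/to/repo
borg check /path/to/repo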

System 1 runs Ubuntu 20.04. Two of the three repos on this machine now have errors:

# borg check /Backups/borg/home.borg
Index object count mismatch.
committed index: 1166413 objects
rebuilt index:   1166414 objects
ID: 8a158ba7fdfae9b1373063a5bb5ea8ea6698c93ed7feff89ca6ff0a3c8842ebd
rebuilt index: (18596, 199132336) committed index: <not found>
Finished full repository check, errors found.

# ls -l /Backups/borg/home.borg/data/37
total 1453100
-rw------- 2 bu bu 201259844 Dec  8  2020 18596
-rw------- 2 bu bu 185611530 Dec 12  2020 18651
-rw------- 2 bu bu 125106377 Dec 25  2020 18858
-rw------- 2 bu bu 524318301 Dec 26  2020 18874
-rw------- 2 bu bu 193813842 Dec 30  2020 18940
-rw------- 2 bu bu 116657254 Dec 30  2020 18945
-rw------- 2 bu bu 141181725 Dec 31  2020 18953

# borg check /Backups/borg/system.borg
Index object count mismatch.
committed index: 2324200 objects
rebuilt index:   2324202 objects
ID: 1e20354918f4fdeb9cc0d677c28dffe1a383dd1b0db11ebcbc5ffb809d3c2b8a
rebuilt index: (24666, 60168) committed index: <not found>
ID: d9c516b5bf53f661a1a9d2ada08c8db7c33a331713f23e058cd6969982728157
rebuilt index: (3516, 138963001) committed index: <not found>
Finished full repository check, errors found.

# ls -l /Backups/borg/system.borg/data/49
total 3316136
-rw------- 2 bu bu 500587725 Oct  5  2021 24555
-rw------- 2 bu bu 168824081 Oct  8  2021 24603
-rw------- 2 bu bu 116475028 Oct  9  2021 24619
-rw------- 2 bu bu 107446533 Oct 11  2021 24634
-rw------- 2 bu bu 252958665 Oct 12  2021 24666
-rw------- 2 bu bu 124871243 Oct 19  2021 24777
-rw------- 2 bu bu 277627834 Oct 19  2021 24793
-rw------- 2 bu bu 231932763 Oct 21  2021 24835
-rw------- 2 bu bu 114031902 Oct 22  2021 24847
-rw------- 2 bu bu 127020577 Oct 26  2021 24899
-rw------- 2 bu bu 220293895 Oct 26  2021 24907
-rw------- 2 bu bu 113238393 Oct 27  2021 24933
-rw------- 2 bu bu 525154704 Oct 27  2021 24941
-rw------- 2 bu bu 291472023 Oct 27  2021 24943
-rw------- 2 bu bu 223721033 Oct 30  2021 24987

# ls -l /Backups/borg/system.borg/data/7
total 1200244
-rw------- 2 bu bu 524615544 Feb  4  2018 3516
-rw------- 2 bu bu 145502511 Feb  5  2018 3529
-rw------- 2 bu bu 266037549 Feb 21  2018 3740
-rw------- 2 bu bu 292869056 Mar 14  2018 3951

System 2 runs Debian buster. One of the three repos on this machine now has errors:

# borg check /Backups/borg/system.borg
Index object count mismatch.
committed index: 2052187 objects
rebuilt index:   2052188 objects
ID: 6b734ed388e7e086af7107847c6b6d3d34a29c20e7e539ded71b32606cb857bd
rebuilt index: (946, 15871355) committed index: <not found>
Finished full repository check, errors found.

# ls -l /Backups/borg/system.borg/data/1
total 205308
-rw------- 1 bu bu 210234581 Jun 20  2017 946

I have used borg on these systems for years, and no hardware has changed recently. System 1 has the repos on a RAID 1 mdadm device with two SATA spinning disks. System 2 also has the repos on RAID 1 mdadm devices with two SATA disks, with lvm as a middle layer. In both cases, smartctl shows no issues for any of the drives, and memtester also shows no errors.

Since the errors have happened on different machines within a month of upgrading to 1.2.0, I am concerned that this is a borg issue rather than a hardware issue. It is also suspicious to me that the error is the same in all cases: an object present in the rebuilt index but not found in the committed index. Hardware errors tend to produce garbage.

I have not run repair yet. Is there anything I should do before running repair to try to figure out the issue?

Update: there is a bounty for finding/fixing this bug: https://app.bountysource.com/issues/108445140-borg-check-errors-after-upgrade-to-1-2-0

jdchristensen commented 2 years ago

I should also mention that I don't use encryption with any of these repos.

ThomasWaldmann commented 2 years ago

Thanks for the detailed bug report.

ThomasWaldmann commented 2 years ago

BTW, do you use borg break-lock or the --bypass-lock option in your automated scripts?

jdchristensen commented 2 years ago

I have never used break-lock or bypass-lock. I updated the top post to show the correct directories.

jdchristensen commented 2 years ago

I have made a copy of one of the repos. Can you tell me exactly how to run the borg debug get-obj command? Where will it put the output?

ThomasWaldmann commented 2 years ago

% borg debug get-obj --help
usage: borg debug get-obj REPOSITORY ID PATH

thus, for the first repo:

borg debug get-obj /Backups/borg/home.borg \
    8a158ba7fdfae9b1373063a5bb5ea8ea6698c93ed7feff89ca6ff0a3c8842ebd \
    8a158ba7fdfae9b1373063a5bb5ea8ea6698c93ed7feff89ca6ff0a3c8842ebd.chunk

But likely it will only work after you have repaired the repo index.

ThomasWaldmann commented 2 years ago

The directory listings did not reveal anything special. But considering the described circumstances this looks like a bug.

ThomasWaldmann commented 2 years ago

In case we do not find the issue otherwise, maybe keep an unmodified copy of at least one of the repos until this is resolved.

jdchristensen commented 2 years ago

I made a full copy of the third repository above (on the Debian system) and made hard link copies of the first two repositories. I'm now doing repairs on two of the repositories. When they complete, I will include the output, and do the get-obj check.

jdchristensen commented 2 years ago

On the Debian system:

# borg check -v --repair /Backups/borg/system.borg 
This is a potentially dangerous function.
check --repair might lead to data loss (for kinds of corruption it is not
capable of dealing with). BE VERY CAREFUL!

Type 'YES' if you understand this and want to continue: YES
Starting repository check
finished segment check at segment 38108
Starting repository index check
Finished full repository check, no problems found.
Starting archive consistency check...
Analyzing archive estrela-system-20170729-04:32:29 (1/236)
...
Analyzing archive boots-system-20220509-06:32:25 (236/236)
1 orphaned objects found!
Deleting 1 orphaned and 236 superseded objects...
Finished deleting orphaned/superseded objects.
Writing Manifest.
Committing repo.
Archive consistency check complete, problems found.

On the Ubuntu system:

# borg check -v --repair /Backups/borg/home.borg
This is a potentially dangerous function.
check --repair might lead to data loss (for kinds of corruption it is not
capable of dealing with). BE VERY CAREFUL!

Type 'YES' if you understand this and want to continue: YES
Starting repository check
finished segment check at segment 32373
Starting repository index check
Finished full repository check, no problems found.
Starting archive consistency check...
Analyzing archive estrela-home-20170329-10:51:14 (1/405)
...
Analyzing archive yogi-home-20220509-02:01:01 (405/405)
1 orphaned objects found!
Deleting 1 orphaned and 420 superseded objects...
Finished deleting orphaned/superseded objects.
Writing Manifest.
Committing repo.
Archive consistency check complete, problems found.

jdchristensen commented 2 years ago

On the Debian system:

# borg debug get-obj /Backups/borg/system.borg 6b734ed388e7e086af7107847c6b6d3d34a29c20e7e539ded71b32606cb857bd 6b734ed388e7e086af7107847c6b6d3d34a29c20e7e539ded71b32606cb857bd.chunk
object 6b734ed388e7e086af7107847c6b6d3d34a29c20e7e539ded71b32606cb857bd not found.

# ls -l /Backups/borg/system.borg/data/1
total 205308
-rw------- 1 bu bu 210234581 Jun 20  2017 946

On the Ubuntu system:

# borg debug get-obj /Backups/borg/home.borg 8a158ba7fdfae9b1373063a5bb5ea8ea6698c93ed7feff89ca6ff0a3c8842ebd 8a158ba7fdfae9b1373063a5bb5ea8ea6698c93ed7feff89ca6ff0a3c8842ebd.chunk
object 8a158ba7fdfae9b1373063a5bb5ea8ea6698c93ed7feff89ca6ff0a3c8842ebd not found.

# ls -l /Backups/borg/home.borg/data/37
total 1453100
-rw------- 2 bu bu 201259844 Dec  8  2020 18596
-rw------- 2 bu bu 185611530 Dec 12  2020 18651
-rw------- 2 bu bu 125106377 Dec 25  2020 18858
-rw------- 2 bu bu 524318301 Dec 26  2020 18874
-rw------- 2 bu bu 193813842 Dec 30  2020 18940
-rw------- 2 bu bu 116657254 Dec 30  2020 18945
-rw------- 2 bu bu 141181725 Dec 31  2020 18953

The repair said that orphans were being deleted, so maybe that is why they don't show up? And maybe the segment files are still there because I haven't run compact?

jdchristensen commented 2 years ago

I get the same "object ... not found" message if I try get-obj on the (unrepaired) copy of the Debian repo.

ThomasWaldmann commented 2 years ago

OK, so the 1 missing index entry was an orphan chunk. That takes quite a bit of severity off this ticket. :-)

It can't be found in the index before repair, because it is not in the index (maybe it never was added to the index).

It can't be found in the index after repair, because it was determined orphaned (no archive uses this object) and thus removed.

So what remains to be determined is what borg operation creates this orphan chunk.

ThomasWaldmann commented 2 years ago

To determine the contents of the orphan chunk, one could seek to the given offset in the given segment file and look at what's there.

jdchristensen commented 2 years ago

I tried looking at that offset in the file, but the chunks are compressed, so it wasn't obvious. If there's some way to extract the chunk and decompress it, let me know.

About the cause: would network errors cause orphaned chunks? I do occasionally see backups that don't complete successfully due to a network error. But I thought that further repo operations and/or borg compact would clean up such things.

What is particularly odd is that none of those segment files has a recent date. One of them is from 2017! Maybe a chunk in there was somehow left after a prune operation? Maybe a prune operation had a network issue?

jdchristensen commented 2 years ago

I'm also puzzled that the segment file is still there after the borg repair. Maybe the chunk will get removed after a borg compact? Or maybe these chunks are small enough that borg decides to leave them in place?

ThomasWaldmann commented 2 years ago

If a small not-used-anymore (i.e. "deleted") chunk sits in a segment file together with other still-used chunks, it may well stay there. borg computes the ratio of deleted to total size, and if that ratio is below the threshold, borg compact won't compact the segment.

In recent borg you can also give borg compact a threshold of 0, but that might move a lot of data around for not much space saving.
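
For example (the repo path is a placeholder), that would be something like:

borg compact --threshold 0 /path/to/repo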

ThomasWaldmann commented 2 years ago

BTW, as long as the chunk is logically deleted, it won't show as orphan.

ThomasWaldmann commented 2 years ago

You could try this (it should decrypt and decompress as needed). It dumps ALL objects, so it needs quite a bit of space and time.

mkdir myrepo-dump
cd myrepo-dump
borg debug dump-repo-objs --ghost ../myrepo

ThomasWaldmann commented 2 years ago

From the borg check docs:

In repair mode, when all the archives were checked, orphaned chunks are deleted
from the repo. One cause of orphaned chunks are input file related errors (like
read errors) in the archive creation process.

E.g. if it starts backing up file_with_ioerror_in_the_middle: some chunks of that file may already have been stored in the repo before the read error aborts the file, and no archive item ends up referencing them, leaving them orphaned.

Maybe borg could do some cleanup in an exception handler, but IIRC it does not do that yet.

jdchristensen commented 2 years ago

None of this explains why there would be orphan chunks from years ago. I run borg check every month, and it never reported any errors until yesterday, so I don't think these orphan chunks could have arisen from I/O errors at the time of backup. Somehow they recently became orphans. Could there be a bug in prune? Or maybe several prune operations got interrupted, all in the same month, even though that never caused an error in the past five years?

ThomasWaldmann commented 2 years ago

Yeah, guess that's right. Would be interesting to look at the plaintext of that chunk.

jdchristensen commented 2 years ago

I don't think it's worth the time/space to extract all chunks. I tried using

dd skip=15871355 count=1000000 if=system-bad.borg/data/1/946 of=chunk bs=1

to extract 1MB starting where the chunk starts (this is on the Debian system), and then using od -x chunk to view the bytes, but I don't see any zlib magic bytes there. A debug option to extract a chunk based on segment and offset would be handy.

Here is the start of the od -x chunk output:

0000000 3e63 51fa 0033 0000 6b00 4e73 88d3 e0e7
0000020 af86 0771 7c84 6d6b 343d 9ca2 e720 39e5
0000040 d7de 321b 6c60 57b8 02bd 0000 302d 332e
0000060 3832 100a 517c ceea 0000 0000 c4d8 9186
0000100 a311 4cb9 0ebb d948 783d 1c02 1a37 fc6f
0000120 040e 54d8 7699 1d58 3cf3 758c 7802 559c
0000140 4b8e 830e 0c30 ef44 35e2 ce41 721f 2e0e
0000160 8110 a8f9 5688 da49 e20d 0dee 5510 caea
0000200 f1e3 db1b 0c3b 793e 3b70 cff4 5c6d 5ba6
0000220 3a05 a06b 5031 4f9c fab9 c098 4081 196e
0000240 c66a 876d 29ca 44e9 8b53 2b56 0da9 2c97
0000260 885b 1c70 7547 ecf1 2f97 c339 7604 4e88
0000300 c769 8d4d d68d 95b2 96d6 055c 9f2f 99fa
0000320 f4ad 15a1 a99c 1220 340d 4b80 4de1 6779
0000340 fb3f a26d df98 4dc9 f1f2 e451 eb75 b21e
0000360 e325 58bc f227 ac67 bb7e 3c09 78be 49cd

ThomasWaldmann commented 2 years ago

It's a bit hard to decipher due to the 16-bit formatting with swapped bytes, but:

crc32 = 51fa3e63 
len32 = 00000033
tag8 = 00 (PUT)
id256 = 6b734ed388e7e086af7107847c6b6d3d34a29c20e7e539ded71b32606cb857bd
compression = 0200 (lzma or rather "xz")
data = ... (only 8 bytes)

Hmm, does not decompress, strange.
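
For anyone who wants to script this instead of decoding a hexdump by hand, here is a minimal Python sketch. It assumes the entry layout above (crc32, size, tag, 32-byte id, then data whose first two bytes are the compression marker) and assumes the crc covers everything after the crc field; it is not borg's own parser.

import struct
import zlib

def read_put_entry(segment_path, offset):
    # parse one PUT entry at a known offset in a borg 1.x segment file
    with open(segment_path, 'rb') as f:
        f.seek(offset)
        header = f.read(41)                  # crc32 (4) + size (4) + tag (1) + id (32)
        crc, size, tag = struct.unpack('<IIB', header[:9])
        chunk_id = header[9:41]
        data = f.read(size - 41)             # size includes the 41-byte header
        crc_ok = crc == zlib.crc32(header[4:] + data)   # assumed crc coverage
        return tag, chunk_id.hex(), data, crc_ok

# e.g. for the chunk discussed above:
# tag, chunk_id, data, crc_ok = read_put_entry('system-bad.borg/data/1/946', 15871355)
# data[:2] is then the compression marker (02 00 here), data[2:] the 8 payload bytes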

ThomasWaldmann commented 2 years ago

It seems it is uncompressed data:

# d holds the 8 data bytes after the 43-byte header (00 2d 30 2e 33 32 38 0a)
bytes.fromhex(d).decode()
'\x00-0.328\n'

jdchristensen commented 2 years ago

So a null byte, followed by the ascii string "-0.328\n". Strange. I confirmed this by adding 43 to the offset and extracting 8 bytes from the segment file. I don't know what that file could have been.

I should have mentioned that I use --compression auto,zlib, so some data won't be compressed, which happened here since such short data doesn't compress well. So I'm surprised there are bytes indicating compression.

Edit: for the record, the length above, 0x33, is 51 in decimal, and includes the 43-byte header, which is why the data portion is 8 bytes long.

jdchristensen commented 2 years ago

I repaired the third repository, the other one on the Ubuntu system. Similar results.

# borg check -v --repair /Backups/borg/system.borg
This is a potentially dangerous function.
check --repair might lead to data loss (for kinds of corruption it is not
capable of dealing with). BE VERY CAREFUL!

Type 'YES' if you understand this and want to continue: YES
Starting repository check
finished segment check at segment 28246
Starting repository index check
Finished full repository check, no problems found.
Starting archive consistency check...
Analyzing archive estrela-system-20170426-08:55:29 (1/289)
...
Analyzing archive yogi-system-20220509-02:04:03 (289/289)
2 orphaned objects found!
Deleting 2 orphaned and 294 superseded objects...
Finished deleting orphaned/superseded objects.
Writing Manifest.
Committing repo.
Archive consistency check complete, problems found.

ThomasWaldmann commented 2 years ago

A theory of how this could happen: see https://github.com/borgbackup/borg/issues/6687#issuecomment-1121333995.

Did some reviews, but did not find anything suspicious:

git diff -r 1.1.16 -r 1.2.0 src/borg/repository.py
git diff -r 1.1.16 -r 1.2.0 src/borg/_hashindex.c
git diff -r 1.1.16 -r 1.2.0 src/borg/hashindex.pyx

But one thing came to my mind:

I have to investigate more whether that could be an issue (though I guess it shouldn't be, since the hints file could also just get lost at some point).

ams-tschoening commented 2 years ago

I think I've run into the same problem, but with Borg 1.1.6, and I didn't upgrade at all. My repo only contains 7 days of archive history, and I run repository and archive checks after each backup every day; I never had these problems in the past.

[...]
Remote: checking segment file /home/bak_borg/pve/data/81/81759...
Remote: checking segment file /home/bak_borg/pve/data/81/81760...
Remote: checking segment file /home/bak_borg/pve/data/81/81761...
Remote: checking segment file /home/bak_borg/pve/data/81/81766...
Remote: checking segment file /home/bak_borg/pve/data/81/81767...
Remote: checking segment file /home/bak_borg/pve/data/81/81768...
Remote: checking segment file /home/bak_borg/pve/data/81/81769...
Remote: checking segment file /home/bak_borg/pve/data/81/81770...
Remote: checking segment file /home/bak_borg/pve/data/81/81771...
Remote: checking segment file /home/bak_borg/pve/data/81/81772...
Remote: checking segment file /home/bak_borg/pve/data/81/81773...
Remote: checking segment file /home/bak_borg/pve/data/81/81774...
Remote: checking segment file /home/bak_borg/pve/data/81/81775...
Remote: checking segment file /home/bak_borg/pve/data/81/81776...
Remote: checking segment file /home/bak_borg/pve/data/81/81778...
Remote: checking segment file /home/bak_borg/pve/data/81/81780...
Remote: finished segment check at segment 81780
Remote: Starting repository index check
Remote: Index object count mismatch.
Remote: committed index: 865134 objects
Remote: rebuilt index:   865135 objects
Remote: ID: 9295605fcaf23f8609b0dae0f8e93459a8db5621415e70745cd8f24f39967644 rebuilt index: (38222, 119007523) committed index: <not found>
Remote: Finished full repository check, errors found.
RemoteRepository: 169 B bytes sent, 4.58 MB bytes received, 3 messages sent
terminating with warning status, rc 1
[...]sudo borgmatic --config ~/.config/borgmatic.d/pve.yaml --verbosity 2 borg check -v --repair
[...]
Remote: complete_xfer: deleting unused segment 81780
Remote: complete_xfer: deleting unused segment 81782
Remote: complete_xfer: deleting unused segment 81784
Remote: compaction freed about 1.07 kB repository space.
Remote: compaction completed.
Finished committing repo.
Archive consistency check complete, problems found.
RemoteRepository: 3.60 MB bytes sent, 258.55 MB bytes received, 1503 messages sent
terminating with success status, rc 0
[...]sudo borgmatic --config ~/.config/borgmatic.d/pve.yaml --verbosity 2 borg check -v --repair
[...]
Remote: not compacting segment 81761 (maybe freeable: 0.06% [205533 bytes])
Remote: not compacting segment 81766 (maybe freeable: 0.04% [232783 bytes])
Remote: not compacting segment 81776 (maybe freeable: 0.00% [6123 bytes])
Remote: compacting segment 81783 with usage count 0 (maybe freeable: 99.26% [1068 bytes])
Remote: compacting segment 81785 with usage count 0 (maybe freeable: 100.00% [17 bytes])
Remote: compacting segment 81786 with usage count 0 (maybe freeable: 83.67% [41 bytes])
Remote: compacting segment 81788 with usage count 0 (maybe freeable: 52.94% [9 bytes])
Remote: complete_xfer: wrote commit at segment 81789
Remote: complete_xfer: deleting unused segment 81783
Remote: complete_xfer: deleting unused segment 81785
Remote: complete_xfer: deleting unused segment 81786
Remote: complete_xfer: deleting unused segment 81788
Remote: compaction freed about 1.07 kB repository space.
Remote: compaction completed.
Finished committing repo.
Archive consistency check complete, no problems found.
RemoteRepository: 76.11 kB bytes sent, 246.55 MB bytes received, 1422 messages sent
terminating with success status, rc 0

ThomasWaldmann commented 2 years ago

@ams-tschoening maybe also some orphaned chunks due to errors when processing input files.

You have a lot of debug messages there, but for the repair I did not see the real issue. Maybe check the part you shortened to see whether there is something interesting.

ams-tschoening commented 2 years ago

I don't have access to the output of the repair anymore; I was dumb and forgot to pipe things into a file. :-) If things work again tomorrow night, I don't care too much; otherwise I will do the same as before and pipe into files.

The important thing, in my opinion, is that it's the same error without upgrading the software, isn't it? So it's most likely not a newly introduced issue.

ThomasWaldmann commented 2 years ago

Yeah, maybe it is a coincidence, maybe not.

ThomasWaldmann commented 2 years ago

I filed #6709 about the known case that can create orphaned chunks. While thinking about that, I noticed that those orphans likely do have a repo index entry, just no reference from an archived item.

Thus, since we see a missing repo index entry here, the root cause is likely something else.

ThomasWaldmann commented 2 years ago

Maybe it would be nice to have a bigger "bad chunk" sample to better see what it actually is.

Please hexdump with byte width (not 16-bit); an additional ASCII display would also be nice.
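
For example, something along these lines (GNU od; the filename is a placeholder):

od -A d -t x1z -v chunk | head -n 40

That prints individual bytes in hex with decimal offsets and an ASCII column at the end of each line.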

ams-tschoening commented 2 years ago

The same problem happened today as well, with two different repos. I didn't have time to debug this further, as borg debug get-obj and related commands don't seem to be compatible with borgmatic, but at least I have logs this time. Maybe they are of some help. Repairing seems to have fixed it for now, though it makes me wonder, given that I didn't have this problem for months...

Remote: checking segment file /home/bak_borg/hosts/group.bitstore.amsoft/data/22/22376...
Remote: finished segment check at segment 22376
Remote: Starting repository index check
Remote: Index object count mismatch.
Remote: committed index: 2738742 objects
Remote: rebuilt index:   2738743 objects
Remote: ID: 28eac01d646f253f847681d6e46f342d53873c905c6eb839ff3181e84232e4f5 rebuilt index: (19807, 297575235) committed index: <not found>
Remote: Finished full repository check, errors found.
RemoteRepository: 195 B bytes sent, 2.42 MB bytes received, 3 messages sent
terminating with warning status, rc 1
Command 'borg check --prefix net.elrev.software- --debug --show-rc --umask 0007 amsoft-sbox.bitstore.group:bak_borg/hosts/group.bitstore.amsoft' returned non-zero exit status 1.

Remote: checking segment file /home/bak_borg/pve/data/82/82980...
Remote: finished segment check at segment 82980
Remote: Starting repository index check
Remote: Index object count mismatch.
Remote: committed index: 886657 objects
Remote: rebuilt index:   886658 objects
Remote: ID: 705a9f05d1e7a8ed3fa719de2455e973b4c5401c8ee9da713c91971ac14df9e6 rebuilt index: (38206, 493217629) committed index: <not found>
Remote: Finished full repository check, errors found.
RemoteRepository: 169 B bytes sent, 746.83 kB bytes received, 3 messages sent
terminating with warning status, rc 1

Attachments: elv.soft_repair2.zip, pve_repair.zip, pve_repair2.zip, elv.soft_repair.zip

ThomasWaldmann commented 2 years ago

Guess I can't do much about this without a bigger sample of a "bad chunk".

jdchristensen commented 2 years ago

Can you add a debug command that extracts a chunk given the segment and offset?

ThomasWaldmann commented 2 years ago

@jdchristensen using PR #6722 (will merge soon into 1.2-maint):

borg debug dump-repo-objs --ghost --segment=S --offset=O LOCALREPO 

The offset can be left out; then it will dump the whole segment file, which may be even more interesting.

S and O should be valid; there are no checks (except that they must be positive integers).

jdchristensen commented 2 years ago

Ok, I have that working. I extracted the chunk (18596, 199132336) in the home.borg repo on the Ubuntu machine, and it is 8816 bytes of binary data that I don't recognize. The previous and next chunks are both text files that seem to be from google-chrome. Here is the previous file:

2020/10/10-18:31:57.929 12c3 Reusing MANIFEST /home/jdc/.config/google-chrome/Default/Managed Extension Settings/cfhdojbkjhnklbpkdaibdccddilifddb/MANIFEST-000001
2020/10/10-18:31:57.929 12c3 Recovering log #3
2020/10/10-18:31:57.930 12c3 Reusing old log /home/jdc/.config/google-chrome/Default/Managed Extension Settings/cfhdojbkjhnklbpkdaibdccddilifddb/000003.log 

The segment file is dated Dec 2020.

Next I looked at chunk (24666, 60168) in the system.borg repo on the Ubuntu machine. It is a short text file from the Debian machine (which backs itself up to the Ubuntu machine). It is not a file that would change during the backup, and it is complete. The segment file here is from Oct 2021.

Finally, I looked at chunk (3516, 138963001) in the system.borg repo on the Ubuntu machine. The chunk is 524288 bytes long, and is part of a file that looks like it is related to borg or attic. It is part text, part binary, and contains things like "hardlink_master¤modé" and "root<8B>chunks". The segment file here is from 2018.

Not sure if any of this helps debug it...

jdchristensen commented 2 years ago

Oh, I do have a bit more information. I had accidentally done the above commands on the repaired repos, instead of on the (hardlinked) copies I made before repairing. I redid the commands on the pre-repair copies. Nothing changed for the first two segment files, but for the third segment file (3516 in the system.borg repo on the Ubuntu machine), I get a traceback:

# borg debug dump-repo-objs --ghost --segment=3516 /Backups/borg/system-bad.borg
Exception ignored in: <function Repository.__del__ at 0x7fd12d790940>
Traceback (most recent call last):
  File "/home/scratchy/computers/backups/borg/src/borg/repository.py", line 190, in __del__
    assert False, "cleanup happened in Repository.__del__"
AssertionError: cleanup happened in Repository.__del__
Local Exception
Traceback (most recent call last):
  File "/home/scratchy/computers/backups/borg/src/borg/archiver.py", line 5115, in main
    exit_code = archiver.run(args)
  File "/home/scratchy/computers/backups/borg/src/borg/archiver.py", line 5046, in run
    return set_ec(func(args))
  File "/home/scratchy/computers/backups/borg/src/borg/archiver.py", line 168, in wrapper
    with repository:
  File "/home/scratchy/computers/backups/borg/src/borg/repository.py", line 200, in __enter__
    self.open(self.path, bool(self.exclusive), lock_wait=self.lock_wait, lock=self.do_lock)
  File "/home/scratchy/computers/backups/borg/src/borg/repository.py", line 438, in open
    self.config.read_file(fd)
  File "/usr/lib/python3.8/configparser.py", line 718, in read_file
    self._read(f, source)
  File "/usr/lib/python3.8/configparser.py", line 1017, in _read
    for lineno, line in enumerate(fp, start=1):
  File "/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 5: invalid continuation byte

Platform: Linux jdc 5.4.0-113-generic #127-Ubuntu SMP Wed May 18 14:30:56 UTC 2022 x86_64
Linux: Unknown Linux  
Borg: 1.2.1.dev137+g5679eb16  Python: CPython 3.8.10 msgpack: 1.0.3 fuse: None [pyfuse3,llfuse]
PID: 945549  CWD: /tmp/borg-debug/system-3516-bad
sys.argv: ['/home/scratchy/computers/backups/borg-env/bin/borg', 'debug', 'dump-repo-objs', '--ghost', '--segment=3516', '/Backups/borg/system-bad.borg']
SSH_ORIGINAL_COMMAND: None

I get a similar error if I just ask for one chunk:

# borg debug dump-repo-objs --ghost --segment=3516 --offset=138963001 /Backups/borg/system-bad.borg
Exception ignored in: <function Repository.__del__ at 0x7f606cf01940>
Traceback (most recent call last):
  File "/home/scratchy/computers/backups/borg/src/borg/repository.py", line 190, in __del__
    assert False, "cleanup happened in Repository.__del__"
AssertionError: cleanup happened in Repository.__del__
Local Exception
Traceback (most recent call last):
  File "/home/scratchy/computers/backups/borg/src/borg/archiver.py", line 5115, in main
    exit_code = archiver.run(args)
  File "/home/scratchy/computers/backups/borg/src/borg/archiver.py", line 5046, in run
    return set_ec(func(args))
  File "/home/scratchy/computers/backups/borg/src/borg/archiver.py", line 168, in wrapper
    with repository:
  File "/home/scratchy/computers/backups/borg/src/borg/repository.py", line 200, in __enter__
    self.open(self.path, bool(self.exclusive), lock_wait=self.lock_wait, lock=self.do_lock)
  File "/home/scratchy/computers/backups/borg/src/borg/repository.py", line 438, in open
    self.config.read_file(fd)
  File "/usr/lib/python3.8/configparser.py", line 718, in read_file
    self._read(f, source)
  File "/usr/lib/python3.8/configparser.py", line 1017, in _read
    for lineno, line in enumerate(fp, start=1):
  File "/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 5: invalid continuation byte

Platform: Linux jdc 5.4.0-113-generic #127-Ubuntu SMP Wed May 18 14:30:56 UTC 2022 x86_64
Linux: Unknown Linux  
Borg: 1.2.1.dev137+g5679eb16  Python: CPython 3.8.10 msgpack: 1.0.3 fuse: None [pyfuse3,llfuse]
PID: 945559  CWD: /tmp/borg-debug/system-3516-bad
sys.argv: ['/home/scratchy/computers/backups/borg-env/bin/borg', 'debug', 'dump-repo-objs', '--ghost', '--segment=3516', '--offset=138963001', '/Backups/borg/system-bad.borg']
SSH_ORIGINAL_COMMAND: None

ThomasWaldmann commented 2 years ago

Regarding the tracebacks: have a look at repo_dir/config; the file should be pure ASCII.
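
For comparison (the values below are only illustrative), an intact borg 1.x repo config is a short plain-ASCII INI file along these lines:

[repository]
version = 1
segments_per_dir = 1000
max_segment_size = 524288000
append_only = 0
additional_free_space = 0
id = <64 hex digits>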

jdchristensen commented 2 years ago

Weird, /Backups/borg/system-bad.borg/config is binary junk now. It's still 190 bytes long, the same as /Backups/borg/system.borg/config, but doesn't seem to have any characters in common with the original file. I did a hardlink copy of the system.borg repo to system-bad.borg, but I guess borg rewrites the config file sometimes, which broke the hard link? Not sure how the file got corrupted, though. These are on a RAID1 filesystem, with SMART tests run daily showing no errors.

KhalilSantana commented 2 years ago

Hello, I think I've hit the same bug:

I migrated my repo from 1.1.17 to 1.2.0 soon after its release. I strictly followed the migration guide and ran a check before and after the upgrade, as well as a compact; no problems were detected since then, and the repo was perfectly healthy.

Until last week, when a periodic check returned this:

Remote: Starting repository index check
Remote: Index object count mismatch.
Remote: committed index: 551361 objects
Remote: rebuilt index:   551364 objects
Remote: ID: 0d9ef4dcc8412e637ad418b549c92c64114a8274f0c5d2b808dd78792fd7fc65 rebuilt index: (121, 88605140)  committed index: <not found>     
Remote: ID: 49923477c766d628c614e7793508710a1882846892aedbeff2119ca99b5df1aa rebuilt index: (17111, 7383)    committed index: <not found>     
Remote: ID: 80ea043a5d3e9706c34379e9f622ce8758366602e6b5545a2074a8535daec849 rebuilt index: (1689, 110787728) committed index: <not found>     
Remote: Finished full repository check, errors found.

I've since halted my automated backups and cloned the repo locally in case I need to run some potentially destructive commands. The host is running Arch Linux on AMD64 bare metal; the source filesystem is BTRFS (no errors logged there either). The repo itself uses repokey and auto,zstd as options.

I might also have cleaned ~/.cache between the last successful check and this one; I'm not sure if that helps narrow it down.

jdchristensen commented 2 years ago

Here's more info about chunk (24666, 60168) in the system.borg repo on the Ubuntu machine. It's in a segment file dated Oct 2021. It's one of several short text files that I create before each system backup to aid in recovery. It is created with cat /proc/mdstat > mdstat. So it always has a new mtime and ctime, but the data is generally unchanged. I checked, and the file contains exactly the same data right now. So that chunk should still be in the repo. Maybe that is a clue?

ams-tschoening commented 2 years ago

I think I've run into the same problem, but with Borg 1.1.6, and I didn't upgrade at all. [...]

I was wrong: while I still have Borg 1.1.16 on my own server, I'm backing up using client/server mode to a Hetzner StorageBox. According to their docs, if --remote-path=borg-1.1 is not used (which I don't use), Borg 1.2 is already used for the server-side process.

https://community.hetzner.com/tutorials/install-and-configure-borgbackup/de#schritt-33---borg-version-am-server

ThomasWaldmann commented 2 years ago

OK, summarizing what we have now:

jdchristensen commented 2 years ago

And one of the chunks that was orphaned should not have been freed by any prune operation, since a file with that content continued to be backed up.

jdchristensen commented 2 years ago

I ran another couple of passes of memtester, with no errors.

ams-tschoening commented 2 years ago

I'm being hit by this error for the same two repos (out of around 12) every few days now. All of those repos are used and checked daily, and after repairing the two affected repos the error doesn't occur again for the next few days. One of the repos is small enough to be copied locally onto the server, so the following commands could be used:

[...]
Remote: checking segment file /[...]/data/23/23981...
Remote: finished segment check at segment 23981
Remote: Starting repository index check
Remote: Index object count mismatch.
Remote: committed index: 2792769 objects
Remote: rebuilt index:   2792770 objects
Remote: ID: 28eac01d646f253f847681d6e46f342d53873c905c6eb839ff3181e84232e4f5 rebuilt index: (19807, 297575235) committed index: <not found>
Remote: Finished full repository check, errors found.
RemoteRepository: 209 B bytes sent, 307.04 kB bytes received, 3 messages sent
terminating with warning status, rc 1
borg debug dump-repo-objs --ghost --segment=19807 --offset=297575235 [...]
borg debug dump-repo-objs --ghost --segment=19807 [...]

00003923_19807_297575108_put_aa8e6a8f8e1537940efb20831830d3efb7ceaa4b2d1115a0ca7dc656ff8abdb1.obj
00003924_19807_297575235_put_28eac01d646f253f847681d6e46f342d53873c905c6eb839ff3181e84232e4f5.obj
00003925_19807_297575327_put_bc566c3a092e0bf4b87b36580f703ae67837ad96ba0a7104b3898257297c7f39.obj

-331394.093750 1647730981 0.000000
1647730981
UTC
15.922
rty (private 1, global 1)
NOTICE:  FlushRelationBuffers(amdte0, 0): block 15 is dirty (private 1, global 1)
NOTICE:  FlushRelationBuffers(amdte0, 0): block 16 is dirty (private 1, global 1)
[...]

Attachment: Downloads.zip

Is that of any help? I'll keep the whole dumps around for some days.

ThomasWaldmann commented 2 years ago

Looks like stuff from /etc or /var/log.