borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/

borg check errors after upgrade to 1.2.0 #6687

Closed jdchristensen closed 10 months ago

jdchristensen commented 2 years ago

Have you checked borgbackup docs, FAQ, and open Github issues?

Yes.

Is this a BUG / ISSUE report or a QUESTION?

Possible bug.

System information. For client/server mode post info for both machines.

Your borg version (borg -V).

1.2.0 on all clients and servers. Previous version was 1.1.17 (not 1.1.7, as I wrote on the mailing list).

Operating system (distribution) and version.

Debian buster and Ubuntu 20.04 on servers. Debian buster, Ubuntu 20.04 and Ubuntu 21.10 on clients.

Hardware / network configuration, and filesystems used.

Multiple local and remote clients accessing each repository. Repositories are on ext4, on RAID 1 mdadm devices, with spinning disks underlying them. The Debian server also uses lvm.

How much data is handled by borg?

The repos are all around 100GB in size, with up to 400 archives each. The repositories have been in use for many years.

Full borg command line that led to the problem (leave out excludes and passwords)

borg check /path/to/repo [more details below]

Describe the problem you're observing.

borg check shows errors on three different repositories on two different machines. See below for details.

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

Yes, borg check shows the same errors when run again.

Include any warning/errors/backtraces from the system logs

I upgraded from borg 1.1.17 to 1.2.0 on several different systems on about April 9. On May 9, my monthly "borg check" runs gave errors on three repositories on two systems. Note that I use the setup where several clients do their backups into the same repositories. I don't have any non-shared repositories for comparison.

At the time of the upgrade from 1.1.17 to 1.2.0, I ran borg compact --cleanup-commits ... followed by borg check ... on all repos. There were no errors then. After that, I run borg compact without --cleanup-commits followed by borg check once per month. The errors occurred at the one month mark.
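
For reference, a minimal sketch of that maintenance sequence (the repo path is a placeholder):

```
# once, at upgrade time from 1.1.17 to 1.2.0:
borg compact --cleanup-commits /path/to/repo
borg check /path/to/repo

# monthly since then:
borg compact /path/to/repo
borg check /path/to/repo
```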

System 1 runs Ubuntu 20.04. Two of the three repos on this machine now have errors:

# borg check /Backups/borg/home.borg
Index object count mismatch.
committed index: 1166413 objects
rebuilt index:   1166414 objects
ID: 8a158ba7fdfae9b1373063a5bb5ea8ea6698c93ed7feff89ca6ff0a3c8842ebd
rebuilt index: (18596, 199132336) committed index: <not found>
Finished full repository check, errors found.

# ls -l /Backups/borg/home.borg/data/37
total 1453100
-rw------- 2 bu bu 201259844 Dec  8  2020 18596
-rw------- 2 bu bu 185611530 Dec 12  2020 18651
-rw------- 2 bu bu 125106377 Dec 25  2020 18858
-rw------- 2 bu bu 524318301 Dec 26  2020 18874
-rw------- 2 bu bu 193813842 Dec 30  2020 18940
-rw------- 2 bu bu 116657254 Dec 30  2020 18945
-rw------- 2 bu bu 141181725 Dec 31  2020 18953

# borg check /Backups/borg/system.borg
Index object count mismatch.
committed index: 2324200 objects
rebuilt index:   2324202 objects
ID: 1e20354918f4fdeb9cc0d677c28dffe1a383dd1b0db11ebcbc5ffb809d3c2b8a
rebuilt index: (24666, 60168) committed index: <not found>
ID: d9c516b5bf53f661a1a9d2ada08c8db7c33a331713f23e058cd6969982728157
rebuilt index: (3516, 138963001) committed index: <not found>
Finished full repository check, errors found.

# ls -l /Backups/borg/system.borg/data/49
total 3316136
-rw------- 2 bu bu 500587725 Oct  5  2021 24555
-rw------- 2 bu bu 168824081 Oct  8  2021 24603
-rw------- 2 bu bu 116475028 Oct  9  2021 24619
-rw------- 2 bu bu 107446533 Oct 11  2021 24634
-rw------- 2 bu bu 252958665 Oct 12  2021 24666
-rw------- 2 bu bu 124871243 Oct 19  2021 24777
-rw------- 2 bu bu 277627834 Oct 19  2021 24793
-rw------- 2 bu bu 231932763 Oct 21  2021 24835
-rw------- 2 bu bu 114031902 Oct 22  2021 24847
-rw------- 2 bu bu 127020577 Oct 26  2021 24899
-rw------- 2 bu bu 220293895 Oct 26  2021 24907
-rw------- 2 bu bu 113238393 Oct 27  2021 24933
-rw------- 2 bu bu 525154704 Oct 27  2021 24941
-rw------- 2 bu bu 291472023 Oct 27  2021 24943
-rw------- 2 bu bu 223721033 Oct 30  2021 24987

# ls -l /Backups/borg/system.borg/data/7
total 1200244
-rw------- 2 bu bu 524615544 Feb  4  2018 3516
-rw------- 2 bu bu 145502511 Feb  5  2018 3529
-rw------- 2 bu bu 266037549 Feb 21  2018 3740
-rw------- 2 bu bu 292869056 Mar 14  2018 3951

System 2 runs Debian buster. One of the three repos on this machine now has errors:

# borg check /Backups/borg/system.borg
Index object count mismatch.
committed index: 2052187 objects
rebuilt index:   2052188 objects
ID: 6b734ed388e7e086af7107847c6b6d3d34a29c20e7e539ded71b32606cb857bd
rebuilt index: (946, 15871355) committed index: <not found>
Finished full repository check, errors found.

# ls -l /Backups/borg/system.borg/data/1
total 205308
-rw------- 1 bu bu 210234581 Jun 20  2017 946

I have used borg on these systems for years, and no hardware has changed recently. System 1 has the repos on a RAID 1 mdadm device with two SATA spinning disks. System 2 also has the repos on RAID 1 mdadm devices with two SATA disks, with lvm as a middle layer. In both cases, smartctl shows no issues for any of the drives, and memtester also shows no errors.

Since the errors have happened on different machines within a month of upgrading to 1.2.0, I am concerned that this is a borg issue rather than a hardware issue. It is also suspicious to me that the error is the same in all cases, with a committed index not found. Hardware errors tend to produce garbage.

I have not run repair yet. Is there anything I should do before running repair to try to figure out the issue?

Update: there is a bounty for finding/fixing this bug: https://app.bountysource.com/issues/108445140-borg-check-errors-after-upgrade-to-1-2-0

ThomasWaldmann commented 2 years ago

One thing I need to check (again) is whether --cleanup-commits is a bit too aggressive (removing all 17-byte files); should it maybe keep the last one? The following compaction run will add a new commit at the end, but maybe we still need the previous one.

But what @jdchristensen has seen does not point in that direction.

ThomasWaldmann commented 2 years ago

from irc:

23:21  amiconn$ Hmm, 7 orphaned and 19 superseded objects (following an index rebuild) during borg check within 2 weeks after migrating to borg 1.2. With borg 1.1. there never were any repository errors...
23:22  ThomasWaldmann$ amiconn: was that the first check after switching?
23:22  amiconn$ nope
23:23  amiconn$ The first check found 3 orphaned objects, which were fixed by borg check --repair
23:24  amiconn$ I did the first check directly after migration as recommended by the instructions
23:28  amiconn$ Btw 19 superseded objects coincides with the number of archives in the repo

FabioPedretti commented 2 years ago

If useful: I am using borg 1.2 (always the latest release) and run compact --cleanup-commits once a week on my years-old 1.1 repo, and have never had any check issue...

FabioPedretti commented 2 years ago

Adding some more info on my usage of borg 1.2 on my 1.1 repo:

I rarely do checks and I always use --repository-only. I just did one and it found no problems.

ThomasWaldmann commented 2 years ago

@FabioPedretti the repository-only check can only find certain specific errors; it cannot, for example, know about orphaned chunks.
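
For illustration, the two variants being compared (repo path is a placeholder):

```
borg check --repository-only /path/to/repo   # segment and index level only, faster
borg check /path/to/repo                     # full check; cross-references archives, can find orphans
```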

FabioPedretti commented 2 years ago

OK, I may try again without it in the future. Do you think it would take a lot more time? It was over 26h with --repository-only on a repo with 2.62 TB deduplicated size, 5429445 unique chunks, and 3113 archives.

ThomasWaldmann commented 2 years ago

A full check adds time proportional to archive_count * archive_size, so if you have thousands of large archives in there, it will take a long time.

jdchristensen commented 2 years ago

I was going to suggest that @FabioPedretti could run a check with -v --archives-only --last 10 to save time, but I tested it and I see that it prints a message saying that it skips checking for orphans unless all archives are checked (which makes sense).

davidmehren commented 2 years ago

I'm also experiencing check errors in a bunch of borg repos. Multiple hosts run backups to a central VM. Each client has its own repo. The backups reside on an ext4 file system (VM) on ZFS (hypervisor) and are automated using borgmatic.

Example error:

Details:

```
root@sharelatex2 ~# borgmatic check -v2 --progress
Ensuring legacy configuration is upgraded
borg --version --debug --show-rc
/etc/borgmatic/config.yaml: No commands to run for pre-check hook
borgbackup@10.0.42.11:backup: Running consistency checks
borg check --last 3 --prefix {hostname}- --debug --show-rc --progress borgbackup@10.0.42.11:backup
> using builtin fallback logging configuration
38 self tests completed in 0.09 seconds
SSH command line: ['ssh', 'borgbackup@10.0.42.11', 'borg', 'serve', '--debug']
Remote: using builtin fallback logging configuration
Remote: 38 self tests completed in 1.20 seconds
Remote: using builtin fallback logging configuration
Remote: Initialized logging system for JSON-based protocol
Remote: Resolving repository path b'backup'
Remote: Resolved repository path to '/mnt/borgpool/sharelatex2/backup'
Remote: Starting repository check
Remote: Verified integrity of /mnt/borgpool/sharelatex2/backup/index.2376
Remote: Read committed index of transaction 2376
Remote: Segment transaction is 2376
Remote: Determined transaction is 2376
Remote: Found 45 segments
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/0/6...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/0/7...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/0/8...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/0/40...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/1/1573...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2181...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2182...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2183...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2184...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2185...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2186...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2187...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2263...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2275...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2277...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2281...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2285...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2289...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2293...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2297...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2301...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2313...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2314...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2316...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2320...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2324...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2328...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2332...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2336...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2340...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2344...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2345...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2349...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2353...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2357...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2366...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2368...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2369...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2370...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2371...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2372...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2373...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2374...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2375...
Remote: checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2376...
Remote: finished segment check at segment 2376
Remote: Starting repository index check
Remote: Index object count mismatch.
Remote: committed index: 95698 objects
Remote: rebuilt index: 95731 objects
Remote: ID: 55afd60bf726e1907ca5f054a20004d0ae5cda1a6e8f0316a7cfaa529ac8e16a rebuilt index: (6, 21792667) committed index: <not found>
Remote: ID: 210aae2323a92fe7b2335c082d36df16e32525f7de0101158c23e0f99df2d41b rebuilt index: (6, 21778215) committed index: <not found>
Remote: ID: d9e679adcbf6ac4214abddfb4dc399e3ccf5c56cbc96e837091c9d07e538a5ca rebuilt index: (6, 21782622) committed index: <not found>
Remote: ID: 15c700769ebec48066ebc3bad40c089261d2236805f7a572ff997dfa5c01971c rebuilt index: (6, 21798403) committed index: <not found>
Remote: ID: c396c8b8dc52beee17fcc3ceaef77be1df99134f99a01b721de071974947a5a8 rebuilt index: (6, 21801337) committed index: <not found>
Remote: ID: da8c90fe91e6109c8c9a47b89d85dadbbaf879416a4b7625d47cae07ef0c4ace rebuilt index: (6, 21784485) committed index: <not found>
Remote: ID: b2d76dce9c7dbaf271f522ea0e8c674f110d0b8a417e507dc1cb051ffaae9ff4 rebuilt index: (6, 21780283) committed index: <not found>
Remote: ID: c1f6309b1dfafa603034e073175abb42c0683289fb5b57c7b08a85a6cf6b3ed4 rebuilt index: (6, 21808344) committed index: <not found>
Remote: ID: 79d8953f5098bf6f1d9f399a8f88994bc5486dd0b2ed14258198f959b0412a68 rebuilt index: (6, 21782261) committed index: <not found>
Remote: ID: 9890f9ea3a16ae296a8a2490cd03f4e7089130449bc58a6e23c5520be74ab310 rebuilt index: (6, 21816759) committed index: <not found>
Remote: ID: 7a38190f09fd9e2573fe622be72f1231889a2650ae279b80a5d6e98386488201 rebuilt index: (1573, 436658473) committed index: <not found>
Remote: ID: cbd535cad31a66c92e4f67053b7a9f2ea1f3eff7564ed454a04d58b86183bc0e rebuilt index: (6, 21806858) committed index: <not found>
Remote: ID: 48af4fc77a5e52edf2dcd8ad77932801282d6406d1cc677dff79fadc1ead0288 rebuilt index: (6, 21776672) committed index: <not found>
Remote: ID: 881d01719de45efceec817cdb0895aa60090adc46346aaf102a20184e6510170 rebuilt index: (6, 21797359) committed index: <not found>
Remote: ID: 15a8b868c57467981669410802ba60bf965225800d88088fe537df9c2514867b rebuilt index: (6, 21803120) committed index: <not found>
Remote: ID: 612da45e9df79666240db28a0e87fcc534302a8f1b4cf894c95962d9217c4eae rebuilt index: (6, 21806670) committed index: <not found>
Remote: ID: 6c6b463ca8af8f6de9b94854ed4fcba3b8c0c8f54eeadc63bec73a35c1f3c34b rebuilt index: (6, 21789677) committed index: <not found>
Remote: ID: 924fc87d2ef8e7171fd5117c070a63dc73864a4b1f282e6fd528f607ab9024ef rebuilt index: (6, 21781217) committed index: <not found>
Remote: ID: d6e6feef6bd962b0766874bb7c91b88b74f6e8a891bfb183e1932de2063b5fce rebuilt index: (6, 21813645) committed index: <not found>
Remote: ID: f5bf430747f13a0e54017c96c5f7da3b05f3be42caffcb9a846a83f53c9e197b rebuilt index: (6, 21818673) committed index: <not found>
Remote: ID: c7ce9ddbaa5c60bbc098f8bd141eb1a0a7075cd7b81b74fd312e6ad735a743d9 rebuilt index: (6, 21810845) committed index: <not found>
Remote: ID: f1bc43284734130f260558ea54e73b152919ef6846f548331e4083b09d199dde rebuilt index: (6, 21803726) committed index: <not found>
Remote: ID: 1ada2eeca46a2728933a02b370973e4bceceb38c5cb4a456f0f75b5ae485918c rebuilt index: (6, 21804751) committed index: <not found>
Remote: ID: 22f25b077b3a83940d4966b0a5e66755d8ed0a0c1c194323091c52439311ae83 rebuilt index: (6, 21817261) committed index: <not found>
Remote: ID: 6400af83ed831249c091c746ea8a6b617bf92fccb962eb77ab9b40778f730c84 rebuilt index: (6, 21779214) committed index: <not found>
Remote: ID: c529bbcbec00ca8871192b1b279a1c901d87af9e477bcd51e32a406228cff27a rebuilt index: (6, 21783511) committed index: <not found>
Remote: ID: b181fa9bd13c62ddeeb74d2aac4c60ad34a78931cedf42595553607b475c9232 rebuilt index: (6, 21816997) committed index: <not found>
Remote: ID: ee36d51a621b55e89108dd936b07af6bfe9f7fc4e2915ccc4f635ffde62180bb rebuilt index: (6, 21805533) committed index: <not found>
Remote: ID: 986199f1c61bb4cf15820a61d0620d3367511a1abb99d06eb96ebd99cc7df4b5 rebuilt index: (6, 21771790) committed index: <not found>
Remote: ID: 13b6f6569e6668f25ed5c3c2fd912d2a0de370dc31895253c68b4efb24d4a508 rebuilt index: (6, 21800197) committed index: <not found>
Remote: ID: 727a253e708aaba4d00912de021b859d90ac7f8568d5f6390c76278e5bf7ccd6 rebuilt index: (6, 21777280) committed index: <not found>
Remote: ID: 02e45d45c93274827d8632ff274dcf989bbf56d25c04d77259400bd6032d5938 rebuilt index: (6, 21820915) committed index: <not found>
Remote: ID: 4e57b24ccca5ccf46f9086cd0dc14662032db30f7dec8cc90dcf836df3d82b89 rebuilt index: (6, 21791845) committed index: <not found>
Remote: Finished full repository check, errors found.
RemoteRepository: 177 B bytes sent, 28.40 kB bytes received, 3 messages sent
terminating with warning status, rc 1
```

I was able to repair the errors using check --repair, but in at least one repo the next check, one week later, showed errors again.

Both clients and server currently run borg 1.2.0.

I just upgraded the server to 1.2.1 to use borg debug dump-repo-objs and successfully extracted a few files, which all look fine.

ThomasWaldmann commented 2 years ago

@davidmehren running a full check --repair, not just --repository-only would be interesting.

What the above log shows is that there are objects in the segment files which were not in the committed index. But even more interesting would be whether these objects are actually referenced by archives (then this would be a relatively severe error) or not (orphaned / superseded objects), which would be rather harmless (but still unpretty).

davidmehren commented 2 years ago

I ran check --repair directly on the central server, on the same repo as in the error message above:

# borg --debug check --repair .
using builtin fallback logging configuration                                                                                    
33 self tests completed in 1.13 seconds                                                                                         
This is a potentially dangerous function.                                                                                       
check --repair might lead to data loss (for kinds of corruption it is not
capable of dealing with). BE VERY CAREFUL!                                                                                      

Type 'YES' if you understand this and want to continue: YES
Starting repository check      
Verified integrity of /mnt/borgpool/sharelatex2/backup/index.2376
Read committed index of transaction 2376
Cleaned up 0 uncommitted segment files (== everything after segment 2376).
Segment transaction is    2376                             
Determined transaction is 2376
Found 45 segments                                                                                                               
checking segment file /mnt/borgpool/sharelatex2/backup/data/0/6...
checking segment file /mnt/borgpool/sharelatex2/backup/data/0/7...
checking segment file /mnt/borgpool/sharelatex2/backup/data/0/8...
checking segment file /mnt/borgpool/sharelatex2/backup/data/0/40...
checking segment file /mnt/borgpool/sharelatex2/backup/data/1/1573...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2181...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2182...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2183...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2184...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2185...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2186...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2187...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2263...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2275...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2277...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2281...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2285...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2289...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2293...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2297...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2301...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2313...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2314...     
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2316...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2320...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2324...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2328...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2332...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2336...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2340...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2344...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2345...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2349...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2353...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2357...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2366...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2368...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2369...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2370...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2371...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2372...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2373...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2374...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2375...
checking segment file /mnt/borgpool/sharelatex2/backup/data/2/2376...
finished segment check at segment 2376
Starting repository index check
Finished full repository check, no problems found.
Starting archive consistency check...
Verified integrity of /mnt/borgpool/sharelatex2/backup/index.2376
Enter passphrase for key /mnt/borgpool/sharelatex2/backup: 
TAM-verified manifest
Manifest is TAM verified and says TAM is required, updating security database...
Analyzing archive sharelatex2-2022-05-22T05:19:44.740423 (1/20)
Analyzing archive sharelatex2-2022-05-23T05:34:52.894630 (2/20)
Analyzing archive sharelatex2-2022-05-24T05:51:01.223206 (3/20)
Analyzing archive sharelatex2-2022-05-25T05:39:18.517606 (4/20)
Analyzing archive sharelatex2-2022-05-26T05:40:29.177019 (5/20)
Analyzing archive sharelatex2-2022-05-27T05:39:12.190087 (6/20)
Analyzing archive sharelatex2-2022-05-28T05:37:59.986794 (7/20)
Analyzing archive sharelatex2-2022-05-29T05:27:02.741374 (8/20)
Analyzing archive sharelatex2-2022-05-30T05:19:20.174900 (9/20)
Analyzing archive sharelatex2-2022-05-31T05:26:10.903314 (10/20) 
Analyzing archive sharelatex2-2022-06-01T05:17:07.432204 (11/20) 
Analyzing archive sharelatex2-2022-06-02T05:32:13.477160 (12/20) 
Analyzing archive sharelatex2-2022-06-03T05:48:06.501439 (13/20) 
Analyzing archive sharelatex2-2022-06-04T05:52:45.158290 (14/20) 
Analyzing archive sharelatex2-2022-06-05T05:47:08.629424 (15/20) 
Analyzing archive sharelatex2-2022-06-08T05:19:29.677737 (16/20) 
Analyzing archive sharelatex2-2022-06-09T05:06:04.154792 (17/20) 
Analyzing archive sharelatex2-2022-06-10T05:03:51.264720 (18/20) 
Analyzing archive sharelatex2-2022-06-11T05:01:53.968764 (19/20) 
Analyzing archive sharelatex2-2022-06-12T05:52:51.009114 (20/20) 
48 orphaned objects found!
Deleting 48 orphaned and 20 superseded objects...
Cleaned up 0 uncommitted segment files (== everything after segment 2376).
Verified integrity of /mnt/borgpool/sharelatex2/backup/hints.2376
Finished deleting orphaned/superseded objects.
Writing Manifest.
Committing repo.
check_free_space: required bytes 566240528, free bytes 1671414366208
Archive consistency check complete, problems found.

ams-tschoening commented 2 years ago

[...]or not (orphaned / superseded objects), which would be rather harmless (but still unpretty).

I've run into the issue every single day over the last week for ONE of the two repos mentioned before. The last two days the issue didn't occur. But whenever I repaired, the logs ALWAYS mentioned some orphaned and superseded objects; I can't remember any other result besides check saying everything is OK.

jdchristensen commented 2 years ago

My latest scheduled borg check runs just happened, and again there is an error with the system.borg repo on "system 1" which runs Ubuntu 20.04:

Starting repository index check
Index object count mismatch.
committed index: 2489296 objects
rebuilt index:   2489297 objects
ID: ca44bf70d3d28eafbbae6ccf87841d2af6a13fd06617db185fd5ef3669b38b0b
rebuilt index: (26830, 446211143) committed index: <not found>

That segment file has an mtime of Feb 17, 2022, and I have run checks since then without errors. The chunk at that offset is some ASCII text:


11 updates can be applied immediately.
2 of these updates are standard security updates.
To see these additional updates run: apt list --upgradable

If anyone knows where I would expect to find that in the filesystem, let me know and I'll look for it.

The other repos on system 1 and system 2 are fine.
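
For anyone who wants to peek at such a chunk themselves, a rough sketch, assuming the borg 1.x segment layout (a 41-byte PUT header: 4-byte CRC, 4-byte size, 1-byte tag, 32-byte chunk ID, then the data) and a repo whose data is unencrypted and readable without decompression; paths are placeholders and the offset is the one reported by check:

```
# locate the segment file (its data/<N> subdirectory depends on the repo's segments_per_dir):
SEGMENT=$(find /Backups/borg/system.borg/data -name 26830)
# skip the assumed 41-byte PUT header and look at the payload:
dd if="$SEGMENT" bs=1 skip=$((446211143 + 41)) count=256 2>/dev/null | strings | head
```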

ThomasWaldmann commented 2 years ago

@jdchristensen That could be output from apt update or something related. Maybe /var/log/apt/...?

jdchristensen commented 2 years ago

Ah, this kind of thing is contained in /var/lib/update-notifier/updates-available. It's different from what is currently in that file on any machine, so it was probably pruned at some point.

jdchristensen commented 2 years ago

I ran a repair on this repo and got:

1 orphaned objects found!
Deleting 1 orphaned and 291 superseded objects...

Before doing the repair, I saved a hardlinked copy of the faulty repo in case it can help with debugging. As happened above, the config file for the hardlinked copy is now corrupt! I created a new issue for this: #6768

jdchristensen commented 2 years ago

BTW, I still have the hardlinked copy of that repo from the first time I got corruption. In that copy, this chunk would probably have been present, and was not an orphan according to the check run then. Is there some debugging I could do using that old copy?

ThomasWaldmann commented 2 years ago

Well, if that is a more severe error, logs about that chunk and its content would be interesting.

jdchristensen commented 2 years ago

It was the same error (the one that started this issue). To summarize: I got two orphaned chunks around May 9. I saved a copy of the repo as "copy 1". I repaired the repo (but not the copy). On June 12, I got another orphaned chunk in the repo. I saved a copy of the repo as "copy 2". I repaired the repo (but not the copy). Now "copy 1" (from early May) might have some information about the chunk that was detected as orphaned on June 12. For example, can I use "copy 1" to find out which archives referenced the chunk at that time?

jdchristensen commented 2 years ago

I'm also trying to reproduce this issue using the script below, which tries to mimic my usage of borg (no encryption, several "hosts" doing independent pruning, occasional "file changed" errors, etc.). So far no errors after running it for an hour or two on three machines. Any suggestions for tweaks to the script that might make it more likely to reveal a bug?

#!/bin/sh

# Usage:
#   cd /tmp [or maybe a tmpfs, for speed]
#   /path/to/script | tee script.out
# It will create a subfolder borgtest of the current directory
# and do all of its work in there.

# Abort if any command exits with a nonzero exit code:
set -e

borg -V

TESTDIR=borgtest
rm -rf ${TESTDIR}
CONTENT=${TESTDIR}/content
mkdir -p ${CONTENT}

export BORG_REPO=${TESTDIR}/test.repo
borg init -e none -v

# Create some unchanging files:
for k in `seq 1 100`; do echo ${k} > ${CONTENT}/static.${k}; done

while true; do
    # Change a few files that will be the same for all hosts:
    for k in `seq 1 4`; do
        dd if=/dev/urandom of=${CONTENT}/randomshared.${k} count=1 bs=10000 2> /dev/null
    done
    # For each "host", change some files, create an archive, and do pruning:
    for i in `seq 1 4`; do
        for k in `seq 1 5`; do
            dd if=/dev/urandom of=${CONTENT}/random.${k} count=1 bs=10000 2> /dev/null
        done
        # Create a larger file that will change during the backup.
        # The size and the number of background writes may need to be adjusted.
        dd if=/dev/urandom of=${CONTENT}/changing count=1 bs=1M 2> /dev/null
        (for j in `seq 5000`; do date >> ${CONTENT}/changing; done &)
        # Create a file that is unreadable:
        date > ${CONTENT}/date.out
        chmod a-r ${CONTENT}/date.out
        # Do the backup; we put "|| true" at the end since we expect two warnings:
        borg create -v ::host${i}-{now} ${CONTENT} || true
        # Prune independently for each host using -P:
        borg prune -v --list --keep-last 2 --keep-minutely 10 --keep-hourly 10 -P host${i}
    done
    borg compact -v
    borg check -v
    echo
done

ThomasWaldmann commented 2 years ago

There are borg debug dump-archive and borg debug dump-archive-items. I'm not sure whether they output all the needed chunk references in readable form.

jdchristensen commented 2 years ago

borg debug dump-archive asks for a PATH argument, but I don't know what to supply there. borg debug dump-archive-items produced a bunch of binary files that I couldn't figure out.

ThomasWaldmann commented 2 years ago

For dump-archive, PATH is the to-be-created output file.
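
So usage looks roughly like this (repo, archive name and output path are placeholders):

```
borg debug dump-archive /path/to/repo::archive-name /tmp/archive-dump.json
```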

Yeah, I've already suspected that dump-archive-items just writes the decrypted, decompressed binary items to files (but does not decode them to readable output).

jdchristensen commented 2 years ago

dump-archive does give data in a usable json format, but it's slow and large, and I'd have to run through all archives in order to try to find that chunk...
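
A sketch of that brute-force loop, assuming the dumped JSON contains chunk IDs as hex strings so a plain grep can spot a reference (paths are placeholders; the ID is the one reported above):

```
REPO=/path/to/copy1
for a in $(borg list --short "$REPO"); do
    borg debug dump-archive "$REPO::$a" "/tmp/$a.json"
    grep -q ca44bf70d3d28eafbbae6ccf87841d2af6a13fd06617db185fd5ef3669b38b0b "/tmp/$a.json" \
        && echo "chunk referenced by $a"
done
```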

jdchristensen commented 2 years ago

I searched for offset 446211143 in the archives made in the 2.5 months leading up to Feb 17, 2022 (the mtime of the segment file), and didn't find it. So I suspect that it had already been pruned by the time the May 9 corruption happened and I saved a copy of the repo. It did not show up as an orphan at that time, so it appears that borg correctly knew that it was deleted. Something between May 9 and June 12 seems to have caused it to become an orphan. Puzzling.

I've also been running the stress testing script, now using client/server mode via ssh to localhost. Still no orphans being produced.

ams-tschoening commented 2 years ago

Something which I find strange as well. The following is from yesterday after some days of silence:

Remote: Starting repository index check
Remote: Index object count mismatch.
Remote: committed index: 930105 objects
Remote: rebuilt index:   930107 objects
Remote: ID: 705a9f05d1e7a8ed3fa719de2455e973b4c5401c8ee9da713c91971ac14df9e6 rebuilt index: (38206, 493217629) committed index: <not found>
Remote: ID: a42aef79f2c8ea7d1becf62dc84c7879bf94b4b32801944f15a87d752c7b9bb7 rebuilt index: (5678, 332609646) committed index: <not found>
Remote: Finished full repository check, errors found.

The following is from today:

Remote: Starting repository index check
Remote: Index object count mismatch.
Remote: committed index: 930428 objects
Remote: rebuilt index:   930431 objects
Remote: ID: dfba4c044abdd4fe58135983fea9faa2731f7134c3b9f7753dc517342ab53452 rebuilt index: (85234, 387659010) committed index: <not found>
Remote: ID: 705a9f05d1e7a8ed3fa719de2455e973b4c5401c8ee9da713c91971ac14df9e6 rebuilt index: (38206, 493217629) committed index: <not found>
Remote: ID: bc45831ca445b420bf652246c9e67d2f4406ba726a7c6493253d2fe813b840a4 rebuilt index: (82632, 242255975) committed index: <not found>
Remote: Finished full repository check, errors found.

The first ID from yesterday and the second one from today are the same, while the others are not, even though I run the same checks every day and did not repair yesterday; otherwise the repeating ID most likely wouldn't still be there.

Does that mean anything? I would have expected the list of IDs to simply grow, with the same IDs repeating for each found problem and only new ones being added.

matmo commented 2 years ago

Maybe another data point: since upgrading to 1.2 (Arch Linux), one repository (rootfs on btrfs, backed up to a local repo on zfs) consistently throws index count errors every few days, while the other (data dir on zfs, backed up to a local repo on zfs) has never shown any issues.

ThomasWaldmann commented 2 years ago

@ams-tschoening interesting. like that chunk got deleted from repo index but not from segment files (and thus reappeared in the rebuilt index).

@matmo also interesting, but could also be a coincidence.

ams-tschoening commented 2 years ago

@ams-tschoening interesting. like that chunk got deleted from repo index but not from segment files (and thus reappeared in the rebuilt index).

Two things I thought about: Might this have anything to do with the segment file cleanup threshold being increased to 25 %? I use ZFS for storing BORG repos and create additional snapshots daily, and the amount of changed data in the snapshots is far less with that value than in the past. Though the problem only started happening after Hetzner put BORG 1.2 in place on their SBOX, which already uses 10 % by default.

A second thought came to mind after trying to debug this myself and looking at some code. I found the following lengthy explanation:

# Now we crash. But only segment 2 gets deleted, while segment 1 is still around. Now key 1
# is suddenly undeleted (because the delete in segment 2 is now missing).
# Again, note the requirement here. We delete these in the correct order that this doesn't happen,
# and only if the FS materialization of these deletes is reordered or parts dropped this can happen.
# In this case it doesn't cause outright corruption, 'just' an index count mismatch, which will be
# fixed by borg-check --repair.

https://github.com/borgbackup/borg/blob/a3a1974c52c4acbea6bde17cf5729f0a3668914f/src/borg/repository.py#L837

I can't remember BORG ever crashing, BUT some months ago I accidentally deleted some archives for one topic from one of the repos, when I really wanted to keep the oldest one. I couldn't easily re-create that archive and therefore decided to roll back the entire ZFS snapshot for that day, which is pretty much the situation described here:

“this is either an attack or unsafe” warning

https://borgbackup.readthedocs.io/en/latest/faq.html#this-is-either-an-attack-or-unsafe-warning

Afterwards I ran multiple checks, backups etc. successfully for additional weeks/months until the warnings of this issue appeared, so I didn't expect the two to be related. But looking at the above comment, with my rollback I might have triggered exactly that state, didn't I? Some chunk might have been properly deleted in a newly created segment file which I then removed with my rollback. No actual bug in BORG at all then...

OTOH, what about all the other users seeing this? Did you all do a similar rollback with your repos at some point? :-) Additionally, I would expect these kinds of problems to have been noticed before version 1.2 already, as the code seems to be older.

jdchristensen commented 2 years ago

The code comment highlighted by @ams-tschoening is from compact_segments(). I run that locally once a week and have never seen crashes then, so I don't think it is directly relevant to the errors I am seeing.

I also did not do any rollback or anything else unusual. However, I do many backups to these repos, and follow them with prune commands, and very occasionally there is a network error during the backup or prune which causes it to die. When an error occurs during a backup, I would expect the usual rollback procedure to leave things in a consistent state. But what if there is a crash during a prune operation? Again, since compaction is a separate step, I would expect that a crash during a prune shouldn't lead to the above, but maybe I'm wrong?

ThomasWaldmann commented 2 years ago

@ams-tschoening borg 1.2 got some changes at the repo segment file level to have less stuff moving around; that is likely why your zfs snapshots are smaller. I don't think the compaction percentage has anything to do with this issue.
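
(For completeness, the compaction threshold is an option on the compact command; a minimal example, with 10 % being the documented default in 1.2:)

```
borg compact --threshold 10 /path/to/repo
```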

FelixSchwarz commented 2 years ago

Maybe this helps narrow down where to look: I think I hit the problem yesterday on a CentOS 7 machine using borg 1.1.18, while the target is a Hetzner Storagebox which uses borg 1.2. I have never accessed the repo with a borg 1.2 client. Also, I did not experience network issues (since the last passing check), at least no issue which would have aborted a backup process.

I guess looking at the dumped chunk is not really interesting, but if it is, I'd need a way to decrypt the chunk first.

ThomasWaldmann commented 2 years ago

OK, so I guess we can assume it is the 1.2.x repository code, likely the compaction part.

jdchristensen commented 2 years ago

Now a different repository on a different host is showing errors during check. This repository was only accessed by one client, over a LAN. The remote host is running Ubuntu 20.04 and borg 1.2.0. Another data point that I think I didn't mention above is that I run the fat binaries on all machines.

jdchristensen commented 2 years ago

Oh, and one of the other repos that had errors previously has errors again. I've lost count of how many times my monthly checks have found errors, but it is happening quite often for me.

ThomasWaldmann commented 2 years ago

In https://github.com/borgbackup/borg/issues/6899#issuecomment-1198406243 I noticed something which would explain how an on-disk repo index can be a bit behind the (intermediate-)committed repo state if compact_segments is interrupted (write_index is only called once at the very end).

But does it explain the effects seen here?

ams-tschoening commented 2 years ago

[...]if compact_segments is interrupted[...]

I don't see that happening for me.

[...]in general, users are advised to run borg prune frequently (like daily or weekly)[...]

Am doing exactly that, daily.

jdchristensen commented 2 years ago

I run prune after every backup and do several backups per day into my repositories. Once a month, I do compact followed by check, and a surprising number of those checks show errors. But the logs show that the compact commands complete successfully. The reproducible example in #6781 also shows that the error is produced by compact, even though it completes successfully. Are there places in the compact code where (possibly temporary) assertions could be added to catch a bug? The poster in #6781 could run such code against their reproducible example.

ThomasWaldmann commented 2 years ago

@jdchristensen an idea for your reproducer script: maybe create a file that always has the same contents but the latest ctime/mtime (i.e., touched directly before invoking borg create), like the mdstat file you noticed above.

Files with the latest timestamp get special treatment by the files cache. If it is additionally a file that always has the same contents, that creates a rather special case (but I can't yet imagine how that could lead to problems).
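
Something like the following, added to the reproducer directly before the borg create call (file name is arbitrary):

```
# contents never change, but the timestamps are always the freshest at backup time:
echo "always the same contents" > ${CONTENT}/static-fresh
touch ${CONTENT}/static-fresh
```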

ThomasWaldmann commented 2 years ago

btw, yesterday I took a large repo and ran a lot of delete / prune / compact / check and also some create ops on it, but failed to find any issue (using 1.2-maint code). I used zfs snapshots to keep the state before/after compact, so that in case a problem shows up, I have both states for further analysis.

ThomasWaldmann commented 2 years ago

@jdchristensen can you use the code from #6918 and run borg compact with --debug?

jdchristensen commented 2 years ago

@ThomasWaldmann I have updated my stress testing script to sometimes touch unchanged files at the end. I also changed the repo and chunk parameters in a way that should cause much more churn during compaction. I'm currently running the new scripts in tmpfs directories on two machines, but haven't found any borg check errors yet. New script attached. borg-stress.txt

I will try the new debug code as soon as I get a chance.

jdchristensen commented 2 years ago

I ran my borg-stress script for 1.5 days on a tmpfs. It did 70000 create+prune commands and 1765 compact+check commands without any errors. There are over 28000 numbered folders in the data/ folder (many empty). The main difference between this test setup and the repos that have had problems is that the problem repos were created with older versions of borg, probably 1.1.0b3 in 2017. Could that somehow affect things?

ThomasWaldmann commented 2 years ago

Hard to say before knowing the actual root cause. The compaction code has one relatively complex condition (where I added that debug logging recently); the rest of it is rather simple.

The entry types in the segment files did not change before borg2; it's just PUT, DEL and COMMIT.

borg 1.1 always compacted at the end of repo-writing commands (and did not need to persist any compaction-related information to disk), while 1.2 compacts separately (and needs to persist this information in the hints file).
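
In practice that means a 1.2 workflow needs an explicit compact step; a minimal sketch (repo path is a placeholder):

```
borg prune -v --list --keep-daily 7 /path/to/repo   # in 1.2 this only marks data as deleted
borg compact /path/to/repo                          # this actually frees the space
```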

jdchristensen commented 2 years ago

I ran borg compact --debug on two repos and got several of the "dropping DEL for id" messages. But the following borg check runs didn't show any errors this time. I'll try again in the future.
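
For anyone else capturing this, a small sketch (log path is a placeholder; the grep pattern is the message text quoted above):

```
borg compact --debug /path/to/repo 2>&1 | tee /tmp/compact-debug.log
grep -c 'dropping DEL' /tmp/compact-debug.log   # how many DELs were dropped
borg check -v /path/to/repo
```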

ThomasWaldmann commented 2 years ago

Moving this to 1.2.3 milestone to get 1.2.2 out quicker with other important fixes. I'll merge the debug log PR so that all 1.2.2 users can easily use that.

jdchristensen commented 1 year ago

I adapted my stress testing script to use borg2, and I can now consistently reproduce the orphans problem. A version of the script that excludes some unnecessary parts is attached; instructions are at the top. I have run it 15 times: 14 times it showed orphans during the first run of borg check (after four creates and one compact, but no prunes), and once it happened during the second run of borg check. (Different random data and a different chunking seed can affect it, I guess.) It only reproduces if some of those creates are killed, resulting in checkpoint archives. Sample output is attached.

Hopefully this helps find the bug! I have no idea why my script successfully reproduces the problem for borg2, even though it never showed a problem with borg1, even after many hours of running.

Edit: the errors my script found are fixed by #7305, and were a borg2-specific bug. So currently my stress testing scripts can't reproduce any of the issues people are noticing with borg compact.

borg2-stress-trim.txt borg2-stress-output.txt
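
For context, the checkpoint archives mentioned above are the ones an interrupted create leaves behind; with 1.x syntax they can be spotted by their name suffix (repo path is a placeholder):

```
borg list --short /path/to/repo | grep '\.checkpoint'
```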

jdchristensen commented 1 year ago

And I just noticed that the problem happens even without the borg compact line. So with no pruning and no compacting, some orphans arise. Is it just because borg is killed during a create operation?

ThomasWaldmann commented 1 year ago

Maybe. IIRC we have a ticket about borg possibly creating orphans when the content chunks for a file have been written but the archive item for that file (which is written after all content chunks) has not.

Thanks for the script, I'll try it!

jdchristensen commented 1 year ago

There is evidence from others, at least with borg 1.2, that a borg repo can pass borg check without errors, but then fail after borg compact is run on it. So my script seems to be revealing a different problem than the one in this issue. When borg is killed with the default signal, does it try to clean things up, e.g. by writing a checkpoint archive? What happens if it is part-way through processing a large file?