bd830dbd7432 commented 5 years ago

Have you checked borgbackup docs, FAQ, and open Github issues?

Yes.

Is this a BUG / ISSUE report or a QUESTION?

Issue.

System information. For client/server mode post info for both machines.

Your borg version (borg -V).

Currently 1.1.10 on client and server, using the official binary, but the issue persists all the way from 1.1.4. Always used official binaries from this page.

Operating system (distribution) and version.

The issues persists from Debian 8 (Jessie) to Debian 10 (Buster), on client and/or server.

Hardware / network configuration, and filesystems used.

Various Dell PowerEdges, with ECC RAM, disks in RAID-10 (mdadm), gigabit network, ext4 filesystems.

How much data is handled by borg?

We backup ~1TB with Borg on total, but using several smaller repositories by VM/data type.

Full borg commandline that lead to the problem (leave away excludes and passwords)

borg create -vsx --files-cache=mtime,size --show-rc --one-file-system --list --filter AME -C zstd borg@borg1.xxxxxxxxx:postgresql::{now} .

Describe the problem you're observing.

We use Borg for backups for almost two years. Most data are backup-ed from an LVM snapshot. Most are backup-ed every hour, but some data (ex: databases) are backed up every 10 minutes. Some VMs (qcow2 disks) are backup-ed up daily (they are shutdown off-hours for the backup).

With this setup, since we started using Borg, we usually experience 1 or 2 problems per month with Borg repository corruption. In every case, we could fix the problem with borg check --repair.

Every backup is done using SSH.

We had this issue backing-up from VMs and physical servers.

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

We could never reproduce the issue.

We checked every disk from every server with a complete S.M.A.R.T. test, with no error detected.

We checked RAM with memtest86 for hours and nothing were found.

We checked that every backup BEFORE the backup job that broke finished successfully.

Destination server never got short on RAM or disk space.

Include any warning/errors/backtraces from the system logs

 + borg create -vsx --files-cache=mtime,size --show-rc --one-file-system --list --filter AME -C zstd borg@borg1.xxxxxxxxxxx:postgresql::{now} .
Creating archive at "borg@borg1.xxxxxxxxxxx:postgresql::{now}"
M 9.6/main/base/141585/171760
M 9.6/main/base/16507/156896
M 9.6/main/base/16507/1259
M 9.6/main/base/16507/2619
M 9.6/main/base/16507/16551
M 9.6/main/base/16507/16577
M 9.6/main/base/16507/16582
M 9.6/main/base/16507/16606
M 9.6/main/base/16507/16731
M 9.6/main/base/16507/16606_fsm
M 9.6/main/base/16507/17268
M 9.6/main/base/16507/17274
M 9.6/main/base/16507/17326
M 9.6/main/base/16507/17327
M 9.6/main/base/16507/16606_vm
M 9.6/main/pg_clog/0003
A 9.6/main/pg_xlog/000000010000001200000078
M 9.6/main/global/pg_control
Traceback (most recent call last):

  File "borg/remote.py", line 247, in serve

  File "borg/repository.py", line 442, in commit

  File "borg/repository.py", line 789, in compact_segments

AssertionError: Corrupted segment reference count - corrupted index or hints

Borg server: Platform: Linux vm-borg-1 4.9.0-9-amd64 #1 SMP Debian 4.9.168-1+deb9u3 (2019-06-16) x86_64
Borg server: Linux: debian 9.9
Borg server: Borg: 1.1.10  Python: CPython 3.5.7 msgpack: 0.5.6
Borg server: PID: 15222  CWD: /srv/borg/root@vm-postgres-1
Borg server: sys.argv: ['borg', 'serve', '--restrict-to-path', '/srv/borg/root@vm-postgres-1']
Borg server: SSH_ORIGINAL_COMMAND: 'borg serve --umask=077 --info'
Platform: Linux vm-postgres-1 4.9.0-9-amd64 #1 SMP Debian 4.9.168-1+deb9u3 (2019-06-16) x86_64
Linux: debian 9.9
Borg: 1.1.10  Python: CPython 3.5.7 msgpack: 0.5.6
PID: 3953  CWD: /mnt
sys.argv: ['borg', 'create', '-vsx', '--files-cache=mtime,size', '--show-rc', '--one-file-system', '--list', '--filter', 'AME', '-C', 'zstd', 'borg@borg1.xxxxxxxxxxx:postgresql::{now}', '.']
SSH_ORIGINAL_COMMAND: None

terminating with error status, rc 2

ThomasWaldmann commented 5 years ago

You said you could fix the issue by running borg check --repair - did it output any additional information / warnings / error messages (when using -v)?

ThomasWaldmann commented 5 years ago

Some other stuff I noticed (not necessarily related to the issue, but just wondering):

-x and --one-file-system is the same, you only need it once.
why do you use --files-cache=mtime,size? if it is ext4, the default should work ok.

bd830dbd7432 commented 5 years ago

Greetings!

Some other stuff I noticed (not necessarily related to the issue, but just wondering):
* `-x` and `--one-file-system` is the same, you only need it once.

This is redundant, you're right.

* why do you use `--files-cache=mtime,size`? if it is ext4, the default should work ok.

Because when using LVM snapshots, files seems (or seemed) to be re-scanned by Borg, and files not changed were printed by --list --filter AME. This happened on ext4 and XFS on the past, and now we standardize on using --files-cache=mtime,size.

bd830dbd7432 commented 5 years ago

You said you could fix the issue by running borg check --repair - did it output any additional information / warnings / error messages (when using -v)?

On 2019-02-06 we had a corruption on borg create:

M 9.6/main/base/16507/17338
M 9.6/main/pg_clog/0001
A 9.6/main/pg_xlog/000000010000000800000034
M 9.6/main/global/pg_control
M 9.6/main/pg_subtrans/0018
Traceback (most recent call last):

  File "borg/remote.py", line 248, in serve

  File "borg/repository.py", line 430, in commit

  File "borg/repository.py", line 777, in compact_segments

AssertionError: Corrupted segment reference count - corrupted index or hints

Borg server: Platform: Linux vm-borg-1 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64
Borg server: Linux: debian 9.7
Borg server: Borg: 1.1.8  Python: CPython 3.5.6
Borg server: PID: 23309  CWD: /srv/borg/root@vm-postgres-1
Borg server: sys.argv: ['borg', 'serve', '--restrict-to-path', '/srv/borg/root@vm-postgres-1']
Borg server: SSH_ORIGINAL_COMMAND: 'borg serve --umask=077 --info'
Platform: Linux vm-postgres-1 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64
Linux: debian 9.7
Borg: 1.1.8  Python: CPython 3.5.6
PID: 25777  CWD: /mnt
sys.argv: ['borg', 'create', '-vsx', '--files-cache=mtime,size', '--show-rc', '--one-file-system', '--list', '--filter', 'AME', '-C', 'zstd', 'borg@borg1.xxxxxxxxxxxxxxxx:postgresql::{now}', '.']
SSH_ORIGINAL_COMMAND: None

terminating with error status, rc 2

Result from borg check -vp:

Starting repository check
Starting repository index check                                                                                      
Index mismatch for key b'+\xd7dq\x8c\xf2\xf59)\xdc\xc1\xca\x98\x05\xb9\x08\x9f#5Ji$\xa3\x9c\xc0]\x81\xacK\x89\x10\xd0'. (45623, 21329774) != (-1, -1)
Completed repository check, errors found.

Result from borg check -vp --repair:

'check --repair' is an experimental feature that might result in data loss.
Type 'YES' if you understand this and want to continue: YES
Starting repository check
Starting repository index check                                                                                      
Completed repository check, no problems found.                                                                        
Starting archive consistency check...
Analyzing archive 2019-01-06T23:57:03 (1/227)
Analyzing archive 2019-01-13T23:57:04 (2/227)
(.....) <- Only repeating lines with archive names
Analyzing archive 2019-01-20T23:57:04 (3/227)
Analyzing archive 2019-01-27T23:57:04 (4/227)
Analyzing archive 2019-02-06T13:37:05 (226/227)
Analyzing archive 2019-02-06T13:47:05 (227/227)
Writing Manifest.
Committing repo (may take a while, due to compact_segments)...
Finished committing repo.                                                                                            
Archive consistency check complete, no problems found.

After repairing, we ran borg check -vp again, but it didn't show any error this time (don't have the output saved).

ThomasWaldmann commented 5 years ago

Index mismatch for key b'...'. (45623, 21329774) != (-1, -1)

(45623, 21329774) / left is (segment, offset) of rebuilt index (in memory).

(-1, -1) / right is (segment, offset) of on-disk index - (-1, -1) means no entry for that key.

so the committed repo contents had a chunk with that key (id), but it was not in the on-disk index, which is strange.

note: with --repair, these warnings are not emitted, but the in-memory index is written to disk.

ThomasWaldmann commented 5 years ago

When looking at the 1.1.8 code, this message is only emitted if the on-disk index size and the in-memory rebuilt index size is the same.

So, if one key is not present on disk, that means that there is another key present on disk. The current code just does not find out which key that is.

I'll change that code to display both in-memory and on-disk key, so that e.g. a corruption of a few bits can be seen, see #4831.

ThomasWaldmann commented 5 years ago

You said you did a complete SMART test, does that mean a SMART long test (smartctl -t long)?

bd830dbd7432 commented 5 years ago

You said you did a complete SMART test, does that mean a SMART long test (smartctl -t long)?

Yes, using smartctl -t long.

In every disk:

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     14507         -

ThomasWaldmann commented 5 years ago

@bd830dbd7432 I once had the embarrassing experience of an usb hub silently corrupting the data on their way from the harddisk to the PC's memory.

borg noticed it due to its 256bit MACs, but there was no indication at all in the kernel log. Disks and memory were also ok, like in your case.

It was quite rare corruption, maybe 1bit per 100GB..1TB. I found the culprit by removing the USB hub, then the problem went away. Inserted it again, problem reappeared. Did this multiple times and it always behaved like that.

You don't have an USB hub in your data path I guess, but it could be also something else.

Another place of failure could be also multi-bit RAM errors (too much for ECC to cope with) or some bit flip that is outside an ECC protected path.

Even CRCs or ECCs used by HW are sometimes too weak to detect corruption. Or some stuff is not protected by such at all.

bd830dbd7432 commented 5 years ago

Hello Thomas,

The disks are attached via SATA ports on the server.

The strange thing is that we never noticed any corruption on other backups, in this same server, from different methods (eg: tar). This server aggregates backups for other servers, in different VMs.

For example, we have SQL Server backups being written to this server (inside another VM), with ~100GB, and they are restored nightly on another server. We have other backups (via tar) being written (and already encrypted on source). Daily, we have jobs that copy those backups to other servers, and they are being decrypted correctly, without a sign of corruption.

I don't have much knowledge in Borg internals, but it seem to me that only indexes are being corrupted, aren't they? Data chunks do not seem to be lost, and after borg check --repair, backups are validated by borg check and can be restored. Given that fact that data is much bigger than indexes, what odds are that only indexes are getting corrupted? But I may be talking nonsense...

ThomasWaldmann commented 5 years ago

The difference is that content data is only going once over the cable/interface.

Indexes are loaded, modified, stored. And if you do backups very frequently (like every 10mins), indexes get touched often.

But that reminds me: we have cryptographic strength checksums over index data since a while, so on-disk / on-sata-cable corruption should be noticed. It does not help against in-memory corruption that happens before computing the checksum and storing the index though.

ThomasWaldmann commented 5 years ago

Could you locate the root cause of the problem?

bd830dbd7432 commented 5 years ago

Hello Thomas, thanks for your attention.

I ran memtester inside the VM that Borg stores the repositories, and on the physical server. No problem was detected.

Until now, I couldn't find any hardware issue, and there were no freezing/corruption problem on other VMs running on the same physical server, besides Borg repositories.

A complete Dell hardware diagnostic didn't find any problem either.

I'm just waiting for another corruption to happen, so I can run memtester inside the Borg VM, trying to pinpoint a bad RAM segment mapped to the VM.

ThomasWaldmann commented 5 years ago

Above you said you ran memtest86 - I usually use memtest86+, did you use the same? It requires to be booted, so the server would be offine during the test.

bd830dbd7432 commented 5 years ago

Yeah, that was Memtest86+ 5.01. Server was offline through the night for the test.

bd830dbd7432 commented 4 years ago

so the committed repo contents had a chunk with that key (id), but it was not in the on-disk index, which is strange.

Just a wild guess...

Our VMs and servers have a maintenance window of one hour at night, where patches are applied and they are rebooted if needed (not everything on the same day). This includes servers which Borg make backups and servers which Borg stores the backups.

On 2019-10-21, I changed all Borg backups and repositories' check jobs, so they do not run inside this maintenance window, avoiding a reboot while Borg is busy.

Usually, Borg --check is run everyday.

At 2019-11-04, two different Borg repositories were corrupted, but this problem didn't happen again until today. Output for one of them:

 + borg check -v --last 1 borg@borg1.xxxxxxxxxxxxxxx:vms
Remote: Starting repository check
Remote: Starting repository index check
Remote: Index mismatch for key b'b\x9a\xb3\x9c\xc7\x87\x0c9\x84\x90\x96\xd3p2\x12\xf4\x1e\x1c\xd5S\xf3#7\xc5\xb6\x1c5U(\xc0\x8d\x18'. (893, 25399177) != (-1, -1)
Remote: Completed repository check, errors found.

(I know this may be another problem, sorry for adding on this issue)

Could this sort of corruption happen while Borg is backing up, pruning, and the process is aborted for a reboot? And this problem be detected later, on other backups and pruning jobs?

Whenever we had corruptions, we always check if Borg were killed by some problem (error, reboot, out of memory, etc), but I don't have a clear evidence of this.

ThomasWaldmann commented 4 years ago

Theoretically, reboot, kills or connection breakdowns should not affect repo or indexes. If the repo or cache is not committed, it should just roll back to last valid (committed) state.

bd830dbd7432 commented 4 years ago

Had another corruption this night:

+ export PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/bin:/bin
+ export BORG_UNKNOWN_UNENCRYPTED_REPO_ACCESS_IS_OK=yes
+ export BORG_RSH=ssh -o StrictHostKeyChecking=no
+ trap cleanup EXIT
+ borg init -e none borg@borg1.xxxxxxxxxxxxxx:clog
A repository already exists at borg@borg1.xxxxxxxxxxxxxx:clog.
+ true
+ lvcreate -n clog-snap -L 3g -s vg0/clog
  Using default stripesize 64.00 KiB.
  Logical volume "clog-snap" created.
+ mount -t ext4 -o ro /dev/vg0/clog-snap /mnt
+ cd /mnt
+ borg create -vsx --files-cache=mtime,size --show-rc --one-file-system --exclude lost+found --list --filter AME -C zstd borg@borg1.xxxxxxxxxxxxxx:clog::{now} .
Creating archive at "borg@borg1.xxxxxxxxxxxxxx:clog::{now}"
M clogd.db
A clogd.db-wal
M clogd.db-shm
Traceback (most recent call last):

  File "borg/remote.py", line 247, in serve

  File "borg/repository.py", line 442, in commit

  File "borg/repository.py", line 789, in compact_segments

AssertionError: Corrupted segment reference count - corrupted index or hints

Borg server: Platform: Linux vm-borg-1 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u1 (2019-09-20) x86_64
Borg server: Linux: debian 10.1
Borg server: Borg: 1.1.10  Python: CPython 3.5.7 msgpack: 0.5.6
Borg server: PID: 29775  CWD: /srv/borg/root@vm-clog-1
Borg server: sys.argv: ['borg', 'serve', '--restrict-to-path', '/srv/borg/root@vm-clog-1']
Borg server: SSH_ORIGINAL_COMMAND: 'borg serve --umask=077 --info'
Platform: Linux vm-clog-1 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64
Linux: debian 9.11
Borg: 1.1.10  Python: CPython 3.5.7 msgpack: 0.5.6
PID: 5503  CWD: /mnt
sys.argv: ['borg', 'create', '-vsx', '--files-cache=mtime,size', '--show-rc', '--one-file-system', '--exclude', 'lost+found', '--list', '--filter', 'AME', '-C', 'zstd', 'borg@borg1.xxxxxxxxxxxxxx:clog::{now}', '.']
SSH_ORIGINAL_COMMAND: None

terminating with error status, rc 2
+ cleanup
+ ORIG=2
+ cd /
+ umount /mnt
+ lvremove -f vg0/clog-snap
  Logical volume "clog-snap" successfully removed
+ exit 2

I made a copy of the repository before repairing.

Running borg check on this copy:

$ borg check -vp borg-clog/
Starting repository check
Starting repository index check                                                                                       
Index mismatch for key b'>\xa2\xb0\xdb\x95\xb1O\x100~)Sdsr\x17\xd2,\xb3\xc8{\xef\xc5\x9b\xd7$\xdbz\x0c\xceS\x85'. (21922, 66184646) != (-1, -1)
Completed repository check, errors found.

ThomasWaldmann commented 4 years ago

As the above is still the output of 1.1.10 without #4831, it doesn't give us more information.

bd830dbd7432 commented 4 years ago

c3430191e778c85 [xxxxxxx@xxx borg (1.1-maint)]$ /usr/local/bin/borg -V

borg 1.1.10

c3430191e778c85 [xxxxxxx@xxx borg (1.1-maint)]$ /usr/local/bin/borg check /srv/tmp/borg-clog/

Index mismatch for key b'>\xa2\xb0\xdb\x95\xb1O\x100~)Sdsr\x17\xd2,\xb3\xc8{\xef\xc5\x9b\xd7$\xdbz\x0c\xceS\x85'. (21922, 66184646) != (-1, -1)
Completed repository check, errors found.

c3430191e778c85 [xxxxxxx@xxx borg (1.1-maint)]$ /home/xxxxxxx/.local/share/virtualenvs/c3430191e778c85/bin/borg -V

borg 1.1.11.dev95+g21bbac79

c3430191e778c85 [xxxxxxx@xxx borg (1.1-maint)]$ /home/xxxxxxx/.local/share/virtualenvs/c3430191e778c85/bin/borg check /srv/tmp/borg-clog/

ID: 3ea2b0db95b14f10307e295364737217d22cb3c87befc59bd724db7a0cce5385 rebuilt index: (21922, 66184646) committed index: <not found>

ThomasWaldmann commented 4 years ago

Was that the complete output of 1.1.11.dev95?

bd830dbd7432 commented 4 years ago

Yes. Tested again:

65de1674b53dc75 [xxxxxxxx@xxxx borg (1.1-maint)]$ $HOME/.local/share/virtualenvs/65de1674b53dc75/bin/borg check /srv/tmp/borg-clog/
ID: 3ea2b0db95b14f10307e295364737217d22cb3c87befc59bd724db7a0cce5385 rebuilt index: (21922, 66184646) committed index: <not found>

With check --v:

65de1674b53dc75 [xxxxxxx@xxxxx borg (1.1-maint)]$ $HOME/.local/share/virtualenvs/65de1674b53dc75/bin/borg check -v /srv/tmp/borg-clog/
Starting repository check
Starting repository index check
Index object count match.
ID: 3ea2b0db95b14f10307e295364737217d22cb3c87befc59bd724db7a0cce5385 rebuilt index: (21922, 66184646) committed index: <not found>     
Completed repository check, no problems found.
Starting archive consistency check...
Analyzing archive 2019-11-03T22:12:04 (1/38)
Analyzing archive 2019-11-10T22:12:04 (2/38)
Analyzing archive 2019-11-17T22:12:04 (3/38)
Analyzing archive 2019-11-24T22:12:04 (4/38)
Analyzing archive 2019-11-28T22:12:05 (5/38)
Analyzing archive 2019-11-29T22:12:05 (6/38)
Analyzing archive 2019-11-30T22:12:04 (7/38)
Analyzing archive 2019-12-01T22:12:04 (8/38)
Analyzing archive 2019-12-02T22:12:04 (9/38)
Analyzing archive 2019-12-03T22:12:05 (10/38)
Analyzing archive 2019-12-04T22:12:05 (11/38)
Analyzing archive 2019-12-05T03:12:05 (12/38)
Analyzing archive 2019-12-05T04:12:05 (13/38)
Analyzing archive 2019-12-05T05:12:05 (14/38)
Analyzing archive 2019-12-05T06:12:05 (15/38)
Analyzing archive 2019-12-05T08:12:05 (16/38)
Analyzing archive 2019-12-05T09:12:05 (17/38)
Analyzing archive 2019-12-05T10:12:05 (18/38)
Analyzing archive 2019-12-05T11:12:05 (19/38)
Analyzing archive 2019-12-05T12:12:04 (20/38)
Analyzing archive 2019-12-05T13:12:05 (21/38)
Analyzing archive 2019-12-05T14:12:04 (22/38)
Analyzing archive 2019-12-05T15:12:05 (23/38)
Analyzing archive 2019-12-05T16:12:05 (24/38)
Analyzing archive 2019-12-05T17:12:04 (25/38)
Analyzing archive 2019-12-05T18:12:05 (26/38)
Analyzing archive 2019-12-05T19:12:05 (27/38)
Analyzing archive 2019-12-05T20:12:04 (28/38)
Analyzing archive 2019-12-05T21:12:04 (29/38)
Analyzing archive 2019-12-05T22:12:05 (30/38)
Analyzing archive 2019-12-06T01:12:05 (31/38)
Analyzing archive 2019-12-06T02:12:08 (32/38)
Analyzing archive 2019-12-06T03:12:05 (33/38)
Analyzing archive 2019-12-06T04:12:10 (34/38)
Analyzing archive 2019-12-06T05:12:05 (35/38)
Analyzing archive 2019-12-06T06:12:04 (36/38)
Analyzing archive 2019-12-06T08:12:05 (37/38)
Analyzing archive 2019-12-06T09:12:08 (38/38)
Archive consistency check complete, no problems found.

With check --v --debug (if it makes sense):

65de1674b53dc75 [xxxxxxx@xxx borg (1.1-maint)]$ $HOME/.local/share/virtualenvs/65de1674b53dc75/bin/borg check -v --debug /srv/tmp/borg-clog/
using builtin fallback logging configuration
35 self tests completed in 0.05 seconds
Starting repository check
Verified integrity of /srv/tmp/borg-clog/index.21927
Read committed index of transaction 21927
Segment transaction is    21927
Determined transaction is 21927
Found 8950 segments
Starting repository index check
Index object count match.
ID: 3ea2b0db95b14f10307e295364737217d22cb3c87befc59bd724db7a0cce5385 rebuilt index: (21922, 66184646) committed index: <not found>     
Completed repository check, no problems found.
Starting archive consistency check...
Verified integrity of /srv/tmp/borg-clog/index.21927
TAM-verified manifest
Analyzing archive 2019-11-03T22:12:04 (1/38)
Analyzing archive 2019-11-10T22:12:04 (2/38)
Analyzing archive 2019-11-17T22:12:04 (3/38)
Analyzing archive 2019-11-24T22:12:04 (4/38)
Analyzing archive 2019-11-28T22:12:05 (5/38)
Analyzing archive 2019-11-29T22:12:05 (6/38)
Analyzing archive 2019-11-30T22:12:04 (7/38)
Analyzing archive 2019-12-01T22:12:04 (8/38)
Analyzing archive 2019-12-02T22:12:04 (9/38)
Analyzing archive 2019-12-03T22:12:05 (10/38)
Analyzing archive 2019-12-04T22:12:05 (11/38)
Analyzing archive 2019-12-05T03:12:05 (12/38)
Analyzing archive 2019-12-05T04:12:05 (13/38)
Analyzing archive 2019-12-05T05:12:05 (14/38)
Analyzing archive 2019-12-05T06:12:05 (15/38)
Analyzing archive 2019-12-05T08:12:05 (16/38)
Analyzing archive 2019-12-05T09:12:05 (17/38)
Analyzing archive 2019-12-05T10:12:05 (18/38)
Analyzing archive 2019-12-05T11:12:05 (19/38)
Analyzing archive 2019-12-05T12:12:04 (20/38)
Analyzing archive 2019-12-05T13:12:05 (21/38)
Analyzing archive 2019-12-05T14:12:04 (22/38)
Analyzing archive 2019-12-05T15:12:05 (23/38)
Analyzing archive 2019-12-05T16:12:05 (24/38)
Analyzing archive 2019-12-05T17:12:04 (25/38)
Analyzing archive 2019-12-05T18:12:05 (26/38)
Analyzing archive 2019-12-05T19:12:05 (27/38)
Analyzing archive 2019-12-05T20:12:04 (28/38)
Analyzing archive 2019-12-05T21:12:04 (29/38)
Analyzing archive 2019-12-05T22:12:05 (30/38)
Analyzing archive 2019-12-06T01:12:05 (31/38)
Analyzing archive 2019-12-06T02:12:08 (32/38)
Analyzing archive 2019-12-06T03:12:05 (33/38)
Analyzing archive 2019-12-06T04:12:10 (34/38)
Analyzing archive 2019-12-06T05:12:05 (35/38)
Analyzing archive 2019-12-06T06:12:04 (36/38)
Analyzing archive 2019-12-06T08:12:05 (37/38)
Analyzing archive 2019-12-06T09:12:08 (38/38)
Archive consistency check complete, no problems found.

ThomasWaldmann commented 4 years ago

Hmm, looks like your repo has 1 item more than your index.

My suspicion was that 1 index key is not in the repo and 1 repo key is not in the index (that could happen by corruption of key bits). But that seems to not being the case.

borkd commented 4 years ago

@bd830dbd7432 - is borg server running inside a VM or on bare metal? Can you describe your stack and backup process a bit more precisely (from the physical disk(s) to the server, and clients; including any volume managers, filesystem(s), mount options, etc..)? Nothing in your original post indicates what steps did you take to set up your back-end so it at least detects and reports ongoing or spreading corruption before corrupted data can be seen by consumers (borg).

bd830dbd7432 commented 4 years ago

Hi, @borkd.

Clients

Everything running on Dell PowerEdges with ECC RAM, redundant PSU units.
Bare metal or VMs. Most are VMs, but some virtualization hosts use Borg to backup entire VMs.
Operating System: Debian 9 or Debian 10.
Storage: SAS or SATA disks, enterprise-grade. RAID-1 or RAID-10 using software RAID via mdadm.
Partitioning: bare metal uses simple partitions, and VMs uses LVM.
Filesystem: ext4, with default options.

Borg "server"

Host running on Dell PowerEdge with ECC RAM.
VM using KVM, managed by libvirt.
Operating System: Debian 10, but corruption was happening on Debian 9 too.
Storage: virtual disk using qcow2 format. with VirtIO SCSI interface, cache mode writeback, with discard unmap enabled (VM runs periodic fstrim to discard unused blocks).
Inside the VM, using LVM for volume management.
Borg volume using ext4, using default mount options.

Backup method

For virtualization hosts

Some VMs are backed-up nightly via scripts:

Stop VM.
Backup qcow2 image via Borg.
Run borg check and borg prune after every backup.
Start VM.

For VMs

Some VMs are backed-up every 10 minutes, others every hour:

Do a LVM snapshot.
Mount snapshot.
Backup data using Borg from snapshot.
Umount snapshot.
Delete snapshot.
Once per day, run borg check and borg prune.

Observations

We have observed (and registered) corruptions since 01/2018.
Borg is used since 12/2016.
We always used the same physical server to store backups, but we replaced the disks recently (last year, but after corruptions started).
This server stores other backups, like simple encrypted tarballs and SQL Server backups. Those are checked daily (they are decrypted and restored on other servers) and we never found problems.
This server does not have a history of crashing, freezing, strange error messages from kernel, etc, and its VMs never had strange issues.

Let me know if you need more information. Thanks!

ThomasWaldmann commented 4 years ago

"We have observed (and registered) corruptions since 01/2018."

Do you have logs/docs whether you changed the borg version in between it working correctly and not working correctly any more? If so, what version was it before/after?

bd830dbd7432 commented 4 years ago

We install/update Borg via Ansible, so we have the history on Git. Approximated dates (we roll updates spreading some days):

2019-05-23 - Borg 1.1.9 -> Borg 1.1.10 2019-03-09 - Borg 1.1.8 -> Borg 1.1.9 2019-01-29 - Borg 1.1.7 -> Borg 1.1.8 2018-08-28 - Borg 1.1.6 -> Borg 1.1.7 2018-06-18 - Borg 1.1.5 -> Borg 1.1.6 2018-04-24 - Borg 1.1.4 -> Borg 1.1.5 2018-01-02 - Borg 1.1.3 -> Borg 1.1.4 <-- first logged corruption after this 2017-12-18 - Borg 1.0.11 -> Borg 1.1.3 2017-09-27 - Borg 1.0.9 -> Borg 1.0.11 2017-01-23 - Borg 1.0.8 -> Borg 1.0.9 2016-11-29 - Started using Borg 1.0.8

borkd commented 4 years ago

It could be that one of the many possible components in the path is flaky or quietly failing. Mix of qcow2, writeback cache and ext4 on mdadm mirror while fast, does not offer much in terms of ensuring data integrity once luck runs out.

If you have a chance make the machine running borg server a standby redirecting backups to a fresh, clean repo hosted elsewhere (different power circuit if you can) and check if the errors follow. If you can, do setup the new repo on ZFS, and use writethrough or direct access.

ThomasWaldmann commented 4 years ago

Are we sure that the corruption began between 1.1.3 and 1.1.4 and not between 1.0.11 and 1.1.3?

bd830dbd7432 commented 4 years ago

This is the first registered corruption that we got. Borg version 1.1.4. I don't remember why Borg downloaded the chunks cache, maybe because before running check --repair (which solved the problem), I deleted ~/.cache/borg on the client.

+ nice ionice -c3 borg create -vsx --files-cache=mtime,size --show-rc --one-file-system --exclude lost+found --list --filter AME -C lzma borg@borg1.xxxxxxxxxxx:mysql::1516960621 .
Synchronizing chunks cache...
Archives: 37, w/ cached Idx: 36, w/ outdated Idx: 0, w/o cached Idx: 1.
Reading cached archive chunk index for 1516413422 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516859822 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516845422 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516935422 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516877822 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516910222 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516881421 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1515981421 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516953421 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516586222 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516939021 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516946222 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516924622 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516931822 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516906621 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516499822 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516942622 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516917422 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516327022 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516892222 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516885022 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516672622 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516870622 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516867021 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516903022 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516928221 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516949822 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516759022 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1515376622 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516921022 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516874221 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516899421 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516913821 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516888622 ...
Merging into master chunks index ...
Fetching and building archive index for 1516957021 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516895821 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516863422 ...
Merging into master chunks index ...
Done.
A ibdata1
A ib_logfile0
A xxxxxxx_assembly2/xxxxxxxxx.ibd
A xxxxxxx_assembly2/xxxxxxxxxSummaries.ibd
A xxxxxxx_assembly2/config.ibd
A xxxxxxx_assembly/xxxxxxxxxSummaries.ibd
A xxxxxxx_assembly/config.ibd
A xxxxxxx_assembly/xxxxxxx.ibd
A xxxxxxx_assembly/xxxxxxxxx.ibd
A mysql/innodb_table_stats.ibd
A mysql/innodb_index_stats.ibd
Traceback (most recent call last):

  File "borg/remote.py", line 248, in serve

  File "borg/repository.py", line 423, in commit

  File "borg/repository.py", line 770, in compact_segments

AssertionError: Corrupted segment reference count - corrupted index or hints

Borg server: Platform: Linux haskell 3.16.0-5-amd64 #1 SMP Debian 3.16.51-3+deb8u1 (2018-01-08) x86_64
Borg server: Linux: debian 8.10
Borg server: Borg: 1.1.4  Python: CPython 3.5.4
Borg server: PID: 14173  CWD: /srv/borg/root@vm-mysql-1
Borg server: sys.argv: ['borg', 'serve', '--restrict-to-path', '/srv/borg/root@vm-mysql-1']
Borg server: SSH_ORIGINAL_COMMAND: 'borg serve --umask=077 --info'
Platform: Linux vm-mysql-1 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04) x86_64
Linux: debian 9.3
Borg: 1.1.4  Python: CPython 3.5.4
PID: 21835  CWD: /mnt
sys.argv: ['borg', 'create', '-vsx', '--files-cache=mtime,size', '--show-rc', '--one-file-system', '--exclude', 'lost+found', '--list', '--filter', 'AME', '-C', 'lzma', 'borg@borg1.xxxxxxxxxxx:mysql::1516960621', '.']
SSH_ORIGINAL_COMMAND: None

terminating with error status, rc 2

ThomasWaldmann commented 4 years ago

https://github.com/borgbackup/borg/blob/master/docs/changes.rst#pre-114-potential-data-corruption-issue

Just to get that off the table (if it does not apply): did you have that problem while running 1.1.3?

ThomasWaldmann commented 4 years ago

I also had a look at the changelog for 1.1.4, but nothing there that seems to have to do with your problem:

https://github.com/borgbackup/borg/blob/master/docs/changes.rst#version-114-2017-12-31

bd830dbd7432 commented 4 years ago

https://github.com/borgbackup/borg/blob/master/docs/changes.rst#pre-114-potential-data-corruption-issue

Just to get that off the table (if it does not apply): did you have that problem while running 1.1.3?

I have no record of that on our issue tracking system.

However, we only used Borg 1.1.3 for a few days:

2018-01-02 - Borg 1.1.3 -> Borg 1.1.4 2017-12-18 - Borg 1.0.11 -> Borg 1.1.3

bd830dbd7432 commented 4 years ago

Another RAM test inside Borg server VM. The VM has 2GB RAM, and for the test, it was allocated 1.6GB. It ran for 2 hours without detecting problems:

 root@vm-borg-1:~# free
              total        used        free      shared  buff/cache   available
Mem:        2042904       92804     1829812        9200      120288     1809976
Swap:       2094076        9692     2084384
root@vm-borg-1:~# timeout 2h memtester 1600m
memtester version 4.3.0 (64-bit)
Copyright (C) 2001-2012 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 1600MB (1677721600 bytes)
got  1600MB (1677721600 bytes), trying mlock ...locked.
Loop 1:
  Stuck Address       : ok        
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok        
  Block Sequential    : ok        
  Checkerboard        : ok        
  Bit Spread          : ok        
  Bit Flip            : ok        
  Walking Ones        : ok        
  Walking Zeroes      : ok        
  8-bit Writes        : ok
  16-bit Writes       : ok

Loop 2:
  Stuck Address       : ok        
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok        
  Block Sequential    : ok        
  Checkerboard        : ok        
  Bit Spread          : ok        
  Bit Flip            : ok        
  Walking Ones        : ok        
  Walking Zeroes      : ok        
  8-bit Writes        : ok
  16-bit Writes       : ok

(...)

Loop 17:
  Stuck Address       : ok        
  Random Value        : \root@vm-borg-1:~#

bd830dbd7432 commented 4 years ago

I created a script for "stressing" Borg backups, with the goal of reproducing the corruption problem. The script is running on my work machine, and Borg repo is on the same VM that stores production Borg repos, which are having corruption problems.

The script is running for 17 hours straight without a sign of corruption:

#!/bin/bash

set -e -u -x -o pipefail

DIR="/srv/tmp/borg-stress"
REPO="borg@borg1.xxxxxxxxxxxxx:borg-stress"

mkdir -p "$DIR"
borg init -e none "$REPO" || true

trap date EXIT # print date on exit

while :; do
    # A random number between 1 and 100000, for using with file
    # names. Thus, we well have a maximum of 100000 files in the
    # test repo.
    set +x
    # Creating/modifying 10 files at a time for backup.
    for _ in $(seq 10); do
        random=$(shuf -i 1-100000 -n 1)
        # Writing random bytes, with $random size.
        dd if=/dev/urandom of="${DIR}/${random}" bs="${random}" count=1 2>/dev/null
    done
    set -x
    borg create -vps -C zstd "${REPO}::{now}" "$DIR"
    # Probability of running on 1 in 15 (sort of?).
    r=$(( RANDOM % 15 ))
    if (( r == 0 )); then
        borg check -vp "$REPO"
        borg prune -vp --list --keep-minutely 15 "$REPO" # churn!
    fi
done

bd830dbd7432 commented 4 years ago

I made some tweaks on the script, and it ran for three days. Couldn't reproduce repository corruption.

#!/bin/bash

set -e -u -x -o pipefail

DIR="/srv/tmp/borg-stress"
REPO="borg@borg1.xxxxxxxxxxxxxxxx:borg-stress"

mkdir -p "$DIR"
borg init -e none "$REPO" || true

trap date EXIT # print date on exit

while :; do
    # A random number between 1 and 500000, for using with file
    # names. Thus, we well have a maximum of 500000 files in the
    # test repo.
    set +x
    for i in $(seq 10000); do
        random=$(shuf -i 1-500000 -n 1)
        dd if=/dev/urandom of="${DIR}/${random}" bs="$((i * 2))" count=1 2>/dev/null
    done
    set -x
    borg create -vps -C none "${REPO}::{now}" "$DIR"
    # Probability of running on 1 in 15 (sort of?).
    r=$(( RANDOM % 15 ))
    if (( r == 0 )); then
        borg check -vp "$REPO"
        borg prune -vp --list --keep-minutely 30 "$REPO" # churn!
    fi
done

bd830dbd7432 commented 4 years ago

Recreated Borg server VM from scratch on another physical server, already in production and without any historical issues. Let's see how it goes.

pgerber commented 4 years ago

I'm seeing a very similar issue here. I'm still trying to figure out if there is some hardware issue but I already tried #4831.

Output from Borg ~1.0.9~ 1.1.9 (as shipped with Debian 10 (Buster):

$ borg check --verify --debug ~/borg/host03d.old/
using builtin fallback logging configuration
37 self tests completed in 0.07 seconds
Starting repository check
Verified integrity of /home/user/borg/host03d.old/index.12820
Read committed index of transaction 12820
Segment transaction is    12820
Determined transaction is 12820
Found 5591 segments
finished segment check at segment 12820
Starting repository index check
Index mismatch for key b'C\xb9\x06P.\x948\xfb\t0r\xd2\x0c\xed:l]\xa1\xa3l\xb3\x85t\x96\x8b\x1b;\xb2\xb6\xab\x99\x00'. (12703, 201309690) != (-1, -1)
Finished full repository check, errors found.

Output using the 1.1-maint branch:

$ ~/.local/bin/borg --version
borg 1.1.11.dev107+g0d9e88d9
$ ~/.local/bin/borg check --verify --debug ~/borg/host03d.old/
using builtin fallback logging configuration
35 self tests completed in 0.08 seconds
Starting repository check
Verified integrity of /home/user/borg/host03d.old/index.12820
Read committed index of transaction 12820
Segment transaction is    12820
Determined transaction is 12820
Found 5591 segments
Starting repository index check
Index object count match.
ID: 43b906502e9438fb093072d20ced3a6c5da1a36cb38574968b1b3bb2b6ab9900 rebuilt index: (12703, 201309690) committed index: <not found>     
Completed repository check, no problems found.
Starting archive consistency check...
Verified integrity of /home/user/borg/host03d.old/index.12820
Starting cryptographic data integrity verification...
Repo/Chunks index object count vs. segment files object count mismatch.
Repo/Chunks index: 168180 objects != segment files: 168179 objects
Finished cryptographic data integrity verification, verified 168179 chunks with 0 integrity errors.
TAM-verified manifest
Analyzing archive main-2019-11-24T23:21:02 (1/117)
Analyzing archive main-2019-12-01T23:21:01 (2/117)
Analyzing archive main-2019-12-08T23:21:02 (3/117)
Analyzing archive vm_100-2019-12-09T19:47:15 (4/117)
Analyzing archive vm_102-2019-12-10T19:30:02 (5/117)
Analyzing archive vm_103-2019-12-11T20:12:55 (6/117)
…
Analyzing archive main-2020-01-03T15:21:01 (117/117)
Archive consistency check complete, no problems found.

There doesn't appear to be any more information in the output and the check is now considered succeeded. Was that intended?

pgerber commented 4 years ago

I modified Borg to write the committed index and the recreated index to disk. When I diff the two, there is no difference yet still a missing item is reported:

user@tocco-borg:/tmp$ ~/.local/bin/borg check --verify --debug ~/borg/host03d.old/
using builtin fallback logging configuration
35 self tests completed in 0.08 seconds
Starting repository check
Verified integrity of /home/user/borg/host03d.old/index.12820
Read committed index of transaction 12820
Segment transaction is    12820
Determined transaction is 12820
Found 5591 segments
Starting repository index check
Index object count match.
ID: 43b906502e9438fb093072d20ced3a6c5da1a36cb38574968b1b3bb2b6ab9900 rebuilt index: (12703, 201309690) committed index: <not found>     
Completed repository check, no problems found.
Starting archive consistency check...
…
user@tocco-borg:/tmp$ ls -lh *.idx
-rw------- 1 user user 14M Jan  4 17:10 disk.idx
-rw------- 1 user user 14M Jan  4 17:10 mem.idx
user@tocco-borg:/tmp$ sort disk.idx >disk.idx.sorted
user@tocco-borg:/tmp$ sort mem.idx >mem.idx.sorted
user@tocco-borg:/tmp$ diff disk.idx.sorted mem.idx.sorted
user@tocco-borg:/tmp$

I also tested a small change that copied the indexes into a new dict. After that no more error was reported:

$ ~/.local/bin/borg check --verify --debug ~/borg/host03d.old/
using builtin fallback logging configuration
35 self tests completed in 0.08 seconds
Starting repository check
Verified integrity of /home/user/borg/host03d.old/index.12820
Read committed index of transaction 12820
Segment transaction is    12820
Determined transaction is 12820
Found 5591 segments
Starting repository index check
Index object count match.
Completed repository check, no problems found.
Starting archive consistency check...
Verified integrity of /home/user/borg/host03d.old/index.12820
Starting cryptographic data integrity verification...
…

Either I'm missing something here or something is going horribly wrong.

ThomasWaldmann commented 4 years ago

@pgerber Debian buster has borgbackup 1.1.9, not 1.0.9.

If you meant that, please edit your post.

ThomasWaldmann commented 4 years ago

ID: 43b906502e9438fb093072d20ced3a6c5da1a36cb38574968b1b3bb2b6ab9900
     rebuilt index: (12703, 201309690) committed index: <not found>     
Completed repository check, no problems found.

That looks incorrect. I'll have a look at the code...

Starting archive consistency check...
...
Repo/Chunks index object count vs. segment files object count mismatch.
Repo/Chunks index: 168180 objects != segment files: 168179 objects
...
Archive consistency check complete, no problems found.

Also a bit strange, but might be correct if that one chunk is not used by any archive.

ThomasWaldmann commented 4 years ago

/home/user/borg/host03d.old/index.12820 could you give that file to me for further analysis?

The repo index does not contain private data btw, it just maps ID -> (segment, offset). No filenames, no file contents, no file meta data in there.

ThomasWaldmann commented 4 years ago

Your "small change" is interesting, should behave the same as before, but does not.

I'ld like to reproduce that with your index file and dig in a bit deeper to see whether that is a bug, a corrupted hash table or a hw issue.

ThomasWaldmann commented 4 years ago

BTW, if you could keep the repo (or a copy of it in "unrepaired" state) that produces this strange effect, that might be useful in case I can not reproduce / find the issue just with the index.

pgerber commented 4 years ago

@pgerber Debian buster has borgbackup 1.1.9, not 1.0.9.

If you meant that, please edit your post.

Corrected.

/home/user/borg/host03d.old/index.12820 could you give that file to me for further analysis?

The repo index does not contain private data btw, it just maps ID -> (segment, offset). No filenames, no file contents, no file meta data in there.

The index as requested: index.12820.xz.

BTW, if you could keep the repo (or a copy of it in "unrepaired" state) that produces this strange effect, that might be useful in case I can not reproduce / find the issue just with the index.

Will do.

bd830dbd7432 commented 4 years ago

Hello!

Unfortunately, we had another backup corruption on 2019-12-24, three days after reinstalling Borg VM server on another physical server (https://github.com/borgbackup/borg/issues/4829#issuecomment-568184376):

Creating archive at "borg@borg1.xxxxxxxxxxxxxxxxx:prometheus::{now}"
A wal/00006332
A wal/00006331
Traceback (most recent call last):

  File "borg/remote.py", line 247, in serve

  File "borg/repository.py", line 442, in commit

  File "borg/repository.py", line 789, in compact_segments

AssertionError: Corrupted segment reference count - corrupted index or hints

Borg server: Platform: Linux vm-borg-1 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64
Borg server: Linux: debian 10.2 
Borg server: Borg: 1.1.10  Python: CPython 3.5.7 msgpack: 0.5.6
Borg server: PID: 28652  CWD: /srv/borg/root@vm-prometheus-1
Borg server: sys.argv: ['borg', 'serve', '--restrict-to-path', '/srv/borg/root@vm-prometheus-1']
Borg server: SSH_ORIGINAL_COMMAND: 'borg serve --umask=077 --info'
Platform: Linux vm-prometheus-1 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64
Linux: debian 10.1 
Borg: 1.1.10  Python: CPython 3.5.7 msgpack: 0.5.6
PID: 10113  CWD: /mnt
sys.argv: ['borg', 'create', '-vsx', '--files-cache=mtime,size', '--show-rc', '--one-file-system', '--exclude', 'lost+found', '--list', '--filter', 'AME', '-C', 'zstd', 'borg@borg1.xxxxxxxxxxx:prometheus::{now}', '.']
SSH_ORIGINAL_COMMAND: None

terminating with error status, rc 2

No more corruption after this day, until now.

ThomasWaldmann commented 4 years ago

OK, finally having some time to look into this - working on this now...

ThomasWaldmann commented 4 years ago

OK, just as an update: I have played with the index file given by @pgerber.

That hashtable has one entry that is found by NSIndex.iteritems(), but not by NSIndex.__getitem__ or NSIndex.__contains__. This is why copying all stuff into a python dict fixes the issue.

The reason why the entry is not found is that hashindex_lookup starts at index 163136, finds an empty bucket at that index and gives up.

That is correct behavior of the algorithm. Finding an empty bucket while looking for a bucket with a matching key means that we are finished, because either we should find a used bucket with another key (hash function collision), a deleted bucket (which once was used for a collision, but then was deleted - a tombstone entry) or a bucket with the correct key.

The bucket with the key we search for is at index 163139.

163136 b'0000000000000000000000000000000000000000000000000000000000000000' b'ffffffff' b'00000000'
163137 b'dbdc7a7f2935228de94cc586d6892df219ffeea1e4019e05d10913fe82fa005f' b'5d2a0000' b'a598fc0a'
163138 b'4fea6a91e72554fc70722f86c80d8862ddd9b747e25c86327346121181ef43ca' b'feffffff' b'58f3401c'
163139 b'43b906502e9438fb093072d20ced3a6c5da1a36cb38574968b1b3bb2b6ab9900' b'9f310000' b'fabdff0b'

Now have to find out the root cause why the index file looks like that.

ThomasWaldmann commented 4 years ago

@bd830dbd7432 @pgerber thanks a lot for finding this and helping here in this ticket.

I guess I have found the root cause and fixed it in #4967.

The problem was not that the hashindex lookup was broken in the code, but that the hashtable was brought into an invalid state by hashindex_set writing the new entry to an incorrect index in the hashtable. So it was there (and thus found by iteration), but could not get discovered by lookup.

This happened under these conditions:

trying to hashindex_set some value
hashtable not having too many entries (not too full)
set wants to use an empty bucket and by doing that, minimum amount of empty buckets goes below threshold and triggers a hashtable rebuild ("same size")
hashtable rebuild may change positions of entries in the new table (compared to old table) as well as amount and positions of empty buckets and if that makes the index we wanted to use incorrect for the current set operation, the bug occurred because the index was not recomputed after the hashtable rebuild.

ThomasWaldmann commented 4 years ago

Consequences of bug for the hashtables / borg indexes:

set operation puts data at incorrect place
- if that place was empty, only check/lookup fails, no other harm
- worse: if that place was used, another entry gets overwritten in the hashtable
check (key in hashtable) or lookup (hashtable[key]) fails although entry exists, iteration will still find it.
for the worse case, that other entry is lost in the hashtable, iteration also won't find it any more

So, pretty bad as far as the hashtables / indexes are concerned.

Affects:

repo index (mapping chunk id -> segment, offset)
chunks index (mapping chunk id -> size, csize, refcount)

But there are good news:

seems to be not too frequent
borg check repair can rebuild the index from repo segment files, so there is no data loss.

borgbackup / borg

Borg repositories constantly getting corrupted on server grade hardware #4829

Have you checked borgbackup docs, FAQ, and open Github issues?

Is this a BUG / ISSUE report or a QUESTION?

System information. For client/server mode post info for both machines.

Your borg version (borg -V).

Operating system (distribution) and version.

Hardware / network configuration, and filesystems used.

How much data is handled by borg?

Full borg commandline that lead to the problem (leave away excludes and passwords)

Describe the problem you're observing.

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

Include any warning/errors/backtraces from the system logs

Clients

Borg "server"

Backup method

For virtualization hosts

For VMs

Observations