borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/

Reproducible "Data integrity error: Invalid segment entry size 0" on freshly created archive on SMB network fs #8409

Closed: zackw closed this issue 3 days ago

zackw commented 5 days ago

Have you checked borgbackup docs, FAQ, and open GitHub issues?

Yes

Is this a BUG / ISSUE report or a QUESTION?

Bug

System information. For client/server mode post info for both machines.

The machine running borg: Gentoo Linux 2.15 (kernel 6.6.47).
The SMB file server: TrueNAS CORE 13.0-U6.2, ZFS filesystem, spinning disks.
Network: local wired Ethernet.

Both have x86-64 CPUs.

$ mount | grep cifs
//[redacted]/backup on /media/backup type cifs (rw,nosuid,nodev,relatime,vers=3.1.1,cache=strict,username=storage,uid=0,noforceuid,gid=0,noforcegid,addr=[redacted],file_mode=0644,dir_mode=0755,soft,nounix,setuids,serverino,mapposix,noperm,reparse=nfs,rsize=4194304,wsize=4194304,bsize=1048576,retrans=1,echo_interval=60,actimeo=1,closetimeo=1)

Your borg version (borg -V).

1.4.0

How much data is handled by borg?

                       Original size      Compressed size    Deduplicated size
All archives:              244.97 GB            206.51 GB            199.99 GB

                       Unique chunks         Total chunks
Chunk index:                  285569               398818

Full borg command line that led to the problem (excludes and passwords omitted)

Where /media/backup is the SMB mount described above:

# succeeds
borg init -e authenticated /media/backup/borg

# succeeds
borg create /media/backup/borg::2024-09-20 /home/zackw

# throws errors as described below
borg check /media/backup/borg

Describe the problem you're observing.

Backing up my home directory on my primary workstation to an SMB remote filesystem, the very first archive appears to be corrupted during creation: a "borg check" run immediately after the very first "borg create" produces floods of errors like

Data integrity error: Invalid segment entry size 0 - too small [segment 4, offset 266609618]
Data integrity error: Invalid segment entry size 0 - too small [segment 5, offset 225477143]
Data integrity error: Invalid segment entry size 0 - too small [segment 6, offset 372770211]
...
Index object count mismatch.
committed index: 285570 objects
rebuilt index:   169951 objects
ID: 27905ae9fd06b277fedd9565d2de3f3e636ac1cb29bcbf8d41597ac172a1290e rebuilt index: <not found>      committed index: (5, 65211820)
ID: 88187b7ae35d9410461cddd184ce363cc94258d3a9f6a6b88f1f898fbb23b935 rebuilt index: <not found>      committed index: (8, 314648306)
ID: 867d12cb786ede3c95bcb915f629580fee606ddd49f92078d0d6266748e4d34e rebuilt index: <not found>      committed index: (5, 20130112)
ID: 8e51fb30fb550f29a122aa3aea137570b53b503356f09f14cb80d8c58b80eb74 rebuilt index: <not found>      committed index: (11, 36909673)
ID: be55c1377f4e953e042a43fe760c3cc405c0c7d2630ea4b58e4afe1f59d45989 rebuilt index: <not found>      committed index: (8, 193632887)
ID: 7ea4731129094cfdc885e8a01674a5879e777c954b2a05405d55c4e6364da18e rebuilt index: <not found>      committed index: (282, 221964919)
...

Full log is attached.

I hex-dumped segment 4 beginning at offset 266609152 (266609618 rounded down to a 512-byte boundary), and I see a long stretch of binary zeroes starting at offset 266609618:

Hex dump:

```
00000000: 1e1a 01b4 0001 4918 0f5b 0901 2f5f 325d  ......I..[../_2]
00000010: 094d 03f2 1500 7923 0067 0900 6f0d 022d  .M....y#.g..o..-
00000020: 0602 9f01 01ea 0500 2302 024f 1701 6859  ........#..O..hY
00000030: 0295 0902 2000 01ac 0924 6861 385a 02b4  .... ....$ha8Z..
00000040: 0800 1277 017b 1601 55be 0092 0920 6772  ...w.{..U.... gr
00000050: 3fce 000a 0000 5701 0050 0000 3400 3072  ?.....W..P..4.0r
00000060: 6564 ae87 0030 0600 1b00 4162 6c75 6550  ed...0....AblueP
00000070: 0001 fa02 0022 5c0f 3c13 070f c309 0501  ....."\.<.......
00000080: 8401 1273 5803 00b0 0201 1000 01b3 0901  ...sX...........
00000090: 4c02 073b 0d08 ba09 0459 0302 7602 0121  L..;.....Y..v..!
000000a0: 0510 226a 6512 2c48 8701 e002 0e0f 0000  .."je.,H........
000000b0: f009 3220 2029 3d00 5978 2e73 6563 240a  ..2  )=.Yx.sec$.
000000c0: 0f71 0000 a363 2832 2c20 342c 2036 294c  .q...c(2, 4, 6)L
000000d0: 200a 7300 0059 1420 746f 6e00 416c 696d   .s..Y. ton.Alim
000000e0: 6519 0100 0d00 6064 6f64 6765 721e 0103  e.....`dodger...
000000f0: ae4c 0079 0003 0f03 0069 0a07 c618 132a  .L.y.....i.....*
00000100: be23 0bf4 0200 d226 236c 733a 0f11 6fec  .#.....&#ls:..o.
00000110: 5d00 5e67 00bd 5900 8e1b 1160 e601 6660  ].^g..Y....`..f`
00000120: 7369 7a65 6021 0107 bc12 0008 0303 3312  size`!........3.
00000130: 049c 0114 2a18 6c03 8107 0151 0600 730d  ....*.l....Q..s.
00000140: 029f 0801 a70b 069e 0800 5103 0104 0800  ..........Q.....
00000150: ab0d 0004 0712 606b 0103 7a00 054f 8601  ......`k..z..O..
00000160: 8308 003d 0201 1f11 012c 0231 6963 6b9d  ...=.....,.1ick.
00000170: 7e20 732e 060e 09b8 7000 d703 03d7 0a05  ~ s.....p.......
00000180: 7000 00cc 0a02 f7a7 0167 6600 a907 01e3  p........gf.....
00000190: 0b00 b200 3043 6f6d 0a27 07ad 0007 e006  ....0Com.'......
000001a0: 00ad 0000 2d04 01c9 000b 1900 017b 0001  ....-........{..
000001b0: 8f1c 02d2 7801 ca08 0153 5640 736b 696d  ....x....SV@skim
000001c0: 9499 01a8 ba03 ed03 9065 7861 6d70 6c65  .........example
000001d0: 3f0a 0000 0000 0000 0000 0000 0000 0000  ?...............
000001e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
*
00000e00: 0451 3766 3766 37ae 0d07 9b01 b130 2e35  .Q7f7f7......0.5
00000e10: 656d 2030 3b0a 7d0a 2d13 5f71 756f 7465  em 0;.}.-._quote
00000e20: 4100 0403 e702 030f 0451 3a20 302e 323d  A........Q: 0.2=
00000e30: 00b6 2e37 3565 6d3b 0a7d 0a68 7238 0103  ...75em;.}.hr8..
00000e40: 2501 01e4 0d04 6f01 0bd7 0001 8802 0709  %.....o.........
00000e50: 0e34 3737 3793 0262 3a20 3238 7078 9800  .4777..b: 28px..
00000e60: 2564 6cb0 0003 e302 01af 0056 646c 2064  %dl........Vdl d
00000e70: 641a 0004 c402 027a 0302 7a02 032f 0002  d......z..z../..
00000e80: 1300 0232 001f 7472 0303 287d 0aed 0201  ...2..tr..(}....
00000e90: 9900 0165 0024 756c e702 06d6 1361 6369  ...e.$ul.....aci
00000ea0: 7263 6c65 9302 2169 648b 010b 2c03 048b  rcle..!id...,...
00000eb0: 0001 4200 4170 7265 2c32 110f 5501 0403  ..B.Apre,2..U...
00000ec0: 9601 0390 0204 6b04 01c5 0001 1907 7d3a  ......k.......}:
00000ed0: 2023 3333 333b 1a14 0254 1512 207d 0f06   #333;...T.. }..
00000ee0: 7701 0941 0002 0601 01f2 0214 200d 0511  w..A........ ...
00000ef0: 201f 0405 c401 1131 0f00 01a7 0052 3a6e   ......1.....R:n
```

If I put the borg repository on a local filesystem instead, this does not happen.

This seems closely related to #8233. However, I am confident (at least as far as one can be confident with ZFS) that the server's filesystem is not corrupted. It could still be a bug in the SMB client or server rather than in borg itself.

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

Yes, this is 100% reproducible in my environment.

Include any warning/errors/backtraces from the system logs

(none)

zackw commented 5 days ago

borg-check.log

ThomasWaldmann commented 5 days ago

If you can't reproduce without using SMB, then the problem is likely in SMB, not in borg.

My comment in #8233 already mentioned network-fs-induced corruption / malfunction.

zackw commented 4 days ago

@ThomasWaldmann I can certainly believe that it's an SMB issue, but do you have any thoughts on how to work out a minimized test case for the SMB people, that is, to identify the sequence of file operations that produces a corrupt segment file on the server? The only thing I can think of is to run a delta minimization on the list of files to back up, which will take a very long time and may not work, and then try to dope something out of an strace log of "borg create" with the minimal set, which also may not work.

ThomasWaldmann commented 4 days ago

@zackw yeah, reduce the input data set for as long as it still reproduces the issue, and use strace. I guess from there a Samba developer can help better.

zackw commented 3 days ago

Reducing the data set didn't take nearly as long as I thought it would, and I was able to identify the exact sequence of file operations borg performs that triggers the data corruption:

int fd = open("/smb/server/borg-repo/data/0/2", O_WRONLY|O_CREAT|O_EXCL|O_CLOEXEC, 0666);
/* error checking omitted; len1..len3 are the sizes of three consecutive writes */
write(fd, buffer1, len1);
write(fd, buffer2, len2);
write(fd, buffer3, len3);
fdatasync(fd);
posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
close(fd);

About one time in five, some amount of the data from the second of the three writes is replaced by binary zeroes. With a local filesystem there is no corruption; if the posix_fadvise call is removed, the corruption happens less often, but it still happens sometimes. I'm now convinced this is not a bug in borg. I have an independent reproduction script, and I'm going to take this up with the SMB people.
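For anyone who wants to test other client/server combinations, the sequence above can be turned into a self-contained reproducer along these lines. This is my own illustrative sketch, not the actual reproduction script mentioned in this comment; the chunk sizes, fill pattern, and paths are invented. It performs the same open / write x3 / fdatasync / fadvise / close sequence with deliberately unaligned write sizes, then reads the file back and counts mismatched bytes:

```c
/* Stand-alone reproducer sketch: mimic the segment-file write
   sequence, then read the file back and count corrupted bytes.
   Chunk sizes are deliberately NOT multiples of the 4096-byte page
   size; the fill pattern never contains a 0x00 byte, so any zeroes
   that appear on read-back are corruption. */
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns the number of corrupted bytes, or (size_t)-1 on error. */
static size_t run_repro(const char *path)
{
    const size_t lens[3] = { 300001, 500003, 100007 };
    const size_t total = lens[0] + lens[1] + lens[2];

    unsigned char *buf = malloc(total);
    unsigned char *chk = malloc(total);
    if (!buf || !chk) { free(buf); free(chk); return (size_t)-1; }
    for (size_t i = 0; i < total; i++)
        buf[i] = (unsigned char)(i % 255) + 1;    /* never 0x00 */

    unlink(path);
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL | O_CLOEXEC, 0666);
    if (fd < 0) { free(buf); free(chk); return (size_t)-1; }

    const unsigned char *p = buf;
    for (int i = 0; i < 3; i++) {                 /* the three writes */
        if (write(fd, p, lens[i]) != (ssize_t)lens[i]) {
            close(fd); free(buf); free(chk); return (size_t)-1;
        }
        p += lens[i];
    }
    if (fdatasync(fd) != 0) { close(fd); free(buf); free(chk); return (size_t)-1; }
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED); /* the suspect call */
    close(fd);

    /* read everything back and count bytes that do not match */
    fd = open(path, O_RDONLY);
    if (fd < 0 || read(fd, chk, total) != (ssize_t)total) {
        if (fd >= 0) close(fd);
        free(buf); free(chk); return (size_t)-1;
    }
    close(fd);

    size_t bad = 0;
    for (size_t i = 0; i < total; i++)
        if (chk[i] != buf[i]) bad++;
    free(buf); free(chk);
    return bad;
}
```

Pointed at a path on an affected SMB mount, a nonzero count would correspond to the corruption described above; on a local filesystem it should consistently report 0.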

ThomasWaldmann commented 3 days ago

@zackw Ah, good you found it!

As an explanation of the POSIX_FADV_DONTNEED: borg always appends new data to these segment files, closes each one when it is "full" (usually at 500 MiB), and continues writing to the next one. It may write huge amounts of data that way (up to terabytes and more), but it never re-reads data written in the same borg run. To avoid wasting huge amounts of memory on useless fs cache, it makes that fadvise call.
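As an aside for readers, the access pattern described above might be sketched roughly like this. This is an illustration of the described behavior, not borg's actual code; append_chunk, finish_segment and SEGMENT_LIMIT are invented names:

```c
/* Sketch of the described access pattern (not borg's actual code):
   append chunks to a segment file until it reaches a size limit,
   then fdatasync + POSIX_FADV_DONTNEED + close, and move on to the
   next numbered segment file. */
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define SEGMENT_LIMIT (500u << 20)   /* ~500 MiB, as described above */

/* Sync, drop the cached pages (they are never re-read in this run), close. */
static int finish_segment(int fd)
{
    if (fdatasync(fd) != 0)
        return -1;
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    return close(fd);
}

/* Append one chunk, rotating to a new segment file when full. */
static int append_chunk(const char *dir, int *seg_no, int *fd,
                        size_t *seg_used, const void *chunk, size_t len)
{
    if (*fd >= 0 && *seg_used + len > SEGMENT_LIMIT) {
        if (finish_segment(*fd) != 0)
            return -1;
        *fd = -1;
    }
    if (*fd < 0) {
        char path[4096];
        snprintf(path, sizeof path, "%s/%d", dir, (*seg_no)++);
        *fd = open(path, O_WRONLY | O_CREAT | O_EXCL | O_CLOEXEC, 0666);
        if (*fd < 0)
            return -1;
        *seg_used = 0;
    }
    if (write(*fd, chunk, len) != (ssize_t)len)
        return -1;
    *seg_used += len;
    return 0;
}
```

The fadvise call only affects cache residency: dropping clean, already-synced pages is normally harmless, which is what makes the observed SMB behavior surprising.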

And the fdatasync is of course to get it "on disk" ASAP.

zackw commented 2 days ago

Yeah, I figured that was why the POSIX_FADV_DONTNEED is there. I thought about suggesting an option to disable this call, but in my testing that isn't actually an effective workaround. As far as I can tell, the only things that might be effective workarounds are to ensure that every write to the /data/ files is a whole number of pages (a multiple of 4096 bytes), or to write the entire file in a single write operation at the C level; both are probably infeasible from your end.
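For what it's worth, the "whole number of pages" idea could be sketched as a small staging buffer that only ever issues page-multiple write(2) calls, with a single unaligned tail write at close time. This is purely illustrative, not proposed borg code; paged_writer and all names here are invented:

```c
/* Workaround sketch (invented, not proposed borg code): stage data in
   a buffer and only issue write(2) calls whose length is a whole
   multiple of the 4096-byte page size, except for one unaligned tail
   write when the file is finished. */
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define PAGE 4096
#define BUFPAGES 16            /* 64 KiB staging buffer */

typedef struct {
    int fd;
    size_t used;               /* bytes currently staged */
    unsigned char buf[PAGE * BUFPAGES];
} paged_writer;

/* Write out as many whole pages as are staged; keep the remainder. */
static int pw_flush_pages(paged_writer *w)
{
    size_t whole = (w->used / PAGE) * PAGE;
    if (whole == 0)
        return 0;
    if (write(w->fd, w->buf, whole) != (ssize_t)whole)
        return -1;
    memmove(w->buf, w->buf + whole, w->used - whole);
    w->used -= whole;
    return 0;
}

static int pw_write(paged_writer *w, const void *data, size_t len)
{
    const unsigned char *p = data;
    while (len > 0) {
        size_t room = sizeof w->buf - w->used;
        size_t n = len < room ? len : room;
        memcpy(w->buf + w->used, p, n);
        w->used += n;
        p += n;
        len -= n;
        if (w->used == sizeof w->buf && pw_flush_pages(w) != 0)
            return -1;
    }
    return 0;
}

static int pw_close(paged_writer *w)
{
    if (pw_flush_pages(w) != 0)
        return -1;
    /* the single unaligned tail write */
    if (w->used && write(w->fd, w->buf, w->used) != (ssize_t)w->used)
        return -1;
    if (fdatasync(w->fd) != 0)
        return -1;
    posix_fadvise(w->fd, 0, 0, POSIX_FADV_DONTNEED);
    return close(w->fd);
}
```

Even this only approximates the constraint, since the final tail write is still unaligned unless the total size happens to be a page multiple.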

In case you want to follow along, the report to the Linux SMB client maintainers is here: https://lore.kernel.org/linux-cifs/9bca55e4-8e48-48cc-810e-f82e8ec71bf9@app.fastmail.com

ThomasWaldmann commented 2 days ago

Wow, that is a great bug report, thanks for filing it!

Hopefully, with all that information and reproducer code given, they will be able to reproduce and find the bug.