borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/

backup a big amount of data #216

Closed ThomasWaldmann closed 8 years ago

ThomasWaldmann commented 9 years ago

In the attic issue tracker there are some old, stale and unclosed issues about big backups and/or consistency / corruption events.

It is rather unclear whether they are really a problem of the software or caused by external things (hardware, OS, ...). It is also unclear whether they were already fixed in attic, in msgpack or in borg.

https://github.com/jborg/attic/issues/264 https://github.com/jborg/attic/issues/176 https://github.com/jborg/attic/issues/166

To increase confidence and to be able to close these issues (for borg and/or for attic), it would be nice if somebody with lots of (real) data could regularly use borg (and/or attic) and do multi-terabyte backup tests.

Please make sure that:

Please give feedback below (after you have started testing and again when you have completed it): how much data you have, in how many files, your hardware, how long it took, how much RAM it needed, how large your .cache/borg got, and how large .cache/borg/chunks and .../files are.
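
One way to collect those numbers (just a sketch; /path/to/repo and /data are placeholders, and -v assumes GNU time, which prints peak RAM as "Maximum resident set size"):

/usr/bin/time -v borg create --stats /path/to/repo::test1 /data
# cache sizes requested above (REPO_ID is the directory named after your repo id)
du -sh ~/.cache/borg/REPO_ID/chunks ~/.cache/borg/REPO_ID/files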

ThomasWaldmann commented 9 years ago

Software:

Hardware:

I tested with 5000 directories that all looked like this one:

$ du -sk /mnt/10TB/0/*
246836  /mnt/10TB/0/bin     # linux binaries
250840  /mnt/10TB/0/jpg     # (compressed) pictures
102784  /mnt/10TB/0/ogg     # (compressed) audio
8       /mnt/10TB/0/sparse  # a single 1GB empty sparse file
264828  /mnt/10TB/0/src_txt # text files, excluded from backup
154228  /mnt/10TB/0/tgz     # a single tgz file of that size

Aside from the empty sparse file, there were NO duplicate files (I modified the data with a counter value, so there are no [or at least not many] duplicate chunks in them).

Here is the script I used:

borg init --encryption repokey /mnt/5TB/borg
/usr/bin/time \
    borg create --progress --stats --compression lz4 \
    --checkpoint-interval 3600 --chunker-params 19,23,21,4095 \
    --exclude '/mnt/10TB/*/src_txt' \
    /mnt/5TB/borg::test /mnt/10TB

I decided to exclude the text files because they slowed down the backup significantly and I didn't want to wait that long.

Here is the result of the backup script run:

$ /usr/bin/time ./test10TB.sh
Enter passphrase for key /mnt/5TB/borg:
------------------------------------------------------------------------------
Archive name: test
Archive fingerprint: 736476b5c2bebe6f9164d04ef1d75dc3ce29e581c5c0e08fada987b7e630270d
Start time: Thu Oct  1 15:27:20 2015
End time: Sun Oct  4 11:35:12 2015
Duration: 2 days 20 hours 7 minutes 51.98 seconds
Number of files: 7270000

                       Original size      Compressed size    Deduplicated size
This archive:                9.22 TB              3.31 TB              2.68 TB
All archives:               10.87 TB              3.90 TB              3.26 TB

                       Unique chunks         Total chunks
Chunk index:                 8317729             10663690
------------------------------------------------------------------------------
135412.89user 11328.77system 68:07:57elapsed 59%CPU (0avgtext+0avgdata 3387492ma
6312665320inputs+5487594576outputs (146major+6537682minor)pagefaults 0swaps
$ borg list /mnt/5TB/borg
Enter passphrase for key /mnt/5TB/borg: 
test.checkpoint                      Thu Oct  1 14:21:53 2015
test                                 Sun Oct  4 11:34:15 2015

I interrupted the backup once (Ctrl-C) because it appeared slow. Then I removed the os.fsync from the source and it got faster. That is why this checkpoint exists.

After the backup had completed, I tried to remove the checkpoint (not needed any more):

$ borg delete /mnt/5TB/borg::test.checkpoint
Enter passphrase for key /mnt/5TB/borg: 
borg: Error: Data integrity error
Traceback (most recent call last):
  File "/home/tw/w/borg/borg/archiver.py", line 959, in main
    exit_code = archiver.run(sys.argv[1:])
  File "/home/tw/w/borg/borg/archiver.py", line 915, in run
    return args.func(args)
  File "/home/tw/w/borg/borg/archiver.py", line 316, in do_delete
    repository.commit()
  File "/home/tw/w/borg/borg/repository.py", line 156, in commit
    self.compact_segments()
  File "/home/tw/w/borg/borg/repository.py", line 217, in compact_segments
    for tag, key, offset, data in self.io.iter_objects(segment, include_data=Tru
  File "/home/tw/w/borg/borg/repository.py", line 549, in iter_objects
    raise IntegrityError('Segment entry checksum mismatch [offset {}]'.format(of
borg.helpers.IntegrityError: Segment entry checksum mismatch [offset 3097455]

borg: Exiting with failure status due to previous errors

Whoops!

I ran borg check --repository-only a few times and got CRC errors quite frequently, but not reproducibly at the same places, so I suspected a hardware issue.

I tried again with the USB3 disk connected directly to the host system (no hub): checksum issues GONE! So the USB3 hub seems to have caused these troubles.

Happy end:

$ borg check --repository-only /mnt/5TB/borg
Starting repository check...

Repository check complete, no problems found.

Some data (this was after successfully deleting the checkpoint):

$ borg info /mnt/5TB/borg::test
Enter passphrase for key /mnt/5TB/borg:
Name: test
Fingerprint: 736476b5c2bebe6f9164d04ef1d75dc3ce29e581c5c0e08fada987b7e630270d
Hostname: server
Username: tw
Time: Sun Oct  4 11:34:15 2015
Command line: /home/tw/w/borg-env/bin/borg create --progress --stats --compression lz4 --checkpoint-interval 3600 --chunker-params 19,23,21,4095 --exclude /mnt/10TB/*/src_txt /mnt/5TB/borg::test /mnt/10TB
Number of files: 7270000

                       Original size      Compressed size    Deduplicated size
This archive:                9.22 TB              3.31 TB              3.26 TB
All archives:                9.22 TB              3.31 TB              3.26 TB

                       Unique chunks         Total chunks
Chunk index:                 8317532              9043332

# ----------------------

I also did a full `borg extract --dry-run` - worked without problems.

# ----------------------

# repository /mnt/5TB/borg/*
$ du -sh *
4.0K    config
3.0T    data    (this is a directory structure, not a single file)
2.9M    hints.517397
641M    index.517397
4.0K    lock.roster
4.0K    README

# cache ~/.cache/borg/REPOID/*
$ du -sh *
705M    chunks
4.0K    chunks.archive.d
4.0K    config
696M    files
4.0K    lock.roster
4.0K    README
1.4G    txn.active

I didn't measure runtime or memory needs precisely (due to the USB connection, I assumed it wouldn't be super fast), but looking at the index file sizes, I'd estimate it used at least 2GB of memory (chunks + files + index.NNNNNN size), likely a bit more.
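
Rough arithmetic behind that estimate, just adding up the hash table files measured above (actual peak memory may differ somewhat):

705 MB (cache chunks) + 696 MB (cache files) + 641 MB (repo index.517397) ≈ 2.0 GB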

ThomasWaldmann commented 9 years ago

Here is the script I used to generate big amounts of data:

https://github.com/borgbackup/backupdata

ThomasWaldmann commented 9 years ago

BTW, I don't have the 15TB free space any more, so I can't do more big volume tests. 3rd party help / confirmation is appreciated.

alraban commented 9 years ago

So I've got 8+ TB of real data, and if it would be helpful I can try and use borg for my daily backups (I've got redundancy for older backups).

History: I've been using attic for small amounts of personal data, but it proved unsuitable for large backups. I previously (3 or 4 months ago) tried to backup "everything" with attic and hit two major issues: 1) it would routinely crash when it was using enough memory to start swapping in earnest (but well before it exhausted available swap) and 2) I encountered the corruption issue in jborg/attic#264 (after spending several days running the first backup in stages). I saw the bug (and lack of response), and didn't bother trying a second time.

It looks like the chunking options you've added will allow me to reduce the memory pressure to a manageable level, which will make it less of a pain to try again. So I'm game to take a stab at running a full backup with borg and using it for my dailies. Let me know if you still want testing on this point, this was a huge thorn in my side with attic, so I'm excited it's getting some love over here.

ThomasWaldmann commented 9 years ago

@alraban yes, more tests are welcome!

borg has a bit better error messages and shows tracebacks, so in case you run into an issue like https://github.com/jborg/attic/issues/264 there likely will be more information we could use for analyzing it.

For 8+TB, try a target chunk size of ~2MB (--chunker-params 19,23,21,4095); that will produce up to 32x fewer chunks than the default settings. Also, a higher-than-default value for --checkpoint-interval might be useful.
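
Where the 32x comes from (assuming the then-current default chunker params of 10,23,16,4095, i.e. a 16-bit hash mask targeting ~64 KiB chunks):

2^21 B (~2 MiB target chunk size) / 2^16 B (~64 KiB default) = 32, so up to 32x fewer chunks to track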

Use the latest release code please.

fr34k8 commented 9 years ago

@alraban I just checked functionality and consistency over the last few days, using the latest code as a base and starting from scratch every time, as @ThomasWaldmann mentioned.

@alraban let's dig deeper into https://github.com/borgbackup/borg/blob/master/borg/_chunker.c. What do you think?

alraban commented 9 years ago

Just a status update: I started a test run with borg 0.27 last Thursday. The first backup finished yesterday and I'm running an integrity check (with the --repository-only flag), which has been going for about eight hours and I'd estimate is about half-way done based on disk access patterns. Once I've finished the integrity check, I'll extract some files, check them, and then run a second backup, followed by another check and test extraction. At that point I'll post detailed results (system specs, exact command flags used, memory usage, bottlenecks, whatever else would be useful). At the current rate of speed, that will likely be sometime later this week.

Preliminary good news: using your chunker settings, borg made it all the way through a 9.65TB (8.6TiB) backup without crashing or throwing errors, which is already a good improvement over my previously observed behavior with attic 16. We'll see soon whether the corruption issue has also been sorted (fingers crossed).

alraban commented 9 years ago

So I finished up, and there appears to be no corruption issue with a large data set. I'll continue doing testing (through regular backups), but I'm fairly comfortable that things are working. You requested info about testing environment and methodology above and I'll provide it here:

Software on both the Source and the Destination machines:

borgbackup 0.27 (release)
msgpack 0.4.6
python 3.4.3
Debian Jessie (all updates)
ext4 FS

Hardware:

Source (File Server):
Intel 4790S Processor 4 x 3.2GHz
8GB RAM (no ECC)
5 internal multi-terabyte drives (no RAID or LVM)
All disks have clean SMART data before and after backup

Destination (Dedicated Backup Box):
Intel 1037U Processor 2 x 1.8Ghz
4GB RAM (no ECC)
13TB worth of internal drives in a single LVM volume (no RAID)
All disks have clean SMART data before and after backup

After init, I ran: borg create -sp -C lz4 --chunker-params 19,23,21,4095 $USER@$DESTINATION:/home/$USER/$SOURCE.borg::$(date +%Y-%m-%d) /home/$USER

A few days later I got this result

------------------------------------------------------------------------------ 
Archive name: 2015-10-15
Archive fingerprint: ff0fbfa2eea17982a595d59f5af9e8fadd43b6fb47c9119995d35d8d5abf205a
Start time: Thu Oct 15 14:54:38 2015
End time: Sun Oct 18 23:44:31 2015
Duration: 3 days 8 hours 49 minutes 53.30 seconds
Number of files: 308772

                       Original size      Compressed size    Deduplicated size
This archive:                9.65 TB              9.61 TB              9.45 TB
All archives:                9.65 TB              9.61 TB              9.45 TB

                       Unique chunks         Total chunks
Chunk index:                 3914527              4019458
------------------------------------------------------------------------------

I extracted a few files, and even browsed the repo as a fuse mount. Everything worked as expected (albeit slightly slower than a normal borg fuse mount). Then I ran a --repository-only check on it and it came back clean after about 30 hours.

I removed a bunch of files and added some large new ones and ran another backup, and was pleasantly surprised to see the following less than an hour later:

------------------------------------------------------------------------------ 
Archive name: 2015-10-20
Archive fingerprint: fcdd7446b6e79bf64062c882e04ebeef750aec69ab06723f4ceeded1e74869e7
Start time: Tue Oct 20 16:06:29 2015
End time: Tue Oct 20 17:03:48 2015
Duration: 57 minutes 18.86 seconds
Number of files: 216414

                       Original size      Compressed size    Deduplicated size
This archive:                9.40 TB              9.38 TB             91.01 GB
All archives:               19.06 TB             18.99 TB              9.54 TB

                       Unique chunks         Total chunks
Chunk index:                 3960575              7856075
------------------------------------------------------------------------------

I then extracted a directory and browsed the new archive as a fuse mount and everything worked as expected. I'm running another check right now, but I'm fairly confident that the issue is addressed as I couldn't extract any files at all the last time this issue hit (with attic). So I'm a happy camper.

Performance Notes:

Memory performance: The maximum used on the destination machine was about 2.2GB above the machine's idle memory usage. The machine is a headless, backup-only machine, so it runs almost nothing in the background. Of that 2.2GB, only about .4GB was ever registered to the borg process; the rest of the memory appeared to be tied up in slab cache (which steadily grew as the backup went along and then vanished when it ended). Interestingly, the follow-up backup didn't use anywhere near as much memory (only about .5GB total). Excellent memory performance for such a large backup, and it never got close to swapping. I'm guessing I could probably tune the chunk parameters to make fuller use of available memory, but unless that would result in a speed gain I'm probably not interested.

Compression and deduplication performance: At first glance the compression performance looks bad, but 99% of the files being backed up are digital media, so they are already pre-compressed to a high degree. I only enabled compression to make the test more realistic, and because quick tests didn't show any speed advantage to not compressing in my test case (I think because compression takes place on the Source machine, which in my case has CPU cycles to burn). Deduplication performance was also quite good. I removed a few hundred gigabytes of files, added almost exactly 89GB of new files, and moved around/renamed some old ones. So the deduplicator did a pretty great job, only picking up about 2GB worth of "static" on a 9.4TB dataset in a little less than an hour!

Time performance: the bottleneck in my case was the CPU on the Destination machine. The two machines are connected by a Gigabit link and it never got more than a third saturated, averaging about 35MB/sec. Similarly disk throughput utilization never got above 60%. One CPU core on the destination machine remained pegged at 100% for the entire backups and checks, but the other core never moved (I'm guessing the process is single-threaded other than for compression).

Let me know if anymore information or testing would be useful. I'll keep running regular backups at intervals with release versions and report back if anything untoward happens.

ThomasWaldmann commented 9 years ago

@alraban thanks a lot for extensive testing and reporting your results.

a few comments:

compression: I guess one can't do much wrong with enabling lz4 in almost all situations (except maybe when having a crappy CPU and potentially great I/O throughput at the same time). lz4 has great throughput.

100% cpu on destination machine: that's strange, as it is just running ssh and borg serve, storing stuff into the repo and updating the repo index. ssh and borg can even use one core each. Did you see which process was eating 100%, was it borg? Do you have disk or fs encryption (dm-crypt, ext4 encryption, ...) enabled on the repo device?

I'd rather have expected the throughput limit to come from borg currently being single-threaded (waiting for read(), fsync(), not overlapping I/O with computations, no parallel computations).

alraban commented 9 years ago

No encryption, but you're right that I may have misread the evidence. Looking closer at my munin graphs, the I/O wait is a pretty significant part of the picture. No single process was eating 100% CPU; the total system CPU usage stayed at almost exactly ~100% (out of 200%) with a load of 1.2.

According to munin, over the course of the backup the total CPU "pie" was on average 11% system, 33% user, 44% i/o wait, 9% misc, and 102% idle. Looking at top occasionally throughout, I don't think I ever saw borg or sshd get much above 20% or 25% each.

So I may have been led astray by the ~100% CPU usage, which looked suspiciously like one core working by itself, when in reality the story is a little different.

If I can provide any other data, let me know.

ThomasWaldmann commented 9 years ago

ok, thanks for the clarification. so I don't need to hunt for cpu eaters in borg serve. :)

ThomasWaldmann commented 8 years ago

I am closing this one. At least 2 multi-TB tests were done, and nothing special (that is borgbackup related) was found.

More multi-TB tests are appreciated, either append them to this ticket or send them to the mailing list.

dragetd commented 8 years ago

Following setup:

Plus

So far no issues encountered! :-)

alraban commented 8 years ago

Just wanted to check back in; I've now been using borg for daily backups of ~10TB of real data for about 9 months (since my post above from last October). In that time I've had to do a few partial restores and I do daily pruning. The repo as a whole is just under 12TB total at this point and contains 12 archives at any given time. All the restores have gone perfectly (albeit very slowly, typically about 1/10th the speed of the initial backup). I also did some checksumming for comparison on a few occasions and everything checked out perfectly.

Performance for the daily backups is quite good considering the volume of data. As noted upthread, the initial backup took a few days, but the dailies only take about an hour on average (less if nothing much has changed). The data being backed up contains some smaller borg repositories that hold hourly backups from my workstations, and there seem to be no issues with repos inside other repos (all "borgception" restores successful). I've been using lz4 compression and haven't been using encryption.

I'm happy to provide any additional performance or other information if that would be helpful, but from where I sit it's mostly a good news story: no data issues, successful restores, flexible incremental backups with good deduplication, and reasonable backup speed.

The only really serious scalability issue is that restore speed (at least for the limited partial restores I've done) is really quite slow. I was seeing between 10GB and 20GB per hour. If that's representative, I expect a full restore of my dataset would take between three and six weeks! That wouldn't fly in production, but for my uses it could potentially be tolerated (especially if the alternative is total data loss). For insurance, I still take a conventional rsync backup at intervals so I can fall back on that in a total catastrophe.

But any backup system that (eventually) returns your data intact is a successful one in my book; and being able to reach back and grab individual files or directories as they existed months ago is fantastic. Thanks for all your good work on this, borg has already saved my bacon more than once :-)

enkore commented 8 years ago

Thanks for your testing ^W usage report! :)

10-20 GB/hour = a couple MB/s -- How did you restore, FUSE or borg extract? FUSE in 1.0 has the issue that partially read chunks aren't cached, so short reads from applications typically have linearly worse performance than longer reads. See #965 for details; speedups on the order of ~60x are typical with that changeset, since many applications do 32k reads (~2 MB / 32k ≈ 60).

alraban commented 8 years ago

I restored via FUSE because I needed to browse a bit and use some scripts to get the exact files I needed. Do I understand you to be saying that extract is notably faster?

In any case it sounds like FUSE performance will improve dramatically in the near future; a 60-fold increase in speed would be quite welcome and would effectively make the restore time more or less symmetrical to the backup time.

enkore commented 8 years ago

Do I understand you to be saying that extract is notably faster?

Yes. This affects not only things like shaXXXsum, but also e.g. KDE/Dolphin, which also does 32k reads and thus sees the 60x slow-down in 1.0. If you have large files you wish to extract over FUSE, then using dd with a large bs parameter (>2 MB) should give the same fast result as borg-extract, even in 1.0.
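
For example (just a sketch; repo path, archive name and file paths are placeholders):

borg mount /path/to/repo::2015-10-20 /mnt/borgmount
dd if=/mnt/borgmount/home/user/bigfile.iso of=/restore/bigfile.iso bs=4M
fusermount -u /mnt/borgmount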

Extraction via the fixed / worked-around FUSE code (1.1+, not yet released) or borg-extract should be as fast as create or faster (since decompression is normally faster than compression, extraction doesn't need to do any chunking, and reading from disks is typically a bit faster than writing).

ThomasWaldmann commented 8 years ago

From maltefiala in #1422:

""" A long time ago there has been an issue regarding testing with lot's of data. I successfully finished a 20TB backup of real life data over the network with success this week.

That's all folks, m """

tjakobi commented 8 years ago

Hardware backup server:

Borg:

Total time for the initial snapshot was around 3 days. A consistency check takes around 12 hours. Actually, I'd like to see a performance gain here in the future, because I'm thinking of verifying the whole backup after each addition. I'm now running the backup every night at 3 AM, which takes between half an hour and several more hours if we had a massive data intake.

I'm currently using the following command line:

/usr/local/bin/borg create --exclude-if-present .nobackup --list --stats -C lz4 -p --exclude-from /root/borg.exclude /mnt/borg::{now:%Y-%m-%d_%H%M-}{borgversion} /beegfs/ 1> /root/borg.stdout 2> /root/borg.stderr

I'm kind of wondering why the stderr file is full of the \r printout from borg. I thought only specifying --stats and --list would give me a nice stats table at the end without any other output.

I was able to mount a snapshot with fuse to recover one accidentally deleted file. I was also able to directly restore a test set of 1.5TB of data in roughly 300 minutes. My back-of-the-envelope calculation for a full restore is around 5-6 days... although I hope I'll never get into that situation.

Overall I have to say I'm really happy with how easily everything went; especially the deduplication and its performance amaze me (we're storing mostly genomic data, i.e. text files and compressed text files).

------------------------------------------------------------------------------  
Archive name: 2016-09-24_0300-1.0.7
Archive fingerprint: aaf948e292846bf5ed2a86a6d5b7977717479be42414faed2c715463dfdf17d4
Time (start): Sat, 2016-09-24 03:00:02
Time (end):   Sat, 2016-09-24 03:36:06
Duration: 36 minutes 4.22 seconds
Number of files: 1997161
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:               41.42 TB             36.53 TB            100.70 GB
All archives:              200.64 TB            177.69 TB             33.75 TB

                       Unique chunks         Total chunks
Chunk index:                15888554             83570341
------------------------------------------------------------------------------

Anyway, would the developers recommend doing a check after each run with archives this large? I'd really like to be sure that I'm not writing garbage every night :)

enkore commented 8 years ago

The segment checking performed by borg-check verifies a CRC32 over the data. I would expect that on a "good and proper" server like yours - with ECC memory, RAID 6, proper RAID controllers and maybe ZFS(?) - this is relatively unlikely to catch problems (in the sense that all the other checksumming should catch them first). It doesn't hurt, though.

What might be interesting here is an incremental check, i.e. only checking data written since the last time borg-check ran. This would allow verifying that everything was written correctly, while still allowing a full check from time to time to detect silent data corruption.

You may also find the new --verify-data option interesting: it uses two rounds of HMAC-SHA-256 (for now, when encrypted) on the data, which detects tampering with probability ~1.
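
For reference, a minimal invocation against the repo path used above, assuming a borg version that already has the option (1.1+):

borg check --verify-data /mnt/borg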

I'm kind of wondering why the stderr file is full of the \r printout from borg. I thought only specifying --stats and --list would give me a nice stats table at the end without any other output.

-p enables progress output. Depending on how you view the file this may be non-obvious.
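
Dropping -p from your command line avoids that output entirely; to read a log that is already full of \r updates, something like this (using the stderr path from your command) works:

tr '\r' '\n' < /root/borg.stderr | less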

Thanks for the report! :) Really nice setup you got there

ThomasWaldmann commented 8 years ago

@tjakobi thanks for the feedback, biggest known borg backup yet. \o/

About --verify-data: we have to admit that while it is a very thorough check, it may also be very slow. I ran it recently on my little 400GB repo and it took days to complete, with little CPU load. It is a rather weak server, though, so yours might have better throughput.

I analyzed it and found that its slowness is caused by accessing chunks in index order (which is kind of random) and not in sequential on-disk order. I have begun working on optimizing that to work in disk order...

tjakobi commented 8 years ago

The segment checking performed by borg-check verifies a CRC32 over the data. I would expect that on a "good and proper" server like yours - with ECC memory, RAID 6, proper RAID controllers and maybe ZFS(?) - this is relatively unlikely to catch problems (in the sense that all the other checksumming should catch them first). It doesn't hurt, though.

We're using an LSI MegaRAID SAS 2208 controller with a battery backup unit, and the server is protected by a UPS. The file system is actually just plain ext4, for unrelated reasons. The data is fed into the backup server from a BeeGFS file server cluster over InfiniBand, which is also backed by identical LSI controllers with battery backup units and a UPS. I'd say it's a pretty stable setup so far.

An incremental check option seems definitely interesting. So far I would trust my setup enough that data that has already been checked stays healthy. What would happen in case a chunk turns out to be faulty? Is the whole archive unusable or just the part of the archive contained inside the broken chunk?

Also: thank you very much for your support. I'll try to contribute insights from my side whenever necessary.

enkore commented 8 years ago

What would happen in case a chunk turns out to be faulty? Is the whole archive unusable or just the part of the archive contained inside the broken chunk?

It depends(tm)

As always, when metadata is corrupted the effects are often more drastic than with simple data corruption; however, metadata is usually small compared to data, so it will be hit less frequently (by random issues).

If check detects a broken data chunk, it will be marked as broken, and that part of the file will read as a string of zeroes. Should the same chunk be seen again during a later borg-create operation, a subsequent check will notice that and repair the file with the new copy of the chunk.

When a metadata chunk is corrupted, the check routine will notice that as well and cut the corrupted part out, continuing with the next file/item whose metadata is uncorrupted. With the default settings, each metadata chunk usually contains somewhere between ~500 files at most (short paths, no extra metadata, small files) and as few as one file (very large files, lots of xattrs or ACLs, very long paths).
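
The repair pass that applies these fixes is the borg check --repair mentioned later in this thread; a minimal invocation looks like the following (repo path is a placeholder; borg should ask for confirmation first, since repair is considered experimental):

borg check --repair /path/to/repo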

An incremental check option seems definitely interesting.

I created a ticket for it: #1657

enkore commented 8 years ago

Regarding check performance: larger segments should improve it. You can change this on the fly in the repository config file by appending two zeroes to max_segment_size (i.e. going from the ~5 MB default to ~500 MB). This is only effective for newly created segments, though.
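
A sketch of the resulting line in the repository's config file (assuming the value is stored in bytes, so the ~5 MB default of 5242880 becomes 524288000 with two more zeroes appended):

[repository]
max_segment_size = 524288000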

We might also improve performance a bit by enabling FADV_SEQUENTIAL when reading segments.

chrysn commented 7 years ago

an error recovery story (hope that fits here): i've recently had errors similar to those in jborg/attic#264 on a remote repository operational since summer 2015, with semi-regular backups of about 200gb of developer machine data.

problems showed up when pruning (unfortunately i lost the error message), then a check gave dozens of lines similar to the following (only the byte string differing):

Remote: Index mismatch for key b"p\x98\xfd4\xb4\x1a'z8G\xc4\xd3mq\xe0h\x9f:\xb9\\c\x0c}O\x92Z\x94\xa1\rD\xfd\x99". (24310, 4681958) != (-1, -1)

contrary to what's described in jborg/attic#264, a borg check --repair appears to have succeeded, as later checks completed without errors. both the repairing check and the next backup run took a long time (about 24h for the check, 8h for the backup) and seemed to download large portions from the remote side, but otherwise worked without errors.

the borg version currently in use on both ends is 1.0.7. the output of the latest backup run (for purposes of judging size):

------------------------------------------------------------------------------           
Archive name: chrysn-full-2016-10-27-08:39                                               
Archive fingerprint: f2742bfad99f3cdfc42c65013c4e3a5a93a962d11e26e16902f46deb31423e2c                      
Time (start): Thu, 2016-10-27 08:39:57                                                   
Time (end):   Thu, 2016-10-27 17:36:19                                                   
Duration: 8 hours 56 minutes 22.71 seconds                                               
Number of files: 2033840                                                                                      
------------------------------------------------------------------------------           
                       Original size      Compressed size    Deduplicated size           
This archive:              187.71 GB            187.79 GB            906.31 MB           
All archives:               10.52 TB             10.53 TB            275.17 GB           

                       Unique chunks         Total chunks                                
Chunk index:                 2239911            112276923                                
------------------------------------------------------------------------------           

ThomasWaldmann commented 7 years ago

@chrysn were these backups made with attic or borg? if the latter: which versions?

chrysn commented 7 years ago

all created with borg, possibly as early as 0.23 (although no archives from back then have been in the repository for some time; i've probably used most debian-released versions since then).

ThomasWaldmann commented 7 years ago

@chrysn ok, maybe it was the problem that is described at the start of changes.rst, which was fixed in 1.0.4.

ThomasWaldmann commented 7 years ago

@chrysn hmm, how often do you run check? could it be a pre-1.0.4 problem (and you check infrequently) or did it happen after 1.0.4?

chrysn commented 7 years ago

i haven't run check in ages, it can easily be a pre-1.0.4 problem. my main point of reporting this was probably less about the issue still happening than about check --repair nowadays being able to fix the issue.

ThomasWaldmann commented 7 years ago

@chrysn ok, so let's hope it was one of the issues fixed in 1.0.4. :)

ThomasWaldmann commented 7 years ago

Some blog posts about a bigger-scale borg deployment: https://irregularbyte.otherreality.net/tags/Borg/

makmanalp commented 6 years ago

As always, when metadata is corrupted the effects are often more drastic than with simple data corruption; however, metadata is usually small compared to data, so it will be hit less frequently (by random issues).

@enkore Perhaps the metadata here (being important and relatively small) is a great candidate for par2 or some other erasure code with a large redundancy level or something!
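
To make the idea concrete, here is a purely external illustration (not something borg does itself) of par2-style parity, applied to the small repo-level files from the repository listing earlier in the thread; the 20% redundancy is arbitrary. Protecting the archive metadata chunks themselves would need support inside borg, since they are interleaved with data in the segment files:

cd /mnt/5TB/borg
par2 create -r20 repo-meta.par2 config hints.* index.*
par2 verify repo-meta.par2

Note that hints.* and index.* are rewritten under new transaction numbers after each commit, so such parity files would have to be regenerated after every backup.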

ottojwittner commented 7 months ago

My company has decided to apply borg (and borgmatic) as the backup system for a number of Linux and NetBSD servers. Most servers are "happy", however one server holding a very large number of smaller files is struggling to complete its first backup. A session, applying borg over ssh, runs for hours and then suddenly breaks down. We have enabled debug logging on both ends (via BORG_LOGGING_CONF on the server side) but struggle to understand what makes the session break. Neither borg nor ssh seems to report much.

Any suggestion to how we may debug this situation better?

ThomasWaldmann commented 7 months ago

@ottojwittner please open a new ticket and provide the borg version, the borg output when it "breaks down", and the other information the issue template asks for. Maybe it is an ssh / network issue. In general, you can just restart using the same borg command and it will progress over time, deduplicating against what it already has in the repo.
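
Not something prescribed in this thread, but if it does turn out to be an idle ssh connection getting dropped, ssh keepalives via BORG_RSH are a cheap thing to try (the option names are standard OpenSSH; the values are just examples):

export BORG_RSH='ssh -o ServerAliveInterval=30 -o ServerAliveCountMax=6'
# then re-run the same borg create command; it will deduplicate against
# what is already in the repo and make progress across successive runs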