borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/

Add --files-from and --files-from0 options #841

Closed: ThomasWaldmann closed this issue 3 years ago

ThomasWaldmann commented 8 years ago

See: https://github.com/jborg/attic/pull/321

Seems like a good feature - opinions? Code review?

level323 commented 8 years ago

I would really appreciate this feature added to borgbackup.

Ape commented 8 years ago

I can understand that --files-from can be useful, but in which case would --files-from0 be better?

enkore commented 8 years ago

Mainly binary file names (which are indeed used) and files named from "free text" fields by software. In any case, such support is essential for any software which seriously wants to support "what POSIX demands".

pepa65 commented 8 years ago

--files-from0 because file names can contain newlines
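
For illustration, here is a minimal Python sketch of the NUL-separated list format such an option implies - the same convention used by `find -print0`, `tar --null -T`, and `rsync --from0`. The source directory and output filename are placeholders:

```python
# Sketch: write a NUL-separated file list, the format a --files-from0
# style option would consume. NUL is the one byte that cannot appear in
# a POSIX path, so names containing newlines are handled safely.
# "/home/user/data" and "filelist.bin" are placeholder names.
import os

with open("filelist.bin", "wb") as out:
    for root, dirs, files in os.walk("/home/user/data"):
        for name in files:
            path = os.path.join(root, name)
            out.write(os.fsencode(path) + b"\0")
```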

jim-collier commented 4 years ago

I just realized borg doesn't have this feature, which means I don't think I can use it. (Dang - spent days getting it set up.)

When I did my initial evaluation of backup software about a year ago, such a feature was in my top five list of requirements. For some reason, I marked borg as "yes" for this feature. I'm only now discovering it doesn't have it - which, sadly, is a deal-breaker.

I realize that --patterns-from could kind of be hacked into doing something similar, with the pp prefix. However, that feature is marked "experimental", and with tens of millions of files to back up, I wouldn't feel confident using it.

The problem I'm trying to solve is creating smaller batches of files to back up, rather than the whole volume all at once - so that each batch has some hope of completing before a restart, crash, extended power failure, or network interruption. (I realize that an interruption is not the end of the world for borg, but for other reasons too in-depth to describe here, this is the path I've chosen to tackle the problem.) Problem is, building these smaller batches of files is well beyond the ability (and indeed scope) of borg, hence the need to build them with other tools, and the need for a --files-from feature.

Any chance such a feature may get added in the near future?

ThomasWaldmann commented 4 years ago

Why don't you partition your stuff based on a few starting directories?

You could also try the pf pattern; it is rather efficient (compared to other pattern types) because it uses a Python dict.
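
As an aside, here is a rough, hypothetical sketch of what feeding an externally generated file list through `--patterns-from` with `pf:` could look like, assuming the patterns-file syntax from the borg 1.1 docs (`R` root lines, `+`/`-` include/exclude lines, first match wins). The filenames are placeholders, and the catch-all exclude should be verified against the patterns documentation before trusting it with real data:

```python
# Hypothetical sketch: convert an externally generated list of absolute
# paths into a borg --patterns-from file that includes exactly those
# files via pf: (full-path match) and excludes everything else.
with open("wanted_files.txt") as src, open("patterns.lst", "w") as dst:
    dst.write("R /\n")                   # recursion root
    for line in src:
        path = line.rstrip("\n")
        if path:
            dst.write(f"+ pf:{path}\n")  # exact full-path include
    dst.write("- re:.*\n")               # catch-all exclude (assumed to work; check the docs)
# Note: this plain-text format cannot represent paths that contain
# newlines - exactly the case --files-from0 is meant to handle.
```

Presumably this would then be invoked as `borg create --patterns-from patterns.lst repo::archive` with no explicit PATH arguments, since the `R` line supplies the root; whether that holds up with millions of `pf:` lines is exactly the open question raised elsewhere in this thread.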

ThomasWaldmann commented 4 years ago

BTW, when thinking about backup schemes, always keep in mind that borg always does full backups. "Full" here means the full input dataset, not necessarily the complete disk or filesystem.

But in any case, it is not meant to be used in a full+incremental way (if you look at prune, you'll see it'll just kill any archive that does not match the retention rules).

jim-collier commented 4 years ago

The problem isn't that restic always does "full backups". That's a good thing.

The problem is the first full backup. With 15 TiB to back up, even with 950 Mb/s upload (10 MB/s max from restic to Backblaze B2), it absolutely will fail before finishing - especially with rolling blackouts and PSPs. The check/rebuild-index/recover process is proving quite time-consuming as the repository grows, which just shortens the time before the next outage by that much. (At least it doesn't seem to chew up too much CPU/disk/bandwidth?)

So by "seeding" it with smaller partial backups - say <1tb each, with most of them completing without interruption (and the ones that fail being fixed more quickly), then a final "full" backup and daily full backups will then be very quick since almost all of the data is already there.

In fact, I'm almost done with a tool that scans the backup source, applies regex exclude rules, re-orders the files based on a ranking algorithm that takes size and date into account (newer = ranked higher, smaller = ranked higher), and then chunks them into separate lists of files that don't exceed a specified total size. (I need this anyway outside of Restic, but it could come in very handy.) It could very easily be updated to use the pf matching pattern... if that can handle millions of files? The docs say something about a "large file list", but such adjectives often don't mean the same thing that I mean ;-)
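
For what it's worth, here is a stripped-down Python sketch of that kind of ranking/chunking pass (the rank formula, exclude regexes, and ~1 TB batch limit are illustrative placeholders, not the actual tool):

```python
# Rough sketch of the ranking/chunking idea described above: newer and
# smaller files rank higher, and ranked files are packed into batches
# that stay under a size budget. All constants are placeholders.
import os
import re
import time

EXCLUDES = [re.compile(r"/\.cache/"), re.compile(r"\.tmp$")]
BATCH_LIMIT = 10**12  # ~1 TB of source data per batch


def rank(st, now):
    age_days = max((now - st.st_mtime) / 86400, 1.0)
    size_mib = max(st.st_size / 2**20, 0.001)
    return 1.0 / age_days + 1.0 / size_mib  # newer and smaller => higher


def batches(root):
    now = time.time()
    ranked = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if any(rx.search(path) for rx in EXCLUDES):
                continue
            try:
                st = os.stat(path, follow_symlinks=False)
            except OSError:
                continue
            ranked.append((rank(st, now), path, st.st_size))
    ranked.sort(reverse=True)  # highest rank first
    batch, used = [], 0
    for _score, path, size in ranked:
        if batch and used + size > BATCH_LIMIT:
            yield batch
            batch, used = [], 0
        batch.append(path)
        used += size
    if batch:
        yield batch
```

Each yielded batch could then be written out as one file list for whatever --files-from-style mechanism ends up existing.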

In the meantime, after careful analysis, filtering on file extension is the best way to limit each seed chunk to a manageable size. Starting with different subdirectories, while not a bad idea in general, won't work in this scenario, and rearranging data isn't an option.

knutov commented 4 years ago

@jim-collier I suppose the described problem does not exist - borg automatically creates checkpoints during backup, so you will not have to re-upload already-uploaded data if the backup process is interrupted in any way.

By default a checkpoint is made every 5 minutes; you can change this on the command line with --checkpoint-interval SECONDS: https://borgbackup.readthedocs.io/en/stable/faq.html#if-a-backup-stops-mid-way-does-the-already-backed-up-data-stay-there

You can also filter files during the initial backup - for example, skip big media files (filtered by extension) to speed up the backup of the really important files.

jim-collier commented 4 years ago

@knutov There's a secondary reason for smaller backups, that I didn't go into in order to avoid making the discussion more complex: Backing up the most important - and the most - data first.

Since:

  1. completing the first backup is going to take so long (but a far cry better than the 8 years for Crashplan), and
  2. all of our older data is already safely backed up on Crashplan,

I need to prioritize the Borg backup by:

  1. Newest
  2. Smallest

The reason for weighting smaller files higher is that, after careful analysis, this seems to be broadly true: if a file is actually valid user data (as opposed to a cache or system metadata file, which can and should be accurately excluded), then any given file is of equal importance, regardless of size, all else being equal.

Axiomatically, by prioritizing smaller files first, you (or at least we) get more units of "important data" (i.e. number of files) backed up per given unit of time. In our case, significantly so.

This is why I really need to be able to point to an externally-generated list of files to back up, in the specific order I need, generated with other tools based on those criteria and custom weighting. Which is why I'm looking into --patterns-from and the pf selector. But the list is going to be huge - millions of files - and the "experimental" status makes me uneasy.

In the meantime, selecting by file extension is a short-term kludge that roughly reaches the same goal. (As a simplified example with only four types of data: back up .xlsx files, then .docx, then .psd, then .vmdk, then unfiltered, to sweep up the stragglers.)

knutov commented 4 years ago

It looks like you are still trying to solve a problem that does not exist.

Just do the backup with borg once; all subsequent backups will be very fast.

jim-collier commented 4 years ago

@knutov You're not understanding or appreciating my requirement to back up the most important data first (rather than spending weeks on, say, VM images, which need to get backed up eventually, but last - one example of many; too many, out of 15 TiB, to chase down in any reasonable timeframe). Maybe if you were surrounded by wildfires, or had a hurricane barreling down on you, that requirement might seem more relevant.

That's OK. I don't mean this in any kind of mean way, but the stakes for your appreciating this random user's top requirements couldn't be lower. :-)

Only to say that just because you don't understand and/or agree with another user's requirements (and/or the underlying reasons, which may not be exhaustively explained) doesn't mean it's a "non-existent problem". That kind of response belongs to a general family of unconstructive feedback which, as a rule, is well understood to waste time and space on "expert forums" such as Stack Exchange, or this one.

enkore commented 4 years ago

Just... create an archive with the important data and one with all the rest? I've been doing that since 2016 and it just works. Besides, checkpointing in Borg also "just works". Note that you can extract from checkpoint archives.

I get that it might be annoying to figure out where the important bits are, especially with a large dataset and if everything is interleaved instead of certain heaps being neatly separated (like having all VM images in /data/vms). On the other hand, Borg is already a huge complicated mess/maze of interacting options, especially in the "input files" department.

jim-collier commented 4 years ago

@enkore Not a bad idea - a pretty good one, actually, especially since, at best, borg is only able to utilize a fraction of my available bandwidth (950 Mb/s up), and parallelizing that would be a good thing. I'll think about how I could logically segment my data without moving things around on-disk or other non-starters... which might actually wind up relying on the same solution I'm having to resort to anyway. (And which would benefit even more from a --files-from option.)

I've used a wide variety of backup and sync tools, given many more pretty thorough trials, and read through the docs of an order of magnitude more on top of that. Some form of --files-from is a pretty universal option across a wide swath of problem domains; in fact, among all the possible features and flags, I'd wager it is one of the most common. Even rsync has it, and that one feature alone - despite rsync's ridiculous mess of options, its [arguably] disastrous include/exclude ordering rules, and its glaring lack of regex support - makes rsync actually pretty usable, dare I say delightful. (And, in my biased opinion based on my own experiences and history, that supports the argument for adding the flag. Additional flags don't always make things more complicated; sometimes - granted, almost certainly a significant minority of the time - adding a feature via an optional parameter can cut through the clutter and make the tool significantly simpler to use.)

knutov commented 4 years ago

@jim-collier I do daily backups of much bigger datasets from 10+ hosting servers, and I believe I understand the problem you think you need to solve. So, let's do some math based on my experience.

We have a 1 gigabit link between servers and a fast enough storage system - we mostly use ZFS raidz2 on rotational disks with an optional SSD ZIL, or raidz on SATA SSD storage, so we can always write at least 200-500 megabytes per second. A gigabit link gives about 90-120 megabytes per second depending on latency, RTT and TCP window size. Let's round to 100 MB/s.

So the initial backup (ignoring compression and deduplication) should take 15,000,000 MB / 100 MB/s = 150,000 seconds = 2,500 minutes = about 41.7 hours = about 1.74 days.

Less than two days. Is that much if it's only the first backup (and you can keep backing up your current way in parallel during this)?

You mentioned VM images, so your situation may be similar to our "hosting scenario". In real life you can expect a deduplication factor of at least 2-4 on the initial data (let's round down to 2, although 6-8 or more is more likely for VMs at 15 TB).

Compression with zstd,3 will give around 1.2x for mixed data in most cases.

1.74 days / 2 / 1.2 is about 0.72 days, or roughly 17 hours. Do you think that's still too long?
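
The same arithmetic, spelled out (the dedup and compression factors are the assumptions above, not measurements):

```python
# The estimate above, spelled out. The 100 MB/s link speed and the
# 2x dedup / 1.2x compression factors are the assumptions stated in
# this comment, not measured values.
data_mb = 15_000_000          # 15 TB expressed in MB
link_mb_per_s = 100           # rounded effective gigabit throughput

raw_seconds = data_mb / link_mb_per_s      # 150,000 s
raw_days = raw_seconds / 86_400            # ~1.74 days

effective_days = raw_days / 2 / 1.2        # apply dedup and compression
print(f"{raw_days:.2f} days raw, {effective_days:.2f} days "
      f"(~{effective_days * 24:.0f} h) after dedup+compression")
# -> 1.74 days raw, 0.72 days (~17 h) after dedup+compression
```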

Well, let's assume you do not want to do backups your current way (Backblaze is really slow) and want to back up important data as fast as possible.

Borg does not have a simple way to back up only files selected by size or date, but it does have a simple way to exclude files from a backup. You can generate a list of files to exclude (by extension and by size) and make the initial backup of all files with that exclude list, then a second backup without it.
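
For instance, a small Python sketch that builds such an exclude file for `--exclude-from` (borg reads one exclusion pattern per line; literal paths and fm-style globs like `*.iso` both work as exclude patterns). The extension set, size threshold, and paths are placeholders:

```python
# Sketch: build an exclude file for `borg create --exclude-from big.lst ...`
# so the first pass skips bulky, low-priority data. The extension set,
# size threshold, scan root and output filename are placeholders.
import os

BULKY_EXT = {".iso", ".mkv", ".vmdk"}
SIZE_LIMIT = 500 * 2**20  # skip individual files larger than ~500 MiB

with open("big.lst", "w") as out:
    # extension-based exclude patterns (default fm style, as in --exclude '*.tmp')
    for ext in sorted(BULKY_EXT):
        out.write(f"*{ext}\n")
    # plus literal paths for any other oversized files
    for root, _dirs, files in os.walk("/data"):
        for name in files:
            path = os.path.join(root, name)
            try:
                if os.path.getsize(path) > SIZE_LIMIT:
                    out.write(path + "\n")
            except OSError:
                continue
```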

Any subsequent backup will be very fast; in our "hosting scenario" the daily compressed, deduplicated delta is around 50 MB per 50 GB, or less.

jim-collier commented 4 years ago

@knutov Thanks for the feedback and description of your scenario.

For local backups over gigabit ethernet, I just mirror via rsync (while excluding junk, and with some additional hardlink magic that lets it handle renames and moves without retransmission). ZFS on the receiving end auto-snapshots, which takes care of versioning. I don't get anywhere near theoretical throughput - mostly due to rsync not doing any chunking of small files. It varies from KB/s to ~90 MB/s per file, with a fair amount of other rsync overhead in between. I haven't actually metered the NIC itself; I should do that - now I'm curious.

Either way, I have Borg backing up to Borgbase, over gigabit fiber with 950 Mb/s sustained raw throughput. I'm seeing a fairly sustained average of only about 10 MB/s. According to everything I've read, that's due to limitations in Borg's parallelism. (The machine, array, and data connection together have tested significantly faster.) So just going by your math (which seems reasonable, albeit optimistic), that's about 17.3 days at that rate - more realistically probably 20, as a rough, low-precision estimate.

Most of my data is already as compressed as it can get - either by nature of the file format, or because it's encrypted. My multiple VMs share no identical bits. (But again - dead last in priority. I don't back up any OS-related files, with the very low-priority exception of VMs, and even then it's just barely worth the additional storage cost. My policy is to just reinstall and reconfigure OSes and apps, except for several specialized VM images.)

20 days might not seem like a big deal in a highly controlled corporate datacenter scenario - especially with the primary on a redundant Btrfs array and the local backup on a ZFS array of 3-way mirrors - but we're literally surrounded by wildfires that keep escaping containment, plagued by rolling blackouts, and facing the threat of prolonged, multi-day "PSP" power outages. Our UPSes are routinely exhausted. More batteries and/or gennies aren't on the table, nor really necessary.

So that context may better illuminate why I've been prioritizing the requirements I have. :-)

BTW my ZFS array doesn't perform nearly that well. It isn't intended to; its only function is a local mirror. Redundancy and reliability first, efficiency second; performance a distant third. (It actually used to perform quite well, when it was in a proper server chassis with three SAS cards, a proper server board, SSD ZIL, dual CPUs, ECC, etc. But it was too finicky, produced too much heat and noise, and consumed too much juice for what it was being used for. Now it's running on 100% older commodity hardware, and has never been more reliable.)

knutov commented 4 years ago

The rsync scenario you describe looks similar to rsnapshot. It's definitely not optimal because of the per-file operations on many small files - the IOPS and data delta are much bigger than what borg would produce.

We built a service similar to Borgbase (currently in closed beta), and a "sustained average of only about 10 MB/s" is all about hardware and money. With our service we can always rely on at least 300 sustained megabits even now, under heavy load from multiple clients, and on the full gigabit for a freshly set-up storage server.

With 15 TB of data it looks reasonable to rent your own "storage server" for backups directly in some datacenter, so you will be able to utilize its full bandwidth and disk IO.

ZFS performance is always about reading the documentation and forums, searching around, and using proper hardware; poor ZFS performance is mostly about wrong ZFS configuration. We use cheap server-grade hardware (no SAS, no hardware RAID, etc.) and it's not a problem to saturate one gigabit (though it's much harder when you want to serve 10-40 gigabits, of course).

enkore commented 4 years ago

> Either way, I have Borg backing up to Borgbase, over gigabit fiber with 950 Mb/s sustained raw throughput. I'm seeing a fairly sustained average of only about 10 MB/s. According to everything I've read, that's due to limitations in Borg's parallelism. (The machine, array, and data connection together have tested significantly faster.)

Even by Borg standards in a somewhat IO-constrained environment, that's unusually slow. Does it actually use a significant amount of CPU, or is all the time spent waiting on something?

jim-collier commented 4 years ago

@knutov I've been using and tweaking ZFS since 2008; you might have to take my word for it that, even on commodity hardware with currently no SSD caching, it's not the bottleneck. :-)

And now I feel old.

Edit: Also, local rsync performance isn't an issue at all. Pushing large files gets close enough to gigabit for me - reading from btrfs on one machine and writing to ZFS on another - and it goes as fast as it can when millions of small files are involved. If it were a problem I'd implement some other solution, but this has worked fine for over a decade.

@enkore I don't have the time to re-find it, but I've seen discussions about borg and Borgbase, and that figure seems pretty typical. I can't personally speculate; I can only parrot what I recall reading, which is an apparent consensus that borg doesn't do enough parallel operations - specifically, connections to the remote server. With only 1 or 2 TCP connections and apparently no option to increase that, slow throughput to a remote server isn't unexpected, given TCP's horrible congestion control.

It's not remotely CPU, memory, network, or disk-IO bound - loads of headroom all around. It's obviously spending most of its time waiting.

As I mentioned before, rsync on local network can hit 90 MB/s (both directions) to/from this server. SMB sustains >100 on larger files. UDP streaming over the internet to nearby servers peaks at 950 Mb/s and sustains over 900 indefinitely, with 7ms ping to nearby servers that I've never seen higher than 10ms.

I know little about fine-tuning borg, so that certainly may be an issue. But tuning ZFS, Linux, and networking - got that covered, not an issue. (Borg's interaction with my networking stack could be another issue, but I've only made standard, modern, standards-compliant minor tweaks that no other software or remote services have a problem with.)

But either way, I don't find 10 MB/s backup speeds objectionable - a far cry better than Crashplan, at least. I only noted the estimated 20-day full backup in the context of needing to prioritize the more important data (and more files by count, thus smaller ones) first, not to complain about it. 20 days for 15 TiB is acceptable to me!

And thanks for the thoughtfulness and discussion.

ThomasWaldmann commented 3 years ago

fixed by #5538.