borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/

Many servers without cache rebuilding #916

Open yhack opened 8 years ago

yhack commented 8 years ago

Is it reasonable to use Borg at scale, backing up many servers to one repository? I'm afraid the need to rebuild the local cache is a blocker for this use case.

I may be way off base with this suggestion, but could the local cache be configured to store only the checksums from the local server? Sure, the servers may send some files to the repository that have already been hashed elsewhere, but no rebuilding would be required (not to mention lower space requirements for the local cache).


:moneybag: there is a bounty for this

ThomasWaldmann commented 8 years ago

@yhack yes, cache rebuilds will take quite some time. So the more servers you push into one repo, the more archives you have in the repo, the more time it will spend on rebuilding (and also the more space it will need on the backup clients for the archive index cache). There's a "borgception" idea in a ticket, but it is still todo.

About your idea: it's simply the case that the more you reduce the local "known chunks" cache, the more chunks it will unnecessarily process (compress, encrypt, transmit to the backup server).

level323 commented 8 years ago

@yhack, it's my hope that borg will better support backing up many servers to one repository in the not too distant future. This would be a killer feature from my point of view.

@ThomasWaldmann, re:

About your idea: it's simply the case that the more you reduce the local "known chunks" cache, the more chunks it will unnecessarily process (compress, encrypt, transmit to the backup server).

Does it really have to be the case that all of those costs (compress, encrypt, transmit) must be incurred if you reduce (or let go stale) the "known chunks" cache on the borg clients?

If the central borg server possesses an accurate and up-to-date known chunks cache (which it does, or at least could without great difficulty), then couldn't the comms protocol between borg client and borg server be modified to avoid many of these costs?

If the code is modified so that the borg client no longer trusts its own cache(s) (rather, assumes they are stale) and works on the basis that the borg server is the only authoritative source of info on what is contained in the repo, I see a fairly efficient way forward. Indeed, content may have been added or deleted from the central borg server repo (by a different client) since this particular client last communicated with it. Working on this basis, the borg client can still use its (stale) cache as a "first cut" on what needs to be sent to the central borg server (and what doesn't). Under this particular approach, before sending chunk data (or not), the borg client first advises the central borg server as follows:

OR

In each case the borg central server gets a chance to correct the borg client's understanding of what chunks it does and doesn't possess (if the client is wrong because of its stale cache). Recognising the authority of the borg central server, the client will then send only what the central borg server requires (and won't send what's not required). Furthermore, the responses from the server could then be used to freshen the borg client's cache (on something resembling a need-to-know basis - an optimised local cache update, if you will).

Under the above arrangement I could see how some unnecessary chunking will occur, but almost all unnecessary compression, encryption and transmission will be avoided, won't it? The extra two-way banter between client and server will involve a little metadata exchange (and associated bandwidth and latency), but overall it will be a lot less expensive than pointlessly transmitting whole chunks, yes?

I'm particularly interested in avoiding transmission of data that the central borg server already has, as my main desired use case is many machines backing up to a central borg server located remotely and accessible only via a relatively slow/expensive internet link.
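
A minimal sketch of the exchange described above (all names - ChunkServer, have_chunks, put_chunk - are hypothetical, not borg's actual API): the client uses its stale cache only as a first cut, lets the server correct it, and freshens the cache from the answers.

class ChunkServer:
    """Stand-in for the central repo: the only authoritative chunk list."""
    def __init__(self):
        self.store = {}

    def have_chunks(self, ids):
        # The server corrects the client's view: which of these ids exist?
        return {i for i in ids if i in self.store}

    def put_chunk(self, chunk_id, data):
        self.store.setdefault(chunk_id, data)


def backup(chunks, stale_cache, server):
    """chunks: iterable of (chunk_id, data) pairs from the chunker.
    stale_cache: set of ids the client *believes* the repo holds -
    a hint, never authoritative."""
    # First cut: skip everything the (possibly stale) local cache knows about.
    candidates = {cid: data for cid, data in chunks if cid not in stale_cache}
    already_there = server.have_chunks(set(candidates))
    for cid, data in candidates.items():
        if cid not in already_there:
            server.put_chunk(cid, data)  # compression/encryption would go here
    # Freshen the local cache from the server's answers (need-to-know update).
    stale_cache |= set(candidates)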

enkore commented 8 years ago

What you say is mostly true, and it can be implemented similar to this.

Under the above arrangement I could see how some unnecessary chunking will occur, but almost all unnecessary compression, encryption and transmission will be avoided, won't it? The extra two-way banter between client and server will involve a little metadata exchange (and associated bandwidth and latency), but overall it will be a lot less expensive than pointlessly transmitting whole chunks, yes?

Not necessarily. In LANs this would usually be the case (at least for 1 GbE), but over the internet the additional round-trips can significantly cut into the throughput; i.e. this also needs additional buffering to avoid that. Otherwise it should be ok.
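
A back-of-envelope illustration of this point, with assumed numbers (50 ms RTT, ~100 Mbit/s link, ~512 KiB average chunks): a blocking ask-then-send loop is RTT-bound, while batching the existence queries keeps the link saturated.

rtt = 0.05           # seconds round-trip, assumed WAN latency
chunk = 512 * 1024   # bytes, assumed average chunk size
bw = 12.5e6          # bytes/s (~100 Mbit/s), assumed link speed

# One existence query per chunk, waiting for each answer before sending:
naive = chunk / (rtt + chunk / bw)                      # ~5.7 MB/s
# Asking about 1000 chunk ids per round trip instead:
batched = 1000 * chunk / (rtt + 1000 * chunk / bw)      # ~12.5 MB/s, link-bound

print(f"naive: {naive/1e6:.1f} MB/s, batched: {batched/1e6:.1f} MB/s")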

level323 commented 8 years ago

@enkore I agree with your comment that appropriate buffering and/or multitasking will be required to fully utilise the slow internet link. But that's ultimately achievable and has been done before with many tools facing similar challenges.

yhack commented 8 years ago

@level323 This sounds like the ideal approach to me. In addition, the repository will need to support simultaneous updates from multiple clients. For example, the following event flow:

[client alpha] -> [borg central]   I'm planning to send you chunk XYZ, do you have it?
[borg central]                     (checks repo)
[borg central] -> [client alpha]   Nope, send it to me.

[client beta ] -> [borg central]   I'm planning to send you chunk XYZ, do you have it?
[borg central]                     (checks repo)
[borg central] -> [client beta ]   Nope, send it to me.

[client alpha] -> [borg central]   Here is chunk XYZ.
[borg central]                     (checks repo)
[borg central] -> [client alpha]   Got it. Thanks.

[client beta ] -> [borg central]   Here is chunk XYZ.
[borg central]                     (checks repo, sees XYZ already exists!)
[borg central] -> [client beta ]   I already have it. Thanks anyway.
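
In code, the server-side rule for this flow is just idempotent chunk storage. A sketch with hypothetical names (not borg's actual server code):

import threading

class Repo:
    def __init__(self):
        self._chunks = {}
        self._lock = threading.Lock()

    def put_chunk(self, chunk_id, data):
        with self._lock:
            if chunk_id in self._chunks:   # client beta's late duplicate
                return "already-have"      # "I already have it. Thanks anyway."
            self._chunks[chunk_id] = data  # client alpha's upload wins
            return "stored"                # "Got it. Thanks."
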
level323 commented 8 years ago

I want this :-)

USD $100 bounty posted.

ThomasWaldmann commented 8 years ago

Note:

level323 commented 8 years ago

@ThomasWaldmann I don't want this feature (or my bounty) to contribute to unhelpful feature creep or in any way push borg in the wrong direction. However, I feel (quite strongly) that this proposed ability is definitely not feature creep.

One of the central purposes of borg is to deduplicate backup data. Nowhere is the scope for backup data deduplication more apparent than the case of backing up files (particularly OS system files) from multiple machines to a central backup store.

I don't read the OP's comments (or my responses) as having any dependence on SSH per se. The relationship is with borg serve only, is it not?

As I see it, the concepts I've proposed in my responses to the OP (and that @yhack elaborates on) represent an important improvement in the design of borg which provides a significant extra feature (backing up from multiple hosts to a central server). However, if implemented thoughtfully, I think this change will move borg towards a more powerful, capable, modular client-server design with fewer code paths - and it may even be slimmer in total lines of code. Additionally, I cannot see how this change would limit the potential to improve borg in other areas - if anything, I think it is synergistic.

Note: Although below I advocate that borg's design be altered to an always-client-server arrangement (even when client and server are the same machine), this approach is by no means necessary. I'm only making the suggestion here and now because it seems to be a change which is synergistic with the OP's issue/feature request. At the end of the day I'll happily pay the bounty as long as the end is achieved (that is, no more expensive/wasteful/redundant cache syncing in multi-client-to-central-server arrangements).

Since the beginning of borg/attic's ability to back up to a remote machine, borg has tried to manage a cache on the client. The main function of this cache was to reduce comms between client and server. Implicit in this design decision is the assumption that the link between client and server is a speed and/or latency bottleneck. That assumption is correct, and borg has always been right in endeavouring to minimise that bandwidth/latency cost.

However, the present caching system assesses the cache in a black-or-white manner: either "the cache is fully up-to-date and useful" or "the cache is stale and useless". If the cache is detected as stale, the whole thing (more or less) is discarded and rebuilt as a separate procedure. There have been discussions about ideas to improve the speed of cache rebuild/sync and minimise the flow of redundant data, but real progress has been quite limited.

I would suggest that, in hindsight, previous discussions on improving cache sync have approached the problem with the wrong assumption under the hood - specifically, the assumption that a current cache is necessary. I contend that a 100% current cache isn't necessary at all to achieve highly effective latency/bandwidth reduction over slow links. In that sense, this proposed change could be viewed as improving the borg client's local cache management, with some wonderfully good side-effects.

The conceptual design change I propose is for the client to always assume the client cache is stale, but yet still potentially useful (as per @yhack and my earlier comments). In all use cases I can presently conceive of, this provides:

@ThomasWaldmann If you feel that a new issue needs to be raised to better clarify this proposed change/improvement to avoid some administrative problems you think may occur, please go ahead. Just please please please let this improvement move forward. This is the first time I've used bountysource - I don't know if it's possible to move my bounty to a new ticket but hopefully there's a way.
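
To make the contrast concrete, a tiny sketch (illustrative names only, not borg code) of a cache treated as a stale-but-useful hint rather than a black-or-white artefact:

class HintCache:
    """Local set of chunk ids the client *believes* the repo holds."""
    def __init__(self, ids=()):
        self.ids = set(ids)

    def first_cut(self, backup_ids):
        # Ids this backup touches that we still have to ask the server about;
        # a stale cache just means asking about a few extra ids - never a rebuild.
        return set(backup_ids) - self.ids

    def learn(self, confirmed_ids):
        # Merge the server's authoritative answers (need-to-know freshening).
        self.ids |= set(confirmed_ids)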

ThomasWaldmann commented 8 years ago

Read my comment again. It referred to starting from "we ask the server for chunk existence" and later extending to "we run multiple backups in parallel". That is what I meant by feature/scope creep.

level323 commented 8 years ago

@ThomasWaldmann Ahhh, gotcha. My apologies. I just revealed how deep my love for the ability to efficiently back up multiple clients to a single server was right there ;-) Don't tell my wife, OK?

I agree that making the central server able to handle multiple simultaneous client backup sessions is not necessary to enable a functional and workable multi-client-to-single-server backup solution.

I do agree that adding the ability for the server to handle multiple "simultaneous" client sessions is worth exploring, but in a separate ticket. As far as storing chunked data (and querying its existence) goes, I can see how the server shouldn't be troubled by whether such a query/request comes from client 001 or client 999 - so it's an interesting idea that probably should be explored separately.

enkore commented 8 years ago

I thought some more about this as well ;)

When creating archives, most data in the Cache is wholly irrelevant. Neither refcount nor csize (or size, really) matters. What the remote (borg serve) can see is only ID & csize anyway. Let the remote keep a list-like[1] index of id (maybe mapped to csize, who cares), and save deltas from newer to older transactions (in the form [old_n, current], [old_n-1, current], ...[2]). A client would only keep one of those and ask the remote for a delta of [client_transaction, current]. After that is downloaded and merged into what the client has, the client has an up-to-date list of available chunks.

What would be interesting in this scenario is when old deltas are thrown away. Could be done by

I haven't thought thoroughly about compaction here, but it shouldn't matter too much, except that it could make deltas unnecessarily large.

Note that in this scheme you only gain an advantage if many servers do borg-create but only one does borg-delete/borg-prune; the latter will always need a full chunks index, with refcounts. Keeping the current logic here would be ok, since it would only affect the speed of the machine doing the prune, not all M machines.

[1] meaning semantics, not implementation / data structure
[2] It should be smart enough to throw a delta away if it was made in the same session.
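
A rough sketch of that delta scheme (hypothetical layout, not borg code): the server records which ids each transaction added, and a client catches up with one delta request instead of a full cache rebuild.

class DeltaServer:
    def __init__(self):
        self.current_txn = 0
        self.added = {}                   # txn -> set of chunk ids added in it

    def commit(self, new_ids):
        self.current_txn += 1
        self.added[self.current_txn] = set(new_ids)

    def delta_since(self, client_txn):
        # Everything added after the client's last known transaction.
        delta = set()
        for t in range(client_txn + 1, self.current_txn + 1):
            delta |= self.added.get(t, set())
        return delta, self.current_txn


def client_catch_up(known_ids, client_txn, server):
    delta, txn = server.delta_since(client_txn)
    return known_ids | delta, txn         # up-to-date list of available chunks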

ThomasWaldmann commented 7 years ago

https://github.com/borgbackup/borg/issues/2313

ThomasWaldmann commented 7 years ago

Note: #2313 is closed, 1.1.0b6 with that code is released.

@level323 do you want to carefully test it? See `borg create --no-cache-sync`.

don't use the beta for production yet.

@enkore ^^^

enkore commented 7 years ago

Re. the topic, "many servers without cache rebuilding": I showed many months ago with BorgCube that you can have no cache syncs, fully up-to-date metadata and secure access control with Borg at the same time. I worked on merging this functionality into Borg itself as the "borg bastion" command [0], but ceased doing so for a variety of reasons related to the bigger picture and the amount of time I'm willing to invest into Borg.

[0]

usage: borg bastion [-h] [--critical] [--error] [--warning] [--info] [--debug]
                    [--debug-topic TOPIC] [-p] [--log-json] [--lock-wait N]
                    [--show-version] [--show-rc] [--no-files-cache]
                    [--umask M] [--remote-path PATH] [--remote-ratelimit rate]
                    [--consider-part-files] [--debug-profile FILE]
                    [--permit-create-archive GLOB]
                    [--permit-read-archive GLOB]
                    [--permit-compression COMPRESSION] [--log-file FILE]
                    [REPOSITORY]

Start bastion server. This command is usually not used manually.

positional arguments:
  REPOSITORY            permit access to REPOSITORY.

optional arguments:
  --permit-create-archive GLOB
                        permit creation of archives with names matching GLOB.
  --permit-read-archive GLOB
                        permit reading of archives whose names match GLOB.
  --permit-compression COMPRESSION
                        permit client to send chunks compressed using
                        COMPRESSION. COMPRESSION is a comma-separated list of
                        compression algorithms, excluding options like
                        compression level or `auto`. Default:
                        zlib,lz4,lzma,none
  --log-file FILE       write all logging output to FILE. borg bastion offers
                        this, since log output of server processes cannot be
                        redirected.

Common options:
  -h, --help            show this help message and exit
  --critical            work on log level CRITICAL
  --error               work on log level ERROR
  --warning             work on log level WARNING (default)
  --info, -v, --verbose
                        work on log level INFO
  --debug               enable debug output, work on log level DEBUG
  --debug-topic TOPIC   enable TOPIC debugging (can be specified multiple
                        times). The logger path is borg.debug.<TOPIC> if TOPIC
                        is not fully qualified.
  -p, --progress        show progress information
  --log-json            Output one JSON object per log line instead of
                        formatted text.
  --lock-wait N         wait for the lock, but max. N seconds (default: 1).
  --show-version        show/log the borg version
  --show-rc             show/log the return code (rc)
  --no-files-cache      do not load/update the file metadata cache used to
                        detect unchanged files
  --umask M             set umask to M (local and remote, default: 0077)
  --remote-path PATH    use PATH as borg executable on the remote (default:
                        "borg")
  --remote-ratelimit rate
                        set remote network upload rate limit in kiByte/s
                        (default: 0=unlimited)
  --consider-part-files
                        treat part files like normal files (e.g. to
                        list/extract them)
  --debug-profile FILE  Write execution profile in Borg format into FILE. For
                        local use a Python-compatible file can be generated by
                        suffixing FILE with ".pyprof".

This is an experimental feature.

This command provides a fine-grained access-control layer for Borg repositories.

This command is a repository proxy. It is usually not used manually.
Similar to "borg serve", "borg bastion" is invoked by a Borg client remotely through SSH.

While "borg serve" can be used with SSH forced commands, "borg bastion" must be used
with forced commands, otherwise a client could invoke arbitrary commands.

The server running bastion must be able to access the backend repository specified
by REPOSITORY, i.e. it must possess the necessary key.

The client connecting to the bastion requires neither the key nor access to the backend repository.

The bastion appears to the client like a repository that only contains archives
permitted by --permit-read-archive. The bastion checks every operation the client
makes and terminates the session if prohibited operations are carried out. All data
sent by the client is verified. A client is unable to push "poisoned" chunks to the repository,
i.e. chunks whose contents do not match their chunk ID. A client is unable to overwrite
existing chunks. A client is unable to delete archives or data. A client can create archives,
but only if their name matches --permit-create-archive.

The bastion has to decompress chunks sent by the client. This may allow the client to
exploit vulnerabilities in decompressors. Use the --permit-compression option to restrict
which decompressors the bastion will use.

If the backend repository is encrypted, then the bastion will synthesize a "similar" encryption
key that it sends to the client. This "similar" key contains the same ID hash algorithm,
the same ID key and the same chunker seed, but different encryption and MAC keys.

The identical ID hash allows the client to deduplicate against the backend repository,
while the changed encryption keys won't compromise the repository's confidentiality or
integrity.

Since the client possesses the ID hash, it can fingerprint contents in the backend repository,
assuming it has read access to it.

Refer to the bastion deployment manual for more information on this feature and
how to deploy it.
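
The "similar" key idea can be pictured like this (simplified illustration, not borg's real key format): the fields needed for deduplication are shared, while the fields guarding confidentiality and integrity are replaced.

import os

def synthesize_client_key(backend_key: dict) -> dict:
    return {
        "id_key":       backend_key["id_key"],        # same chunk-ID hashing -> dedup works
        "chunker_seed": backend_key["chunker_seed"],  # same chunk boundaries
        "enc_key":      os.urandom(32),               # fresh: client cannot decrypt the repo
        "mac_key":      os.urandom(32),               # fresh: client cannot forge repo MACs
    }
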
enkore commented 7 years ago

To expand on "bigger picture": Hacks around borg create are clearly viable, but they are also just that: hacks.

With 1.2 I see a much less strongly coupled architecture, and implementing a pull-files-over-network (split create: traversal & dedup in one process, encryption & control in another) is feasible there (it is not with the 1.0/1.1 code base). I also consider that as much closer to what is really needed. Whether access-control for pushing stuff is still needed if pulling is available, like implemented in BorgCube and borg bastion, is not clear (may still be useful / easier in some instances, and reading access-control is still useful). All of these are a significant effort to develop and maintain, and I see no reason why Borg needs to cater to every use case in this field.

level323 commented 6 years ago

@ThomasWaldmann Somehow I completely missed your request to test --no-cache-sync. My apologies for that.

I would like to test this feature as it seems it might go part of the way towards resolving this issue. But first I need to better understand the effect of this option. I haven't been able to find a description of the implications of --no-cache-sync beyond that implied by the name. I tried reading/deciphering the various Github issues related/linked to --no-cache-sync but declare defeat at pulling all the threads together.

Some questions:

Once I have a better understanding of what --no-cache-sync does I can know how to test it.

In my own use of borg (where I'm trying to back up multiple machines to a single borg repo) I'm already hitting the issue of painfully long cache syncs, so I'm keen to move this forward. I'm also willing to consider offering more bounties to move this issue towards final and stable closure.

gatlex commented 5 years ago

@level323: +1.

I too think a (short) description of the effects of the option would be useful for those willing to join the beta test. I think it would best be mentioned in this FAQ entry as a side note, clearly marking it experimental with some explanations about the implications.

Three questions regarding performance that imo are of particular interest:

  1. How does it perform compared to not using the flag when there is only a single server backing up into the repository?
  2. How does it perform when another server backs up into the repo but the chunks are disjoint?
  3. Does it update the local cache when a hit in the repo from another source is found? (That's how I read the comments on #2313. If so, it should probably better be called --ad-hoc-cache-sync.)

Btw, what's the state of this feature, has it been thoroughly tested by someone?

FabioPedretti commented 5 years ago

An alternative I am using is to set up a backup server, mount all the servers you want to back up via NFS, and do all backups from this central server.

level323 commented 5 years ago

@gatlex I haven't seen any reports on testing of this option yet.

I was hoping for the requested description of the effects of --no-cache-sync from the devs (a friendly tap on the shoulder to @ThomasWaldmann, @enkore etc.) to enable thorough testing. I'm so keen for this that I would normally be reaching the point where I would scour the code myself to try and discern what the option does, but unfortunately external circumstances have left me extremely time-poor for the foreseeable future.

I'm time poor but not cash poor ;-).... still willing to make a significant contribution to a significant improvement to borg's speed in the use case of backing up many machines to a single repo.

ThomasWaldmann commented 5 years ago

IIRC I did a few quick tests after the PR linked above was merged into 1.1.

It worked, but IIRC some functionality was still missing. So maybe read the PR, the commit comments and this issue to get a better impression of what it does and what it doesn't.

level323 commented 5 years ago

@ThomasWaldmann Thanks for the reply. I have read the PR and commit. I have understood both to the extent possible for someone who knows Python but is not familiar with the borg codebase. I have some further important questions that I feel need to be answered before I can test it (see the end of this post):

But first, here's a summary of what I presently understand (and don't understand) about the effect of --no-cache-sync. Please advise if I've got any of this wrong:

I don't presently have the time to test this feature using test repos, but I do have in-production repos (where multiple machines are backing up to a central repo, and suffering under long cache sync phases). So my questions are:

gatlex commented 5 years ago

It may be a stupid idea (and, to be explicit, it has nothing to do with the experimental --no-cache-sync option much discussed here), but wouldn't it be possible to maintain a cache (better: the cache) within the repo itself? When the local cache of the client is not identical to the repo's (hash them, for example), then update the local cache to match the repo's (rsync or the like).

My apologies if this is a bit naïve, I'm not into borg's code base...
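
For what it's worth, the mechanics would be something like this sketch (hypothetical paths; a whole-file copy where an rsync-style delta transfer would do better):

import hashlib
from pathlib import Path

def digest(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def refresh_local_cache(local: Path, repo_copy: Path) -> None:
    # Update the client's cache only when it differs from the repo's copy.
    if not local.exists() or digest(local) != digest(repo_copy):
        local.write_bytes(repo_copy.read_bytes())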

ThomasWaldmann commented 3 weeks ago

I added an AdHocWithFilesCache implementation in 2.0.0b9 (and made it default):