Many servers without cache rebuilding

yhack commented 8 years ago

Is it reasonable to use Borg at scale, backing up many servers to one repository? I'm afraid the need to rebuild the local cache is a blocker for this use case.

I may be way off base with this suggestion, but could the local cache be configured to store only the checksums from the local server? Sure, the servers may send some files to the repository that have already been hashed elsewhere, but no rebuilding would be required (not to mention lower space requirements for the local cache).

:moneybag: there is a bounty for this

ThomasWaldmann commented 8 years ago

@yhack yes, it will take quite some time for cache rebuilds. so the more servers you push into 1 repo, the more archives you have in the repo, the more time it will spend on rebuilding (and also the more space it will need on the backup clients for the archive index cache). there's a "borgception" idea in a ticket, but it is still todo.

About your idea: it's simply the case that the more you reduce the local "known chunks" cache, the more chunks it will unnecessarily process (compress, encrypt, transmit to the backup server).

level323 commented 8 years ago

@yhack, it's my hope that borg will better support backing up many servers to one repository in the not too distant future. This would be a killer feature from my point of view.

@ThomasWaldmann, re:

About your idea: it's simply the case that the more you reduce the local "known chunks" cache, the more chunks it will unnecessarily process (compress, encrypt, transmit to the backup server).

Does it really have to be the case that all of those costs (compress, encrypt, transmit) must be incurred if you reduce (or let go stale) the "known chunks" cache on the borg clients?

If the central borg server possesses an accurate and up-to-date known chunks cache (which it does, or at least could without great difficulty), then couldn't the comms protocol between borg client and borg server be modified to avoid many of these costs?

If the code is modified so that the borg client no longer trusts it's own cache(s) (rather, assumes they are stale) and works on the basis that the borg server is the only authoritative source of info on what is contained in the repo, I see a fairly efficient way forward. Indeed, content may have been added or deleted from the central borg server repo (by a different client) since this particular client last communicated with it. Working on this basis, the borg client can still use it's (stale) cache as a "first cut" on what needs to be sent to the central borg server (and what doesn't). Under this particular approach, before sending chunk data (or not), the borg client first advises the central borg server as follows:

Hey there borg central server, I'm planning to NOT send you chunk ID's x,y,z because my cache says you've already got them -- have you got a problem with that?

OR

Hey there borg central server, I'm planning to send you chunk ID's m,n,o because my cache says you don't have these -- have you got a problem with that?

In each case the borg central server gets a chance to correct the borg client's understanding of what chunks it does and doesn't possess (if the client is wrong because of it's stale cache). Recognising the authority of the borg central server, the client will then send only what the central borg server requires (and won't send what's not required). Furthermore, the responses from the server could then be used to freshen the borg client's cache (on something resembling a need-to-know basis - an optimised local cache update if you will).

Under the above arrangement I could see how some unnecessary chunking will occur, but almost all unnecessary compression, encryption and transmission will be avoided, won't it? The extra two-way banter between client and server will involve a little metadata exchange (and associated bandwidth and latency) but overall a lot less expensive than pointlessly transmitting whole chunks, yes?

I'm particularly interested in avoiding transmission of data that the central borg server already has, as my main desired use case is many machines backing up to a central borg server located remotely and accessible only via a relatively slow/expensive internet link.

enkore commented 8 years ago

What you say is mostly true, and it can be implemented similar to this.

Under the above arrangement I could see how some unnecessary chunking will occur, but almost all unnecessary compression, encryption and transmission will be avoided, won't it? The extra two-way banter between client and server will involve a little metadata exchange (and associated bandwidth and latency) but overall a lot less expensive than pointlessly transmitting whole chunks, yes?

Not necessarily. In LANs this would usually be the case (at least for 1 GbE), but over the internet the additional round-trips can significantly cut into the throughput; i.e this also needs additional buffering to avoid that. Otherwise should be ok.

level323 commented 8 years ago

@enkore I agree with your comment that appropriate buffering and/or multitasking will be required to fully utilise the slow internet link. But that's ultimately achievable and has been done before with many tools facing similar challenges.

yhack commented 8 years ago

@level323 This sounds like the ideal approach to me. In addition, the repository will need to support simultaneous updates from multiple clients. For example, the following event flow:

[client alpha] -> [borg central]   I'm planning to send you chunk XYZ, do you have it?
[borg central]                     (checks repo)
[borg central] -> [client alpha]   Nope, send it to me.

[client beta ] -> [borg central]   I'm planning to send you chunk XYZ, do you have it?
[borg central]                     (checks repo)
[borg central] -> [client beta ]   Nope, send it to me.

[client alpha] -> [borg central]   Here is chunk XYZ.
[borg central]                     (checks repo)
[borg central] -> [client alpha]   Got it. Thanks.

[client beta ] -> [borg central]   Here is chunk XYZ.
[borg central]                     (checks repo, see's XYZ already exists!)
[borg central] -> [client beta ]   I already have it. Thanks anyway.

level323 commented 8 years ago

I want this :-)

USD $100 bounty posted.

ThomasWaldmann commented 8 years ago

Note:

avoid feature creep (see original post and https://github.com/borgbackup/borg/issues/916#issuecomment-213641757) - it might make the bounty scope unclear.
also, we have conflicting requirements here somehow. this is a request to make the ssh: RPC protocol smarter, to let the remote borg agent do more stuff. in another ticket, there is the request for more backends (most of them unable to run such an agent).

level323 commented 8 years ago

@ThomasWaldmann I don't want this feature (or my bounty) to contribute to unhelpful feature creep or in any way push borg in the wrong direction. However, I feel (quite strongly) that this proposed ability is definitely not feature creep.

One of the central purposes of borg is to deduplicate backup data. Nowhere is the scope for backup data deduplication more apparent than the case of backing up files (particularly OS system files) from multiple machines to a central backup store.

I don't read the OP's comments (or my responses) as having any dependence on SSH per se. The relationship is with borg serve only, is it not?

As I see it, the concepts I've proposed in my responses to the the OP (and @yhack elaborates on) represent an important improvement in the design of borg which provides a significant extra feature (backing up from multiple hosts to a central server). However, if implemented thoughtfully I think this change will move borg towards a more powerful, capable, modular client-server design with fewer code paths - and it may even be slimmer in total lines of code. Additionally, I cannot see how this change would limit the potential to improve borg in other areas - if anything I think it is synergistic.

Note: Although below I advocate borg's design be altered to an always-client-server arrangement (even when client and server are the same machine), this approach is by no means necessary. I'm only making the suggestion here and now because it seems to be a change which is synergistic with the OP's issue/feature request. A the end of the day I'll happily pay the bounty as long as the end is achieved (that is, no more expensive/wasteful/redundant cache syncing in multi-client-to-central-server arrangements).

Since the beginning of borg/attic's ability to back up to a remote machine, borg has tried to manage a cache on the client. The main function of this cache was to reduce comms between client and server. Implicit in this design decision is the assumption that the link between client and server provides a speed and/or latency bottleneck. That assumption is correct and borg has always been right in endeavouring to minimise that bandwidth/latency. However, the design of the present caching system assesses the cache in a black-or-white only manner (that is, either "the cache is fully up-to-date and useful" or else "the cache is stale and useless"). If the cache was detected as stale, the whole thing (more or less) is discarded and rebuilt as a separate procedure. There have been discussions about ideas to improve the speed of cache rebuild/sync and minimise flow of redundant data, however real progress has been quite limited. I would suggest that, in hindsight, previous discussions on improving cache sync have been looking at the problem with the wrong assumptions under the hood - specifically the assumption that a current cache is necessary. I contend that a 100% current cache isn't necessary at all to achieve highly effective latency/bandwidth reduction for slow links. In that sense, this proposed change could be viewed as improving the borg client local cache management, with some wonderfully good side-effects.

The conceptual design change I propose is for the client to always assume the client cache is stale, but yet still potentially useful (as per @yhack and my earlier comments). In all use cases I can presently conceive of, this provides:

Advantages:
- In the case of multi-client-to-single-server, each client no longer has to synchronise cache data that is irrelevant to that client, saving time and bandwidth.
- In the case of multi-client-to-single-server, borg can start backing up files almost immediately (after the filesystem scan). This will also improve performance of borg over unreliable remote links (frequent disconnects).
- In various cases (albeit perhaps rare) where the client cache can become stale or incorrect even in single-client-to-single-server arrangements, borg can start backing up files almost immediately (after the filesystem scan). This will also improve performance of borg over unreliable remote links (frequent disconnects)
- Under the proposed new cache management regime, the code around client cache management should be simpler, leaner and easier to follow because the way it treats it's cache is the same every time (that is, my cache is always stale, the server's is always authoritative and I'll minimally update mine in the normal flow of comms with the server)
- The proposed approach to updating the local cache not only avoids updating cache info from other machines/clients that is irrelevant to this client/machine, it also avoids updating the local cache concerning local content that no longer exists (e.g. deleted files or content within files that has changed).
- This moves borg code in the direction (if the developers so choose, of course) of always operating in a client-server fashion, even if client and server are on the same machine. A single, consistent approach and code flow. I think this is a good direction to move in, as there's no reason performance should be significantly affected when client and server are on the same machine. The case of client and server becomes a special case where certain straightforward optimisations kick in (simple bypasses and stubs in place of the normal local cache management code).
- In the case that the local cache is damaged or out-of-date, the local cache is updated minimally, optimally and in the normal course of chunk data transfer - which should scale well as the number of distinct clients increases.
Disadvantages:
- This is a breaking change in terms of the client-server communications protocol
- This change is fairly major in terms of structural underpinnings of the design.
- I can't think of any others - can anyone else think of any?

@ThomasWaldmann If you feel that a new issue needs to be raised to better clarify this proposed change/improvement to avoid some administrative problems you think may occur, please go ahead. Just please please please let this improvement move forward. This is the first time I've used bountysource - I don't know if it's possible to move my bounty to a new ticket but hopefully there's a way.

ThomasWaldmann commented 8 years ago

Read my comment again. It referred to starting from "we ask the server for chunk existance" and later extendeding to "we run multiple backups in parallel". That is what i meant with feature/scope creep.

level323 commented 8 years ago

@ThomasWaldmann Ahhh, gotcha. My apologies. I just revealed how deep my love for the ability to efficiently back up multiple clients to a single server was right there ;-) Don't tell my wife, OK?

I agree that making the central server able to handle multiple simultaneous client backup sessions is not necessary to enable functional and workable multi-client-to-single-server backup solution.

I do think/agree that discussing adding the ability for the server to handle multiple "simultaneous" client sessions is worth exploring, but in a separate ticket. As far as storing chunked data (and querying the existence of it) goes, I can see how the server shouldn't be troubled by whether such a query/request comes from client 001 or client 999. So it's an interesting idea that probably should be explored in a separate ticket.

enkore commented 8 years ago

I thought some more about this as well ;)

When creating archives most data in the Cache is wholly irrelevant. Neither refcount nor csize (or size, really) matters. What the remote (borg serve) can see is only ID & csize anyway. Let the remote keep a list-like[1] index of id (maybe mapped to csize, who cares), and save deltas from newer to older transactions (in a form of [oldn, current], [oldn-1, current], ...[2]). A client would only keep one of those and ask the remote for a delta of [client_transaction, current]. After that is downloaded and merged into what the client has the client has an up-to-date list of available chunks.

What would be interesting in this scenario is when old deltas are thrown away. Could be done by

last-N (if I have M machines doing backups in a round-robin fashion, then I only need M+x (x = safety margin) worth of deltas, after that they will have rotated through)
age-based (if I have M machines and they do a staggered backup every 6 hours, then deltas older than a day are not interesting anymore)
identifying clients to the server (blech!)

Not thought thoroughly about compaction here, but it shouldn't matter too much except that it could make deltas unnecessarily large.

Note that in this scheme you only gain an advantage if many servers do borg-create, but only one does borg-delete/borg-prune, those will need a full chunks index, with refcounts (-- always). Keeping the current logic here would be ok, since it would only affect the speed of the machine doing the prune, not all M machines.

[1] meaning semantics, not implementation / data structure [2] It should be smart enough to throw a delta away if it was made in the same session.

ThomasWaldmann commented 7 years ago

https://github.com/borgbackup/borg/issues/2313

ThomasWaldmann commented 7 years ago

Note: #2313 is closed, 1.1.0b6 with that code is released.

@level323 do you want to carefully test it, see borg create --no-cache-sync.

don't use the beta for production yet.

@enkore ^^^

enkore commented 7 years ago

Re. the topic, "many servers without cache rebuilding"; I've shown many months ago with BorgCube that you can have no cache syncs, fully up-to-date metadata and secure access-control with Borg at the same time. I worked on merging this functionality into Borg itself as the "borg bastion" command [0], but ceased doing so, for a variety of reasons related to the bigger picture and the amount of time I'm willing to invest into Borg.

[0]

usage: borg bastion [-h] [--critical] [--error] [--warning] [--info] [--debug]
                    [--debug-topic TOPIC] [-p] [--log-json] [--lock-wait N]
                    [--show-version] [--show-rc] [--no-files-cache]
                    [--umask M] [--remote-path PATH] [--remote-ratelimit rate]
                    [--consider-part-files] [--debug-profile FILE]
                    [--permit-create-archive GLOB]
                    [--permit-read-archive GLOB]
                    [--permit-compression COMPRESSION] [--log-file FILE]
                    [REPOSITORY]

Start bastion server. This command is usually not used manually.

positional arguments:
  REPOSITORY            permit access to REPOSITORY.

optional arguments:
  --permit-create-archive GLOB
                        permit creation of archives with names matching GLOB.
  --permit-read-archive GLOB
                        permit reading of archives whose names match GLOB.
  --permit-compression COMPRESSION
                        permit client to send chunks compressed using
                        COMPRESSION. COMPRESSION is a comma-separated list of
                        compression algorithms, excluding options like
                        compression level or `auto`. Default:
                        zlib,lz4,lzma,none
  --log-file FILE       write all logging output to FILE. borg bastion offers
                        this, since log output of server processes cannot be
                        redirected.

Common options:
  -h, --help            show this help message and exit
  --critical            work on log level CRITICAL
  --error               work on log level ERROR
  --warning             work on log level WARNING (default)
  --info, -v, --verbose
                        work on log level INFO
  --debug               enable debug output, work on log level DEBUG
  --debug-topic TOPIC   enable TOPIC debugging (can be specified multiple
                        times). The logger path is borg.debug.<TOPIC> if TOPIC
                        is not fully qualified.
  -p, --progress        show progress information
  --log-json            Output one JSON object per log line instead of
                        formatted text.
  --lock-wait N         wait for the lock, but max. N seconds (default: 1).
  --show-version        show/log the borg version
  --show-rc             show/log the return code (rc)
  --no-files-cache      do not load/update the file metadata cache used to
                        detect unchanged files
  --umask M             set umask to M (local and remote, default: 0077)
  --remote-path PATH    use PATH as borg executable on the remote (default:
                        "borg")
  --remote-ratelimit rate
                        set remote network upload rate limit in kiByte/s
                        (default: 0=unlimited)
  --consider-part-files
                        treat part files like normal files (e.g. to
                        list/extract them)
  --debug-profile FILE  Write execution profile in Borg format into FILE. For
                        local use a Python-compatible file can be generated by
                        suffixing FILE with ".pyprof".

This is an experimental feature.

This commands provides a fine-grained access-control layer for Borg repositories.

This command is a repository proxy. It is usually not used manually.
Similar to "borg serve", "borg bastion" is invoked by a Borg client remotely through SSH.

While "borg serve" can be used with SSH forced commands, "borg bastion" must be used
with forced commands, otherwise a client could invoke arbitrary commands.

The server running bastion must be able to access the backend repository specified
by REPOSITORY, i.e. it must possess the necessary key.

The client connecting to the bastion requires neither the key nor access to the backend repository.

The bastion appears to the client like a repository that only contains archives
permitted by --permit-read-archive. The bastion checks every operation the client
makes and terminates the session if prohibited operations are carried out. All data
send by the client is verified. A client is unable to push "poisoned" chunks to the repository,
i.e. chunks whose contents do not match their chunk ID. A client is unable to overwrite
existing chunks. A client is unable to delete archives or data. A client can create archives,
but only if their name matches --permit-create-archive.

The bastion has to decompress chunks sent by the client. This may allow the client to
exploit vulnerabilities in decompressors. Use the --permit-compression option to restrict
which decompressors the bastion will use.

If the backend repository is encrypted, then the bastion will synthesize a "similar" encryption
key that it sends to the client. This "similar" key contains the same ID hash algorithm,
the same ID key and the same chunker seed, but different encryption and MAC keys.

The identical ID hash allows the client to deduplicate against the backend repository,
while the changed encryption keys won't compromise the repository's confidentiality or
integrity.

Since the client possesses the ID hash, it can fingerpint contents in the backend repository,
assuming it has read access to it.

Refer to the bastion deployment manual for more information on this feature and
how to deploy it.

enkore commented 7 years ago

To expand on "bigger picture": Hacks around borg create are clearly viable, but they are also just that: hacks.

With 1.2 I see a much less strongly coupled architecture, and implementing a pull-files-over-network (split create: traversal & dedup in one process, encryption & control in another) is feasible there (it is not with the 1.0/1.1 code base). I also consider that as much closer to what is really needed. Whether access-control for pushing stuff is still needed if pulling is available, like implemented in BorgCube and borg bastion, is not clear (may still be useful / easier in some instances, and reading access-control is still useful). All of these are a significant effort to develop and maintain, and I see no reason why Borg needs to cater to every use case in this field.

level323 commented 6 years ago

@ThomasWaldmann Somehow I completely missed your request to test --no-cache-sync. My apologies for that.

I would like to test this feature as it seems it might go part of the way towards resolving this issue. But first I need to better understand the effect of this option. I haven't been able to find a description of the implications of --no-cache-sync beyond that implied by the name. I tried reading/deciphering the various Github issues related/linked to --no-cache-sync but declare defeat at pulling all the threads together.

Some questions:

Does --no-cache-sync prevent the initial syncing of BOTH the files cache and chunks cache?
Does --no-cache-sync invalidate all the caches on the sending side completely, thereby causing the sending side to rechunk every file being backed up? Or does the sending side use unsynced/stale cache(s) in it's possession to (imperfectly) guess what files need to be chunked+transmitted, but allow the borg server on the receiving end to signal the sending side to correct any mistakes it makes in this regard (by informing the sending side that it already has a chunk that the sender thinks it needs and/or to send a chunk that the sender mistakenly thinks it already has)? Or is something else going on?

Once I have a better understanding of what --no-cache-sync does I can know how to test it.

In my own use of borg (where I'm trying to backup multiple machines to a single borg repo) I'm already hitting the issue of painfully long cache syncs, so I'm keen to move this forward. I'm also willing to consider offering more bounties on this issue to move this issue towards final and stable closure.

gatlex commented 5 years ago

@level323: +1.

I too think a (short) description of the effects of the option would be useful for those willing to join the beta test. I think it would best be mentioned in this FAQ entry as a side note, clearly marking it experimental with some explanations about the implications.

Three questions regarding performance that imo are of particular interest:

How does it perform to not using the flag when there is only a single server backing up into the repository?
How does it perform when another server backs up into the repo but the chunks are disjoint?
Does it update the local cache when a hit on in the repo from another source was found? (That's how I read the comments of 2313. If so, it should probably better be called --ad-hoc-cache-sync.)

Btw, what's the state of this feature, has it been thoroughly tested by someone?

FabioPedretti commented 5 years ago

An alternative I am using is to set up a backup server, mount through NFS all the servers you want to backup and do all backups from this central server.

level323 commented 5 years ago

@gatlex I haven't seen any reports on testing of this option yet.

I was hoping for the requested description of the effects of --no-cache-sync from the devs (a friendly tap on the shoulder to @ThomasWaldmann , @enkore etc.) to enable thorough testing. I'm so keen for this I would normally be reaching the point where I would scour the code myself to try and discern what the option did but unfortunately external circumstances have left me extremely time-poor for the forseeable future.

I'm time poor but not cash poor ;-).... still willing to make a significant contribution to a significant improvement to borg's speed in the use case of backing up many machines to a single repo.

ThomasWaldmann commented 5 years ago

iirc i did a few quick tests after the PR linked above was merged into 1.1.

it worked, but iirc some functionality was still missing yet. so maybe read the PR, commit comment and this issue to get a better impression about what it does and what not.

level323 commented 5 years ago

@ThomasWaldmann Thanks for the reply. I have read the PR and commit. I have understood both to the extent possible for someone who knows python but is not familiar with the borg codebase. I have some further important questions that I feel need to be answered before I can test it (see at the end of this post):

But first, here's a summary of what I presently understand (and don't understand) about the effect of --no-cache-sync. Please advise if I've got any of this wrong:

The files cache is not used at all. So all files are definitely rechunked on the first run (see also next point)
The local ad-hoc cache does not appear to remember files at all (that is, it doesn't appear to remember files added to or found to be already existing in the repo). So not only will all files be re-chunked on the first run, but also on all subsequent runs. @ThomasWaldmann please advise if I'm correct on this - it's a particularly important issue for me to understand for my use case
- If I'm correct on the above point, then there likely to be numerous use cases where performance of --no-cache-sync will be worse.
By my reading, it seems to me that if the local ad-hoc cache thinks a chunk exists in the repo it will not attempt to push the chunk (or advise the receiving end that the chunk has been skipped which would give the receiving end the opportunity to respond with 'actually I don't have that chunk any more, please send it'). This opens the potential for data loss if an earlier operation by another machine has caused chunks to be removed from the repo (via borg delete or borg prune).

I don't presently have the time to test this feature using test repos, but I do have in-production repos (where multiple machines are backing up to a central repo, and suffering under long cache sync phases). So my questions are:

This feature is marked as experimental. Does "experimental" mean that it might corrupt a repo? If yes, I obviously should not test it on my production repos. I guess I'm unclear on the meaning of 'experimental'. Does 'experimental' mean "it might break your data" or "it's data-safe, but the extent of benefit it provides is not yet known"?
Am I correct about the data loss issue described two paragraphs above? I hope not.

gatlex commented 5 years ago

It may be a stupid idea (and to be explicit about it has nothing to do with the experimental --no-cache-sync option much discussed here) but wouldn't it be possible to maintain a cache (better the cache) within the repo itself. When the local cache of the client is not identical to the repo's (hash them, for example), then update the local cache to match the repo's (rsync or the like).

My apologies if this is a bit naïve, I'm not into borg's code base...

ThomasWaldmann commented 3 weeks ago

I added an AdHocWithFilesCache implementation in 2.0.0b9 (and made it default):

it does not have a persistent chunks cache (so nothing needs to be expensively rebuilt).
it ad-hoc queries existing chunk ids from the repo.
it has a persistent files cache (in contrast to the older AdHocCache implementation), so it is faster now.

borgbackup / borg

Many servers without cache rebuilding #916