Borg backup to Amazon S3 on FUSE?

geckolinux commented 9 years ago

Hi everyone,

I'm interested in using Borg to backup my webserver to an Amazon S3 bucket. I've been using Duplicity, but I'm sick of the full/incremental model, as well as the difficulty of pruning backups. I love the ease of use and features that Borg provides, but I don't really understand the internals and I'm not sure if it will work with Amazon S3 storage.

Specifically, I'm considering mounting my S3 bucket over FUSE, using one of the following three options:

Any comments on which, if any would be more appropriate? And how tolerant would Borg be to S3's "eventual consistency" weirdness?

Additionally, I need to plan against the worst-case scenario of a hacker getting root access to my server and deleting the backups on S3 using the stored credentials on my server. To eliminate this possibility, I was thinking about enabling S3 versioning on the bucket so that files deleted with my server's S3 user account can still be recovered via my main Amazon user account. Then, I would have S3 lifecycle management configured to delete all versions of deleted files after X amount of time. In this case,

How much of my S3 data would Borg routinely need to download in order to figure out which files have changed and need to be backup up? (I'm worried about bandwidth costs.)
How much accumulated clutter and wasted space could I expect from files that Borg "deletes", (which will actually be retained on S3 due to the versioning)?

Again, my concerns are based on me not really understanding all the black magic that happens with all the chunks and indexes inside a Borg repository, and how much they change from one backup to the next.

Thanks in advance for the help!

:moneybag: there is a bounty for this

geckolinux commented 9 years ago

I'm still trying to get an idea of what exactly happens in the Borg repo from one run to the next. I used it to backup my ~/ directory (about 72GB on disk) last night, and I messed around with creating and deleting files and re-combining ISO images to see how well the de-dupe works. (It works extremely well, I might add!) I ran around 30 backups with no pruning. That was last night, and then today I used my computer for some web browsing and then ran another backup with a before and after ls -sl on the repo/data/1 directory . Here's a diff of repo/data/1 before and after: http://paste.ubuntu.com/11910814/ (1 chunk deleted, 4 added, total change of 5)

Then I pruned all but the most recent backup and ran another diff: http://paste.ubuntu.com/11910824/ And here's the repo/data/0 directory, just the names of deleted files: http://paste.ubuntu.com/11910839/ (580 chunks deleted, 75 added, total change of 655)

So assuming that all the chunks are around 5MB, that would be around 3GB of deleted data taking up wasted space in Amazon S3, which would cost me about $0.05/month in Glacier according to Amazon's calculator, and it would have to stay there for 90 days to avoid a penalty. Or else in regular S3 storage it would cost something like $0.11/month. Additionally there would be far fewer changes and much less total data stored in the case of my webserver I want to back up with this scheme.

So I would tentatively think this could be a good option?

oderwat commented 9 years ago

I might add that you can get 10 TB (thats ten terrabyte) as "nearly" OpenStack Swift compatible storage from HubiC.com for 50 Euro a year (no kidding). I use this together with my Hubic Swift Gateway and the swift duplicity back end.

This also is EU storage (located in france) which solves some problems with German laws.

I also think that it is fairly easy to implement as backend for software with a chunked approach.

P.S.: Their desktop client (still) sucks imho... but you even get 25 GB for free. Which can also be used for experiments with the API.

geckolinux commented 9 years ago

Thanks @oderwat for the tip! Good to know.

ThomasWaldmann commented 9 years ago

I must say that I don't use "cloud data storage services", so I can't advise about their API/capabilities.

Borg's backend is similar to a key/value storage and segment files only get created/written, but not modified (except from complete segment files being deleted), so it could be possible if someone writes such a backend.

Borg has an "internals" doc that might be interesting for anybody wanting to write such a backend. If information is missing there, please file a docs issue here.

anarcat commented 8 years ago

borg has some level of abstraction of remote repositories... there's currently only a single RemoteRepository implementation, and it hardcodes ssh in a bunch of place. we nevertheless have a list of methods we use in RPC calls that would need to be defined more clearly, maybe cleaned up, and then implemented in such a new implementation:

    rpc_methods = (
        '__len__',
        'check',
        'commit',
        'delete',
        'destroy',
        'get',
        'list',
        'negotiate',
        'open',
        'put',
        'repair',
        'rollback',
        'save_key',
        'load_key',
    )

this list is from remote.py, and is passed through the SSH pipe during communication with the borg serve command...

anarcat commented 8 years ago

notice the similar issue in https://github.com/jborg/attic/issues/136

rmoriz commented 8 years ago

Supporting storage services like AWS S3 would be huge and make borg a real alternative to tools like tarsnap. I would support a bounty for a) generic storage interface layer b) and S3 support based on it. I suggest libcloud https://libcloud.readthedocs.org/en/latest/storage/supported_providers.html to design interfaces/deal with cloud storage services.

Another interesting backend storage might be sftp/scp, as provided by some traditional hosting providers, like Hetzner or Strato HiDrive

anarcat commented 8 years ago

@rmoriz your contribution would of course be welcome. bounties are organised on bountysource, in this case: https://www.bountysource.com/issues/24578298-borg-backup-to-amazon-s3-on-fuse

the main problem with S3 and other cloud providers is that we can't run native code on the other side, which we currently expect for remote server support. our remote server support involves calling fairly high-level functions like check on the remote side, which can't possibly be implemented directly in the native S3 API: we'd need to treat those as different remotes. see also https://github.com/borgbackup/borg/issues/191#issuecomment-145749312 about this...

the assumptions we make about the remotes also imply that the current good performance we get on SSH-based remotes would be affected by "dumb" remotes like key/object value storage. see also https://github.com/borgbackup/borg/issues/36#issuecomment-145918610 for this.

rmoriz commented 8 years ago

Please correct my if I'm wrong.

It looks like we have/need a three-tier architecture:

borg client
borg server (via ssh)
(dumb) storage.

So the borg server part needs a storage abstraction model where backends like S3, ftps, Google Cloud Storage, etc. can be added.

Is that correct? I think using FUSE adapters are not a reliable way (IMHO).

Update:

bounty added: https://www.bountysource.com/issues/24578298-borg-backup-to-amazon-s3-on-fuse
@RonnyPfannschmidt that solution would be even better. I would love to see this happen.
please discuss possible implementations with the maintainers before starting work. Thank you.

RonnyPfannschmidt commented 8 years ago

the server is not necessaryly needed

borgs internal structure would allow to use something like a different k/v store as well - but someone needs to do and test it

ThomasWaldmann commented 8 years ago

Thanks for putting a bounty on this.

If someone wants to take it: please discuss implementation here beforehands, do not work in the dark.

jasonfharris commented 8 years ago

+1 for me on this. I want exactly what the original poster is talking about. Also since I am worrying about deduplicating I want to use some really highly durable storage like amazon has. Also the versioning life-cycles to protect against the "compromised" host problem would be fantastic... (I added to the bounty :) )

asteadman commented 8 years ago

I've written up some of my thoughts on some of the limitations of s3, and a WIP discussion about some possible method to address them. It is organised as a single document right now, but as it flushes out, I will expand it as appropriate. Please comment there and I will try and keep the document up to date with as much information as possible. see https://gist.github.com/asteadman/bd79833a325df0776810

Any feedback is appreciated. Thank you.

ThomasWaldmann commented 8 years ago

the problematic points (as you have partly noticed already):

using 1 file per chunk is not gonna work practically - too many chunks, too much overhead. you have to consider that 1 chunk is not just the usual 64kiB (or soon: 1MiB) target chunk size, but can be way smaller if the input file is smaller. you can't really ignore that in the end, this is something that has to be solved.
the archive metadata (list of all files, metadata of files, chunk lists) can be quite large, so you won't be able / you won't want to store this in one piece. borg currently runs this metadata stream through chunker / deduplication also, which is quite nice because we always have the full(!) item list there and a lot of it is not changing usually.
"skipping chunks that already exist" - if you want to do that quickly, you need an up-to-date (consistent) local index / hash table. otherwise, you may have 1 network roundtrip per chunk.
that "eventually consistent" S3 property is scary. it's already hard enough to design such a system without that property.
"chunk staleness" is an interesting idea. but i think you could run into race conditions - e.g. you just decided that this 3 months old chunk shall be killed, when a parallel backup task decided to use it again. guess either atomicity or locking is needed here.

ThomasWaldmann commented 8 years ago

Yes, target chunk size in 1.0 will be 1 or 2MiB. That doesn't mean that there will be no tiny chunks - if you file only has 1 byte, it will be still 1 chunk. So, the average might be lower than the target size.

BTW, it is still unclear to me how you want to work without locking, with parallelel operations allowed (including deletion). I also do not think that making this github issue longer and longer with back-and-forth discussion posts is helping here very much - if we want to implement this, we need ONE relatively formal description of how it works (not many pages in discussion mode).

So I'ld suggest you please rather edit one of your posts and update it as needed until it implements everything needed or until we find it can't be implemented. Also, the other posts (including mine) should be removed after integration. I am also not sure a gh issue is the best for that, maybe a github repo, where one can see diffs and history would be better.

ThomasWaldmann commented 8 years ago

http://www.daemonology.net/blog/2008-12-14-how-tarsnap-uses-aws.html doesn't sound too promising about the possibility of reliably using S3 directly from a backup tool (he wrote a special server that sits between the backup client and S3).

rmoriz commented 8 years ago

@TW that post was from 2008… https://aws.amazon.com/de/s3/faqs/#How_durable_is_Amazon_S3

RonnyPfannschmidt commented 8 years ago

@ThomasWaldmann - actually its promising - it's not too different from what borg is already doing in the local format - and it might not need too much of a change to make borg work against it

olivernz commented 8 years ago

Don't forget BackBlaze's B2. Cheapest storage around. Hashbackup already does all that but it's closed source so who knows how that is done.

phime42 commented 8 years ago

Amazon Cloud Drive offers unlimited storage for just 50$ a year. Would be great if it'd be supported! :)

enkore commented 8 years ago

There's a FUSE FS for it: https://github.com/yadayada/acd_cli

That should work okayish (maybe not the best performance).

This thread here is about directly using the S3 key-value store as a backup target (no intermediate FS layer), at least that's how I understand it.

I think it's kinda unrealistic, at least for now, to completely redo the Repository layer. An alternative Repository implementation could be possible, but I don't see how you could do reliable locking with only S3 as the IPC, when it explicitly states that all operations are only eventually consistent. Parallel operation might be possible, but really, it's not a good idea for a first impl. Also, Repository works only on a chunk-level, and most chunks are very small. That just won't work. (As mentioned above)

Working on the LoggedIO level (i.e. alternate implementation of that, which doesn't store segments in the FS, but S3) sounds more promising to me (but - eventual consistency, so the Repository index must be both local and remote, i.e. remote updated after a successful local transaction, so we will actually need to re-implement both LoggedIO and Repository).

Locking: Either external (e.g. simple(!) database. Are there ACID RESTful databases, those wouldn't need a lot of code or external deps?) or "User promise locking" (i.e. 'Yes dear Borg, I won't run things in parallel').

Eventual consistency: Put last (id_hash(Manifest), timestamp) in locking storage or local, refuse to operate if Manifest of S3 isn't ==?

roaima commented 8 years ago

For what it's worth, I'm currently using borg on top of a Hubic FUSE-based filesystem for my off-site backups. It's painfully slow - my net effective writing speed is around only 1 Mb/s - but other than that works pretty well.

Issues as I see them

Writes have a very high latency. Once you're writing it's fast (10 Mb/s, intentionally limited within Hubic), but there seems to be a two second delay at the beginning of each file write.
Reads are reasonably fast. There's certainly nothing like the write latency but I've yet to turn this from an empirical value into a quantifiable one.
The process is slow, so avoiding inter-feature locking would be a very good thing. (borg list, and borg extract, specifically).

It might help to cache KV updates locally before writing them in a blast periodically, But I don't have any easy way of testing this. (It would be nice if there were a generic FUSE caching layer, but I have not been able to find one.)

enkore commented 8 years ago

Increasing the segment size in the repo config might help if there is a long-ish ramp-up period for uploads. (And increasing filesystem level buffer sizes if possible)

ThomasWaldmann commented 8 years ago

http://rclone.org/ maybe interesting as component for the cloud support plan.

anarcat commented 8 years ago

here's the most original solution I have heard yet for "cloud" backups with borg:

https://juliank.wordpress.com/2016/05/11/backing-up-with-borg-and-git-annex/

TL;DR: backup locally, then use git-annex (!) to backup to... well, anything. in this case, a webdav server, but yeah, git-annex supports pretty much anything (including rclone) and can watch over directories. I'm surprised this works at all!

InAnimaTe commented 8 years ago

Yeah so I've already went down the git-annex route through research and testing and it's extremely complicated. The way your suggesting is really dirty and tedious....git-annex is a whole other beast to learn. Really, users of Borg could already just rclone their backups to whatever cloud provider is supported by rclone (most of them). You'd only need to add git-annex if you're looking for even more versioning and/or encryption.

On Wed, May 11, 2016 at 11:40 AM, anarcat notifications@github.com wrote:

here's the most original solution I have heard yet for "cloud" backups with borg:

https://juliank.wordpress.com/2016/05/11/backing-up-with-borg-and-git-annex/

TL;DR: backup locally, then use git-annex (!) to backup to... well, anything. in this case, a webdav server, but yeah, git-annex supports pretty much anything (including rclone) and can watch over directories. I'm surprised this works at all!

— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/borgbackup/borg/issues/102#issuecomment-218499047

Mario P. Loria (586) 258-6003 Co-Founder, Arroyo Networks https://arroyonetworks.com/

marioploria@gmail.com https://www.linkedin.com/in/mario-loria-32514b10a https://twitter.com/marioploria https://plus.google.com/+MarioLoria/posts https://github.com/InAnimaTe

anarcat commented 8 years ago

Well, that was the whole point wasn't it, encryption... Then again, it's unclear to me why they didn't use the built-in encryption.

The whole setup, even with just rclone, also has the problem that you have a local repository which takes local disk space. Obviously, this is not a complete solution for the problem here, but I thought it would be interesting to share nonetheless.

mnpenner commented 8 years ago

So.... if I rclone the entire borg repo to my favorite cloud storage provider, and then I later want to restore something, do I have to re-download the entire repo? And what if one or two "chunks" get corrupted, can I still recover the rest?

jtwill commented 8 years ago

https://github.com/gilbertchen/duplicacy-beta

It looks like duplicacy has most or all of the same features as borg-backup, and it also supports backing up to cloud storage like Amazon S3. Unfortunately, it is not currently open-source, and the development seems to happen behind closed doors.

The developer does share a design document so it is possible to get a general idea how it works. If I understand it correctly, the reason duplicacy is able to work with cloud storage is because it does not have a specialized index or database to keep track of chunks, but rather uses the filesystem and names the files/chunks by their hash.

DavidCWGA commented 8 years ago

It's super-shady that they're using github while not being open source.

jtwill commented 8 years ago

I agree that the way they are doing it is not very cool. I did not mention them to recommend their software, but rather because it seems that they have a good design. Since the design document is on github, others should be free to use the same design, no?

PlasmaPower commented 8 years ago

Also, if you need specifics, they forgot to strip their executable :P. It's built in go from what I can tell.

JensRantil commented 8 years ago

See also related #1070.

julian-klode commented 7 years ago

WRT the git-annex thing, I have to say I dropped that and am thinking about other solutions. The major issue being that you basically have to fetch everything back to restore the latest backup, as we'd otherwise need borg to be able to tell git-annex which files it needs. The WebDAV server I'm on also has 1TB of storage, but apparently there's a file number limit or something, because it stopped working at some point.

Another thing I wondered about is pruning: Doesn't that repack a lot of files, and would force me to retransmit a lot of data to the server compared to running prune on a server-side borg (append-only would probably be about 1GB per day, so I'd quickly fill up 1TB)? This would also be a issue for FUSE file systems.

I'm thinking about just running borg once for each location, instead of duplicating local borg backups, and just pay a tiny bit for an rsync.net account ($0.03/GB is OK) and/or re-use a few machines at my local sites.

ThomasWaldmann commented 7 years ago

@julian-klode if a bigger segment size would help you for that webdav server, you can set that in the repo config. borg 1.1 (beta right now) will create bigger segments.

Prune: you can't run prune server-side (except if you give the server-side borg the encryption key - you can do that if you trust the server, but borg doesn't trust the server by design).

anarcat commented 7 years ago

@rmoriz considering the size of your bounty, could you clarify if it is really dependent on a "generic storage interface layer" other than a FUSE filesystem? as things stand, the bounty is in the process of being claimed by @bgemmill in https://github.com/yadayada/acd_cli/pull/374, but that only covers a S3 over FUSE filesystem implementation.

it seems to me #1070 is the place for a more generic implementation (which I would be interested in working on, more than just S3), but there's no bounty on that other task, and even the amount here isn't quite enough to compensate for the time that would be needed to complete a more generic design...

reyman commented 7 years ago

Hi, add 5 dollar for this great bounty ! It works also with the unlimited storage solution from amazon (cloud drive) ?

anarcat commented 7 years ago

It works also with the unlimited storage solution from amazon (cloud drive) ?

I believe the Amazon drive cloud thing is a different API than S3. Even worst, the API is invite-only:

https://developer.amazon.com/amazon-drive

So I think it's out of scope for this specific issue here, but would be a good fit, again, for #1070...

reyman commented 7 years ago

@anarcat Why not, perhaps we can start a bounty on that ? I can give 5$ - 10$ on this. I'm really interested by automatic backup on cheap cloud backup solution like amazon drive offer.

anarcat commented 7 years ago

head over to #1070 then :)

bgemmill commented 7 years ago

With my PR on acdcli, you can use borg on amazon cloud drive right now. I've been using it for a month or so without issue.

You also don't need to sign up for an API as @anarcat mentioned; that's only if you're going to be creating your own access system like acdcli or a fork thereof. Just using it is fine.

As to the bounty, acd isn't technically s3, so I'm not sure if my work qualifies there. If the demand was merely to run borg on cheap amazon storage, you can do so now :-)

dave-fl commented 7 years ago

Any plans to to make this more general and support something like libcloud. An Amazon Cloud Drive or Backblaze option would be great. Rsync.net will probably be more cost effective than S3.

mhnbg commented 7 years ago

bgemmill wrote:

With my PR on acdcli, you can use borg on amazon cloud drive right now. I've been using it for a month or so without issue.

How did you do that? I mounted Amazon Cloud Drive with acd_cli, but get "assert transaction_id is not None" @ borg init, and "Invalid segment magic" @ borg create... Are there any special parameters to set or other software to install?

enkore commented 7 years ago

You probably have to apply his PR first.

milkey-mouse commented 7 years ago

Considering rclone was banned from ACD (they may ban similar automated backup programs too) and Amazon removed their unlimited storage option Amazon Cloud Drive might have lost a lot of its appeal for use with Borg...

mnpenner commented 7 years ago

Agreed. I'm dropping ACD. I'm really disappointed in them. They offered that unlimited plan after the OneDrive fiasco. If they really couldn't afford it, they should have learned from Microsoft's mistake. I think it was an intentional ploy to acquire users.

mhnbg commented 7 years ago

Mark Penner wrote:

Agreed. I'm dropping ACD. I'm really disappointed in them. They offered that unlimited plan /after/ the OneDrive fiasco. If they really couldn't afford it, they should have learned from Microsoft's mistake. I think it was an intentional ploy to acquire users.

Ditto. I'm just evaluating Google Cloud now. There's even a Linux client ("gsutil") in my distro's repository which has an "rsync" option. Looks promising.

-Matt

davetbo commented 7 years ago

Anyone tried using RioFS to mount S3 and then using that mount point for your Borg repo? I've used RioFS and it's pretty stable. Much better performance than yas3fs, in my experience.

ThomasWaldmann commented 7 years ago

@davetbo https://github.com/skoobe/riofs#known-limitations doesn't sound like it could work, but maybe just try, maybe docs there are outdated.

davetbo commented 7 years ago

Does borg append to existing files? I thought it either created or deleted the blocks, but never appended to them. Maybe it appends to other files, though.
Does it rename folders? I wouldn't know.
Does it expect "posix filesystem semantics?" I wouldn't know that either.

I will give it a try and post back with my results. In the meantime, if anyone else comes along and has any feedback on having tried it, maybe they'll share :)

borgbackup / borg

Borg backup to Amazon S3 on FUSE? #102