borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/

Restore remote data in place with deduplication or rsync algorithm #95

Open oderwat opened 9 years ago

oderwat commented 9 years ago

I copied this over from the Attic Issue https://github.com/jborg/attic/issues/322

Using Attic to snapshot multiple GB of data to a remote server leaves me with the problem that restoring the data to a former state requires a full extraction, with all data traveling over the network.

I could extract on the remote server and then rsync the data back to the source, but that would still require a full extraction (time and space) and would also decrypt the backup data on the remote server, which renders the AES encryption kind of pointless.

My current best option seems to be to mount the data on the remote server and then use rsync against that. But this needs the secret key on the backup server, which I would like to avoid. It also introduces multiple points of failure and extra complexity.

What about having an extract mode which uses the already existing local data for an in-place diff recovery? This could make it pretty cheap to restore to an older state!

I guess this could be done with the existing deduplication algorithm, as it is kind of a "reverse backup".

It would also allow taking snapshots on a remote server and extracting them to another system which "mimics" the original. The latter is my use case: we could let the servers create backups every few hours and restore that state easily onto a developer machine, even if the snapshot is multiple gigabytes large and the developer has no access to the original server. It would be fast because most of the data is already on the developer machine; basically it would just restore a "diff" relative to the last extracted snapshot.

ThomasWaldmann commented 9 years ago

You can FUSE-mount the remote repo locally, so you don't need the key on the server. Try what happens if you run rsync against that. It will likely eat a lot of temporary disk space for the cache of transferred chunks. If you're lucky, rsync won't completely read files from FUSE if the metadata looks "unchanged" (depending on the rsync parameters).
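
For reference, a minimal sketch of that workaround in Python (the repository URL, archive name, and paths are placeholders; it assumes borg and rsync are installed and that the path prefix stored in the archive matches the restore target):

```python
# Sketch of the "FUSE-mount locally, then rsync" workaround described above.
# REPO, ARCHIVE, MOUNTPOINT and TARGET are placeholders, not real defaults.
import os
import subprocess

REPO = "ssh://backup@example.com/./repo"   # hypothetical remote repository
ARCHIVE = "snapshot-2015-01-01"            # hypothetical archive name
MOUNTPOINT = "/tmp/borg-mnt"
TARGET = "/srv/data/"                      # local tree to roll back

os.makedirs(MOUNTPOINT, exist_ok=True)
# Mount the archive via FUSE; the key stays on the client, not on the server.
subprocess.run(["borg", "mount", f"{REPO}::{ARCHIVE}", MOUNTPOINT], check=True)
try:
    # rsync compares the mounted snapshot with the local tree. With the default
    # size+mtime quick check, unchanged files should not be read in full, but any
    # changed file still has to be read through FUSE (i.e. fetched from the repo).
    src = os.path.join(MOUNTPOINT, TARGET.lstrip("/")) + "/"
    subprocess.run(["rsync", "-a", "--delete", src, TARGET], check=True)
finally:
    subprocess.run(["borg", "umount", MOUNTPOINT], check=True)
```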

oderwat commented 9 years ago

I did that, but it is not a solution. It sends all data over the line, and we are talking gigabytes here. rsync typically reads all data on both sides for its rolling checksum. It only works reasonably well if you mount on the remote side and rsync against that; then rsync is running on both systems and negotiates which data needs to be transferred.

ThomasWaldmann commented 9 years ago

Shouldn't rsync skip a file if mtime and size are the same? For modified files, yes, I see the problem.

About your suggestion: it's a good idea, but we need to refine it.

A) Same file locally as in the repo: using the same algorithm as when creating a new archive (check size, mtime, inode against the files cache), we can easily skip it.

B) File present locally, but not in the repo: we will notice this case when doing the check as in A), because it results in a lookup error. We then need to delete the local file (this could be optional). Of course, deleting potentially lots of files (e.g. because of some path mismatch) is dangerous.

C) Local file and repo file differ (or the local file is not present): we need to get the chunks to assemble the file as it is in the repo from somewhere, and we want to avoid transferring chunks from the (remote) repo if we can get them from local data. If a local file is present: we could add it to a temporary LocalCache repository, then delete the local file. Then build the file as it is in the repo: do a normal repo restore, but use chunks from the LocalCache and only fetch from the (remote) repo the chunks we do not have in the LocalCache (see the sketch below for how A-C could fit together). See the existing RemoteCache class, which does something similar for the FUSE mount.

D) Deal with all non-regular files (devices, symlinks, ...): just restore them all from the repo; it's usually not much (no content chunks here). If a dir is present locally, but not in the repo: rm -rf dir (it should contain only empty subdirs after B).

E) Deal with directories: set the dir metadata. This needs to be the very last step because of the dir mtime.

F) Cleanup: delete the LocalCache repo.

Note: LocalCache might be rather large, like a borg backup of all locally modified files.
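
To make the flow of A), B), C) and F) concrete, here is a rough, hypothetical sketch; none of the names below (Item, is_unchanged, chunk_file, fetch_chunk) are existing borg APIs, they just stand in for the steps described above:

```python
# Illustrative sketch of the proposed in-place restore (cases A, B, C, F above).
# The callbacks are hypothetical stand-ins, not borg code.
import os
from dataclasses import dataclass

@dataclass
class Item:
    """Stand-in for a regular-file item as stored in an archive."""
    path: str            # path relative to the restore root
    chunk_ids: list      # ids of the content chunks making up the file

def restore_in_place(items, root, is_unchanged, chunk_file, fetch_chunk):
    """is_unchanged(item): size/mtime/inode match the files cache (case A).
    chunk_file(path): yields (chunk_id, bytes) for a local file (feeds the LocalCache).
    fetch_chunk(chunk_id): returns chunk bytes from the (remote) repo."""
    local_cache = {}                          # C) temporary local chunk store
    archived = set()

    for item in items:                        # regular files only; D/E are separate passes
        dest = os.path.join(root, item.path)
        archived.add(dest)

        if is_unchanged(item):                # A) nothing to do
            continue

        if os.path.exists(dest):              # C) keep the old file's chunks around
            for cid, data in chunk_file(dest):
                local_cache.setdefault(cid, data)
            os.unlink(dest)

        os.makedirs(os.path.dirname(dest), exist_ok=True)
        with open(dest, "wb") as out:         # rebuild; only missing chunks hit the network
            for cid in item.chunk_ids:
                out.write(local_cache[cid] if cid in local_cache else fetch_chunk(cid))

    for dirpath, _, names in os.walk(root):   # B) delete local files not in the archive
        for name in names:
            path = os.path.join(dirpath, name)
            if path not in archived:
                os.unlink(path)               # dangerous; should probably be opt-in

    local_cache.clear()                       # F) cleanup
```

As the note above says, the LocalCache could get rather large, so in a real implementation it would probably be an actual temporary repository on disk rather than an in-memory dict.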

oderwat commented 9 years ago

The files are most likely modified at the end (mostly database files) and some of them are really big.

A-F all look good. Would be cool to have!

I also think that no other backup tool does this?

I am not sure about bup though, which can do "pull" backups (also nice to have), meaning you can pull over the net and restore locally, saving bandwidth. But then the repo needs to be on the developer's machine and can't be central.

oderwat commented 8 years ago

I still think that this would be a very good and probably even "killer" feature, and I would be glad to help with the implementation. I am not sure how much work it would be. Any estimate?

ThomasWaldmann commented 8 years ago

It would be quite some work. I can add that B) is somewhat of a problem, because there is no "directory with contents" metadata item in the archive. We only store the full path+filename in each file item separately, so the only time we can know what's in a directory (and what is not) is at the end of a backup/restore, when all files have been processed. There is already a ticket about that, so solving it would be a first step and also useful for other purposes. It would change the repository format.
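
For illustration, directory contents can only be derived by aggregating the stored full paths once all items have been seen, roughly like this (a hypothetical helper, not existing borg code):

```python
# Why B) has to wait until the end: directory listings must be rebuilt from the
# full paths stored in the individual file items.
import os
from collections import defaultdict

def directory_listing(item_paths):
    """item_paths: full paths as stored in the archive, one per file item."""
    listing = defaultdict(set)
    for path in item_paths:
        parent, name = os.path.split(path)
        listing[parent].add(name)
    return listing

# e.g. directory_listing(["srv/a.db", "srv/sub/b.txt"])
#   -> {"srv": {"a.db"}, "srv/sub": {"b.txt"}}
# Intermediate directories themselves would also have to be derived from the
# path prefixes before deciding what can safely be deleted locally.
```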

isadon commented 6 years ago

@ThomasWaldmann was there any progress on this issue? It's been a while since an update.

ThomasWaldmann commented 6 years ago

No news here.