dpc / rdedup

Data deduplication engine, supporting optional compression and public key encryption.

Generational GC behavior w/ rsync #172

Open pfernie opened 4 years ago

pfernie commented 4 years ago

For my setup, I typically create backups locally and then rsync them to NAS/remote storage. I also routinely prune older names, and run gc to clean up old chunks. However, this seems to not play well with rsync; with the generational GC behavior, each gc creates a new generation file structure, and as a result the rsync effectively copies the entire repo each time after a gc.

Looking at the generational gc behavior, I believe it is effectively designed to only ever anticipate at most 2 generations: the old and the new. However, each new generation is labeled incrementally (00..01-<rand>, 00..02-<rand>, etc.). Given that the implementation only operates on 2 generations, would an alternative make sense: use fixed generation names, e.g. 00-cur for the current generation and 00-old for the one being migrated away, instead of an ever-incrementing counter?

By reducing the naming churn, rsync operations can behave more cleanly; if an rsync is not done while a migration is in progress, typically only the 00-cur generation will exist and will be trivially sync'd under usual rsync logic. If an rsync is done during a migration/an interrupted migration, "extra" syncing will occur, but this properly captures the "in progress" state of the gc.

dpc commented 4 years ago

Oh. Interesting. The current scheme is designed to support multiple hosts syncing with something like dropbox/syncthing, potentially often offline for long periods of time. These usually can handle renames very efficiently, but don't like conflicts. I never considered rsync.

It seems to me that for rsync support it would be better to devise some additional scheme altogether. Example: using hard links to create a cheap copy of all the live files.

During your backup just merge all the generation dirs into one directory:

This will effectively remove the 0d-xxx prefixes from all path names and produce a directory that appears stable to rsync.
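The merge described above could be sketched roughly like this (the generation-directory naming and repo layout here are assumptions from this thread, not a documented rdedup format; `flatten_generations` is a hypothetical helper name):

```shell
# flatten_generations <repo> <stable-dir>
# Merge all generation dirs into one directory using hard links (cp -al),
# so rsync sees stable, generation-free paths. Hard links make this a
# cheap copy: no chunk data is duplicated on disk.
flatten_generations() {
    repo="$1"
    stable="$2"
    rm -rf "$stable"
    mkdir -p "$stable"
    # Assumes generation dirs sit at the top of the repo and are named
    # <counter>-<rand>; adjust the glob for the real layout.
    for gen in "$repo"/*-*/; do
        [ -d "$gen" ] || continue
        cp -al "${gen}." "$stable"/
    done
}
```

Since the copies are hard links, a later gc that unlinks chunks in the repo does not disturb an rsync already reading from the flattened view.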

I also wouldn't mind merging a PR that adds a --export-stable-shallow-copy <path> option that would do the copy for you, so then your backup script would be: rm -rf <shallow_copy_dir>; rdedup <opts> --export-stable-shallow-copy <shallow_copy_dir>; rsync <shallow_copy_dir> ...

pfernie commented 4 years ago

That makes sense. For my needs, perhaps a simpler flag to gc would suffice: --echo-current-gen. This would print the name of the remaining generation after a complete gc run. This would ignore unpruned generations that have names (e.g. younger than grace_time). This behavior would mean you need a fully gced repo, but that is my expected use case.

This would facilitate handling most of the shallow copy externally via a script. It could call gc --echo-current-gen, and store the reported generation name. The script could then do a simple cp -al <repodir>/<generation-name> <stable-dir>/stable, as well as hardlinking the config.yml for the repo, and then rsync <stable-dir>.
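The external script described above might look roughly like this. The `--echo-current-gen` flag is only a proposal in this thread, so the sketch leaves the gc call out and works from the generation name that flag would print; `stable_copy` is a hypothetical helper name:

```shell
# stable_copy <repo> <generation-name> <stable-dir>
# Build an rsync-stable view of a single generation via hard links,
# as described above: cp -al the generation dir, hard-link config.yml.
stable_copy() {
    repo="$1"
    gen="$2"
    stable="$3"
    rm -rf "$stable"
    mkdir -p "$stable"
    cp -al "$repo/$gen" "$stable/stable"
    # Hard-link the repo config too, so the copy is a usable snapshot.
    if [ -f "$repo/config.yml" ]; then
        cp -l "$repo/config.yml" "$stable/config.yml"
    fi
}

# Intended usage, assuming the proposed flag existed:
#   GEN="$(rdedup gc --echo-current-gen)"
#   stable_copy /path/to/repo "$GEN" /path/to/stable
#   rsync -a --delete /path/to/stable/ remote:backups/
```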

Edit: I am happy to submit a PR for this approach, if it seems desirable.

dpc commented 4 years ago

Do we really need to print anything? Wouldn't having a file/symlink pointing to the most recent generation be a better API than console output?

Other than the above, I'm happy to land something that unblocks your use-case.

dpc commented 4 years ago

Maybe it could be a separate command, so then you can:

rdedup gc && rdedup get-last-gen-path > last-gen-to-backup && rsync $(cat last-gen-to-backup) ...

Kind of thing. The names and exact commands could differ.

geek-merlin commented 3 years ago

I'm interested in this too. My usecase is to have a local repo that is rclone'd to cloud storage. The local and the cloud repo will have different prune/gc strategies (the remote will retain more backups, the local maybe just one).

pfernie commented 3 years ago

This has obviously been dormant for a while; in my particular case I've simply not been gc'ing, to avoid the "naming churn". Mainly because I did not have strong opinions on the direction an alternative could take.

There are some things one can do externally to the rdedup command, as described in this thread, but I think it might be worthwhile to allow a "simplified generations" scheme, configurable per repo, following the scheme outlined earlier of having just two generations (cur and old). For me, at least, the old generation only really exists during an interrupted or in-progress gc. That said, I'm unclear on the utility of having more than 2 generations, beyond the fact that some backends/sync schemes don't care about naming, so it makes no difference to them. That is, I think the cur/old scheme could support all cases; outside of interrupted or in-progress gc calls, most sync procedures will only ever see a single generation at a time.
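As a toy model of the cur/old scheme (directory names and the exact migration protocol here are assumptions from this thread, not rdedup's actual behavior): gc renames the current generation aside, rewrites the live chunks into a fresh 00-cur, and removes the old directory last, so rsync only ever sees a stable 00-cur path except mid-migration.

```shell
# gc_two_gen <repo> <live-chunk>...
# Toy two-generation gc: 00-cur is renamed to 00-old, live chunks are
# moved into a new 00-cur, and 00-old (now holding only dead chunks) is
# removed. If interrupted, both dirs exist, capturing the in-progress
# state; the names never churn beyond these two.
gc_two_gen() {
    repo="$1"
    shift                               # remaining args: live chunk names
    mv "$repo/00-cur" "$repo/00-old"    # begin migration
    mkdir "$repo/00-cur"
    for chunk in "$@"; do
        mv "$repo/00-old/$chunk" "$repo/00-cur/$chunk"
    done
    rm -rf "$repo/00-old"               # drop dead chunks; migration done
}
```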

So I'd be happy to help with a variety of approaches, depending on what direction makes sense for a majority of use cases.

geek-merlin commented 3 years ago

I have thought about this for quite a while. Find a stab at this in #190.

geek-merlin commented 3 years ago

(and with this today's ticket work session ends, have fun, good nite! ;-)