Open pfernie opened 4 years ago
Oh. Interesting. The current scheme is designed to support multiple hosts syncing with something like dropbox/syncthing, potentially often offline for long periods of time. These usually can handle renames very efficiently, but don't like conflicts. I never considered rsync.
It seems to me that for rsync support it would be better to just devise some additional scheme altogether. Example: using hard links to create a cheap copy of all the alive files.
During your backup just merge all the generation dirs into one directory:
This will effectively remove the `0d-xxx` paths from all path names and make a directory that appears stable to rsync.
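One way the merge described above might look, as a rough sketch only. The function name `merge_generations` is made up for illustration, and it assumes that every subdirectory of the repo root is a generation directory and that the repo and the stable directory live on the same filesystem (hard links cannot cross filesystems).

```sh
# Hypothetical helper: hard-link the contents of every generation
# directory under <repo> into one flat <stable> directory.
# Assumption: each subdirectory of <repo> is a generation directory.
merge_generations() {
    repo=$1
    stable=$2
    rm -rf -- "$stable"
    mkdir -p -- "$stable"
    for gen in "$repo"/*/; do
        [ -d "$gen" ] || continue
        # Recreate the directory tree without the generation prefix.
        (cd "$gen" && find . -type d) | while IFS= read -r d; do
            mkdir -p -- "$stable/$d"
        done
        # Hard-link each file; skip paths an earlier generation already provided.
        (cd "$gen" && find . -type f) | while IFS= read -r f; do
            [ -e "$stable/$f" ] || ln -- "$gen/$f" "$stable/$f"
        done
    done
}
```

Because only links are created, the copy costs almost no extra disk space, and the resulting paths no longer contain the generation name.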
I also wouldn't mind merging a PR that adds a `--export-stable-shallow-copy <path>` option that would do the copy for you, so then your backup script would be: `rm -rf <shallow_copy_dir>; rdedup <opts> --export-stable-shallow-copy <shallow_copy_dir>; rsync <shallow_copy_dir> ...`
That makes sense. For my needs, perhaps a simpler flag to `gc` would suffice: `--echo-current-gen`. This would print the name of the remaining generation after a complete `gc` run. This would ignore unpruned generations that have names (e.g. younger than `grace_time`). This behavior would mean you need a fully `gc`ed repo, but that is my expected use case.
This would facilitate handling most of the shallow copy externally via a script. It could call `gc --echo-current-gen` and store the reported generation name. The script could then do a simple `cp -al <repodir>/<generation-name> <stable-dir>/stable`, as well as hardlinking the `config.yml` for the repo, and then `rsync <stable-dir>`.
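A rough sketch of the external script step described above. `--echo-current-gen` is only a proposal in this thread, so the sketch takes the generation name as an argument, however it was obtained. The function name `make_stable_copy` is hypothetical, and `cp -al` assumes GNU coreutils.

```sh
# Hypothetical helper: build <stable_dir>/stable as a hard-linked copy of
# one generation, plus a hard link to the repo's config.yml.
# Usage: make_stable_copy <repodir> <generation-name> <stable-dir>
make_stable_copy() {
    repo=$1
    gen=$2
    stable_dir=$3
    if [ ! -d "$repo/$gen" ]; then
        echo "no such generation: $gen" >&2
        return 1
    fi
    rm -rf -- "$stable_dir"
    mkdir -p -- "$stable_dir"
    # -a preserves attributes, -l creates hard links instead of copying data.
    cp -al -- "$repo/$gen" "$stable_dir/stable"
    if [ -f "$repo/config.yml" ]; then
        ln -- "$repo/config.yml" "$stable_dir/config.yml"
    fi
}
```

After this, something like `rsync -a --delete <stable-dir>/ <destination>` would see the same `stable/` path on every run, regardless of the generation name.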
Edit: I am happy to submit a PR for this approach, if it seems desirable.
Do we really need to print anything? Wouldn't having a file/symlink pointing to the most recent generation be a better API than console output?
Other than the above, I'm happy to land something that unblocks your use-case.
Maybe it could be a separate command, so then you can:
rdedup gc && rdedup get-last-gen-path > last-gen-to-backup && rsync $(cat last-gen-to-backup) ...
Kind of thing. The names and exact commands could differ.
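Until such a command exists, the lookup could be approximated outside rdedup. This is a sketch only: `last_gen_path` is a made-up name, and it assumes that every subdirectory of the repo root is a generation directory and that the zero-padded incrementing prefix makes the newest generation sort last.

```sh
# Hypothetical stand-in for the proposed "get-last-gen-path":
# print the path of the generation directory that sorts last by name.
# Assumption: all subdirectories of <repo> are generation directories.
last_gen_path() {
    repo=$1
    last=
    # Glob results are sorted, so the final match is the highest-numbered one.
    for gen in "$repo"/*/; do
        [ -d "$gen" ] || continue
        last=${gen%/}
    done
    [ -n "$last" ] && printf '%s\n' "$last"
}
```

Note that this does not check whether `gc` has completed, so it is only meaningful on a fully collected repo.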
I'm interested in this too. My use case is to have a local repo that is rclone'd to cloud storage. The local and the cloud repo will have different prune/gc strategies (the remote will retain more backups, the local maybe just one).
This has obviously been dormant a while; in my particular case I've simply not been `gc`ing to avoid the "naming churn", mainly because I did not have strong opinions on the direction an alternative could take.
There are some things one can do externally to the `rdedup` command, as described in this thread, but generally I think it might be worthwhile allowing for a "simplified generations" scheme, configurable for a repo, which would follow the scheme outlined prior of having just two generations (`cur` and `old`). For me, at least, the `old` generation only really exists in the case of an interrupted/in-process `gc`. That being said, I'm unclear on the utility of having more than 2 generations, other than some backends/sync schemes don't care about naming so it doesn't make a difference for them. That is, I think the `cur/old` scheme could support all cases, and in reality outside of interrupted/in-process `gc` calls most sync procedures will only see a single generation at a time.
So I'd be happy to help with a variety of approaches, depending on what direction makes sense for a majority of use cases.
I have thought quite a while about this. Find a stab on this in #190.
(and with this today's ticket work session ends, have fun, good nite! ;-)
For my setup, I typically create backups locally and then rsync them to NAS/remote storage. I also routinely prune older names, and run `gc` to clean up old chunks. However, this seems to not play well with rsync; with the generational GC behavior, each `gc` creates a new generation file structure, and as a result the `rsync` effectively copies the entire repo each time after a `gc`.
Looking at the generational gc behavior, I believe it is effectively designed to only ever anticipate at most 2 generations: the old and new. However, each new generation is labeled incrementally (`00..01-<rand>`, `00..02-<rand>`, etc.). Given that the implementation only operates on 2 generations, would an alternative implementation make sense:
- Only use two generation names: `00-cur` and `01-old`
- On `gc`:
  - `00-cur` and `01-old` exist, and `01-old` still contains names. In such case, continue migration.
  - `01-old` does not exist or contains no names.
    - If `01-old` exists but contains no names, remove it (would typically have been removed at end of prior complete `gc` run).
    - Rename `00-cur` to `01-old`
    - Create a new `00-cur`
    - Migrate `01-old` to `00-cur` per existing logic.
- At the end of `gc` migration (no names remaining in `01-old`), remove `01-old` containing only unreferenced chunks.
By reducing the naming churn, rsync operations can behave more cleanly; if an rsync is not done while a migration is in progress, typically only the `00-cur` generation will exist and will be trivially sync'd under usual rsync logic. If an rsync is done during a migration/an interrupted migration, "extra" syncing will occur, but this properly captures the "in progress" state of the `gc`.
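To make the proposed `00-cur`/`01-old` flow concrete, here is a minimal shell sketch of just the directory bookkeeping; the actual chunk and name migration is left to rdedup's existing logic. The function names are hypothetical, and the check for "contains names" assumes names live under a `name/` subdirectory of each generation, which is an assumption about the on-disk layout rather than something confirmed here.

```sh
# Assumption: a generation "contains names" if <gen>/name/ has any entry.
has_names() {
    [ -d "$1/name" ] && [ -n "$(ls -A -- "$1/name")" ]
}

# Hypothetical: prepare <repo> for a gc run.
# Prints "resume" if an earlier migration is still in progress, else "start".
gc_prepare() {
    cur=$1/00-cur
    old=$1/01-old
    if has_names "$old"; then
        echo resume
        return 0
    fi
    # A leftover 01-old without names holds only unreferenced chunks.
    rm -rf -- "$old"
    if [ -d "$cur" ]; then
        mv -- "$cur" "$old"
    fi
    mkdir -p -- "$cur"
    echo start
}

# Hypothetical: called once migration is done; refuses while names remain.
gc_finish() {
    old=$1/01-old
    if has_names "$old"; then
        echo "migration incomplete: 01-old still contains names" >&2
        return 1
    fi
    rm -rf -- "$old"
}
```

With this bookkeeping, a sync that runs between completed `gc` runs only ever sees `00-cur`, which is the point of the proposal.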