dpc / rdedup

Data deduplication engine, supporting optional compression and public key encryption.

Rework garbage collection to keep repo directory #190

Open geek-merlin opened 3 years ago

geek-merlin commented 3 years ago

Problem

In #172, it was discovered that the changing repo dir name causes several problems.

Refs

GC was updated in #32 #75 #132

Current situation

Let's first state the current situation and how it evolved, as a reality check.

GC checks, for every chunk, whether it has a direct or indirect reference from a name, and keeps only the referenced chunks. There are sophisticated ways to do that without locking, so backup and GC can safely run at the same time, but for the sake of simplicity we are using a suitable form of locking (creating a lock file in the repo). (But note that #188 might benefit from no locks.)
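For illustration, a minimal sketch of what such lock-file locking could look like (the lock file name and helper names are assumptions, not rdedup's actual implementation):

```rust
use std::fs::OpenOptions;
use std::io;
use std::path::Path;

/// Try to take a repo-wide GC lock by creating a lock file.
/// `create_new(true)` fails if the file already exists, so only one
/// process can hold the lock at a time. (Hypothetical sketch.)
fn acquire_gc_lock(repo: &Path) -> io::Result<()> {
    OpenOptions::new()
        .write(true)
        .create_new(true)
        .open(repo.join(".gc-lock"))
        .map(|_| ())
}

/// Release the lock by removing the lock file.
fn release_gc_lock(repo: &Path) -> io::Result<()> {
    std::fs::remove_file(repo.join(".gc-lock"))
}
```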

Some time ago, the GC reference checks were done in a memory-based key-value store (in fact we only need the keys), but this soon gets huge and does not scale. So in #132 this was changed to a method where chunk files are moved when a reference is seen, so the repository itself is used as a key-value store.

While this has scalability wins, it also brings us some drawbacks:

Proposal

Let's combine the best of both approaches and use a scalable KV store like this:

GC service:

The trashlist service can be any scalable (non-memory-only) KV store (we only use keys and leave the value empty). Let's e.g. start with the caves crate and its local file backend, and evaluate its rocksdb backend and maybe other options in a follow-up.
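To make the idea concrete, here is a minimal sketch of the interface such a trashlist would need; the trait is a hypothetical abstraction of my own, not the API of the caves crate or any particular backend:

```rust
use std::io;

/// Key-only view of the trashlist: values are always empty, we only care
/// whether a chunk id is still on the list. (Hypothetical trait, to be backed
/// by a file-based store, rocksdb, or whatever turns out to scale.)
trait TrashList {
    /// Record a chunk id as a deletion candidate at the start of GC.
    fn insert(&mut self, chunk_id: &str) -> io::Result<()>;
    /// Drop a chunk id as soon as a reference to it is seen.
    fn remove(&mut self, chunk_id: &str) -> io::Result<()>;
    /// Visit every chunk id still on the list at the end of the pass.
    fn for_each(&self, visit: &mut dyn FnMut(&str)) -> io::Result<()>;
}
```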

dpc commented 3 years ago

Using remote cloud storage as KV store is bound to be extremely slow

I don't understand this one. Is it because KV store doesn't have a rename (changing the key) and only add/remove?

The current GC is the way it is because it originally allowed multiple clients on different desktop computers to do whatever they need to do, and sync later over dropbox/syncthing. That was the original use case I wrote rdedup for (since that's how I tend to use my desktop/laptops: setup syncthing everywhere, and use it as a free sync & backup). I don't think the proposed solution necessarily works in that case, and it also has operational issues (user has to setup some additional

Over time, people pop up with all sorts of different use cases, including industrial-level backups, and even trying to turn rdedup into disk-like storage, and so on.

So, we're reaching the point where rdedup would become too many things for too many different use-cases, and I'm not sure what to do about it.

Possibilities:

geek-merlin commented 3 years ago

Yes, I wanted to make the use-case question explicit in #189. Sorry if that feels overwhelming. But I am quite confident that - with reasonable tradeoffs - rdedup can stay as lean as it is, which is the reason I chose to put some energy into it (sorry, no Rust coding for now...).

geek-merlin commented 3 years ago

Regarding this proposal: I studied your comments in #172 and want to find a way to solve that. If you bear with me, maybe we can come up with something.

The current GC is the way it is because it originally allowed multiple clients on different desktop computers to do whatever they need to do, and sync later over dropbox/syncthing. That was the original use case I wrote rdedup for (since that's how I tend to use my desktop/laptops: setup syncthing everywhere, and use it as a free sync & backup).

Yes, that's my use case too, but with rclone.

I don't think the proposed solution necessarily works in that case,

Can you elaborate? It is not so far from what you said might work in #172. It collects a list of files to delete, and then deletes them.

and it also has operational issues (user has to setup some additional ???

(Hmm, you did not finish the sentence, and i'm unable to guess)

geek-merlin commented 3 years ago

Or let me elaborate on the process; maybe this helps.

(As I understand it,) GC as described in #132 maintains two directories with magic names, and in #172 you said that if it helps we might rename them to cur and old. Everything goes to old, and what is used is copied to cur. In the end, old is deleted. There is locking to prevent interfering with backups.

The above breaks down to: there is only cur. When we start, we create a local temporary directory, tmp-old, and in it an empty placeholder file for each cur chunk. For each used chunk, we delete its placeholder in tmp-old (we could copy to a tmp-cur, but we won't need it anyway). In the end, for every leftover placeholder in tmp-old, we delete the corresponding chunk. So we end up doing approximately the same as above, but the problems from #172, and additionally those I mentioned in the issue summary, won't manifest. In other words: nothing bad can happen, as the effects are the same as in the current GC, but some bad things won't happen.
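A rough sketch of that pass in Rust; the helper functions and the repo directory layout are placeholders for rdedup's real traversal code, not its actual API:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Sketch of the proposed single-generation GC pass.
fn gc_pass(repo: &Path, tmp_old: &Path) -> io::Result<()> {
    // 1. One empty placeholder file per existing chunk.
    for chunk_id in list_chunks(repo)? {
        fs::write(tmp_old.join(&chunk_id), b"")?;
    }
    // 2. Drop the placeholder of every chunk reachable from a name.
    for name in list_names(repo)? {
        for chunk_id in chunks_reachable_from(repo, &name)? {
            let _ = fs::remove_file(tmp_old.join(&chunk_id));
        }
    }
    // 3. Every leftover placeholder marks an unreferenced chunk; delete it.
    //    (The "chunk" subdirectory name is illustrative, not the real layout.)
    for entry in fs::read_dir(tmp_old)? {
        let chunk_id = entry?.file_name();
        fs::remove_file(repo.join("chunk").join(&chunk_id))?;
    }
    Ok(())
}

// Hypothetical helpers standing in for the real repo traversal:
fn list_chunks(_repo: &Path) -> io::Result<Vec<String>> { unimplemented!() }
fn list_names(_repo: &Path) -> io::Result<Vec<String>> { unimplemented!() }
fn chunks_reachable_from(_repo: &Path, _name: &str) -> io::Result<Vec<String>> { unimplemented!() }
```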

For the syncthing case, it means less file renaming, and for the rsync (@pfernie) and rclone (@geek-merlin ;-) cases, it avoids file re-copying.

(But I'm only proposing this; due to my lack of rust-fu, it will only happen if @pfernie from the other ticket finds this convincing and gives it a shot.)

pfernie commented 3 years ago

I would also like to keep the implementation here simple, and not fall into the trap of trying to support "all use cases". But I do think there is a scheme that remains simple and shaves off some of the warts. @geek-merlin has already linked and summarized some history, but from my understanding the current generational scheme is an elegant solution to the need for incremental GC, as state is tracked in the filesystem. But outside this implementation detail, generations are not helpful for any use case in and of themselves.

To run with the idea @geek-merlin sketched out above, here is an approach that hopefully retains the elegance of the current scheme but avoids the generational implementation.

On invoking gc:

  1. Enumerate the list of chunks, recording them in a statefile gc-chunks. The store is locked during this enumeration, so no new names or chunks can be created while creating and recording the list. It may be useful to also store some additional metadata, e.g. gc_start_time, etc.
  2. Presence of this file indicates a GC is already in progress; if a gc is initiated and this file exists, it is loaded so the in-progress GC cycle can continue.
  3. For names, a new piece of metadata is added, field gc_generation. This field is optional (and therefore backwards compatible with current .yml files), and its absence in the .yml indicates that name has not participated in any GC cycles yet. Newly created names can therefore continue to omit this field (and removing this value, if defined, from a .yml will not break GC behavior).
  4. Enumerate and read the metadata for all names, finding the largest generation number across all (we consider generations as monotonically increasing). If none is defined, the oldest generation is 0. Under the implementation, the expectation is that there are at most 2 different generation numbers seen, and that they are 1 generation apart (e.g. 1 and 2, or 2 and 3). However, this is not technically a necessary invariant.
  5. The new generation (new_generation) is then simply the increment of the largest found generation.
  6. gc then traverses the current set of names (whether this is a new GC or the resumption of an in-progress GC), and for each name with a generation lower than the target new_generation (or with no defined gc_generation), removes every chunk reachable from that name from gc-chunks, if present.
  7. When all chunks for a name have been processed, the metadata (.yml) for that name can be updated to set its gc_generation to the new_generation value, thereby recording its completion. It is important that this metadata is only modified after the gc-chunks statefile is updated; in the case that the operation is interrupted in between (gc-chunks is updated, but the process ends before updating the name metadata), the store is in an acceptable state. Resuming the gc will simply process that name again, which will have no effect since any chunks it references have already been removed from gc-chunks.
  8. Finally, when all names have graduated to new_generation, the GC is considered complete. Any chunks remaining in gc-chunks can now be deleted. The store should be suitably locked for this cleanup phase, so no new names or chunks are created.
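To make the flow above concrete, here is a rough sketch of the core loop; NameMeta, the in-memory stand-in for gc-chunks, and all identifiers are hypothetical, and persistence of the statefile and of the per-name .yml updates (the ordering in step 7) is elided:

```rust
use std::collections::HashSet;

/// Hypothetical per-name metadata, as it would appear in the name's .yml file.
struct NameMeta {
    gc_generation: Option<u64>, // absent: never took part in a GC cycle yet
    chunks: Vec<String>,        // chunk ids reachable from this name
}

/// One (resumable) GC cycle following steps 1-8 above. Returns the chunk ids
/// that remain in gc-chunks, i.e. the ones that are safe to delete (step 8).
fn run_gc(all_chunks: &HashSet<String>, names: &mut [NameMeta]) -> HashSet<String> {
    // Steps 1-2: the statefile starts as the full chunk list (or is reloaded
    // from disk when resuming; omitted here).
    let mut gc_chunks: HashSet<String> = all_chunks.clone();

    // Steps 4-5: new generation = 1 + largest gc_generation seen on any name.
    let largest = names.iter().filter_map(|n| n.gc_generation).max().unwrap_or(0);
    let new_generation = largest + 1;

    // Steps 6-7: every name below new_generation removes its reachable chunks
    // from gc-chunks, then records its graduation in its own metadata.
    for meta in names.iter_mut() {
        if meta.gc_generation.map_or(true, |g| g < new_generation) {
            for chunk in &meta.chunks {
                gc_chunks.remove(chunk);
            }
            // In the real scheme, gc-chunks is persisted *before* this write.
            meta.gc_generation = Some(new_generation);
        }
    }

    // Step 8: whatever is left was reachable from no name.
    gc_chunks
}
```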

I believe this approach supports the contemplated use cases. It is not an explicit intent of this approach, but it seems it would also facilitate GC being carried out on multiple hosts for a store: for a store synchronized between multiple hosts, one could initiate a gc on one host, interrupt it, and then resume it on another host (assuming the store is synchronized appropriately). That being said, I have not given this usage much thought, so there may be subtleties here I am overlooking (perhaps warranting something like a "GC lock"?).

Some additional comments:

dpc commented 3 years ago

The store is locked during this enumeration, so no new names or chunks can be created while creating and recording the list.

I don't think you understand the original use-case well.

I have 2 or more laptops. There's dropbox/syncthing syncing the rdedup repo between them. But they are not connected with each other at all times. The network can be down, syncthing can be down, a laptop can be turned off, etc. Each can be doing anything it wants - GC, backup, etc., doesn't matter. And when connectivity is restored, they will do a dumb file sync and everything will still work: no data can be lost, and no file conflicts ever happen. That's what the current scheme was designed to support. Before actually deleting any files, there's a timeout (in days) to allow time for sync. All that is required is that all hosts sync everything at least that frequently. If any host introduced files into an old generation that was GCed on another host, then from all hosts' perspective it looks like the incremental GC was not actually completed. On the next GC run, new files in the old generation will be brought into the new generation, re-completing the garbage collection, and the timeout will be restarted. Eventually all hosts will be aware of the new generation and write to the latest generation directly, and the old generation(s) will be considered stale and all files in them will be actually removed.

I'm quite fond of this scheme, because it's a fully distributed p2p garbage collection system, where everything is idempotent and immutable.

No "locked during this enumeration" can work in such a use-case, because there is no locking possible. And there can't be any mutable files, because dropbox/syncthing can not synchronize between concurent changes to the same file.

Now, after this was all working, there came the use-case of "remote storage". So instead of having N copies of all files, synchronized with dropbox, there's now just one copy remotely accessed by all hosts. It's a completely different scheme. And there locking (of the storage) was added and development went forward. But personally I'm still syncing with syncthing and using rdedup to scratch my original use-case niche.

Now, having said that, I kind of read through the design and it's like 90% the same as what's currently being done. There's just no gc-chunks file. GC just picks a name in the old generation and starts atomically moving all its data to the new generation. The data itself keeps track of the progress. Once all the chunks of a name are migrated, the name itself is atomically moved. And that's basically the progress bar. There's really never any reason to keep track of all the chunks at once.
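For illustration, a sketch of that per-name migration; the directory layout and the chunks_of_name helper are assumptions, not rdedup's exact on-disk format:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Move every chunk a name references from the old generation to the new one,
/// then move the name itself. An interrupted run simply resumes: the location
/// of each chunk *is* the progress marker.
fn migrate_name(old_gen: &Path, new_gen: &Path, name: &str) -> io::Result<()> {
    for chunk_id in chunks_of_name(old_gen, name)? {
        let src = old_gen.join("chunk").join(&chunk_id);
        if src.exists() {
            // rename is atomic within the same filesystem.
            fs::rename(&src, new_gen.join("chunk").join(&chunk_id))?;
        }
    }
    // Only after every chunk has moved does the name graduate.
    fs::rename(old_gen.join("name").join(name), new_gen.join("name").join(name))
}

// Hypothetical helper standing in for reading the name's chunk list.
fn chunks_of_name(_gen: &Path, _name: &str) -> io::Result<Vec<String>> {
    unimplemented!()
}
```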

dpc commented 3 years ago

The only weakness of the current scheme that I'm aware of is that no data can ever be mutated in place, so things have to change names. This is not a problem for dropbox and syncthing, because they quite naturally have good detection of renames. However, stuff like rsync is a problem.

https://lincolnloop.com/blog/detecting-file-moves-renames-rsync/

I really think that hardlink export and import would be the easiest way to solve this.

rdedup hardlink-export would create a cheap clone of the state of the repo, with all the chunks under generation-agnostic names that would never change. This export could be rsynced to another machine, which could use rdedup hardlink-import to hardlink the new data back into the repo itself. Just in case: having a full copy of all files via hardlinks is almost free, and fast. So I think that scheme is very practical. And code-wise it's probably 100-200 lines of code total in rdedup.
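A rough sketch of the export half of that idea (the flat one-directory-of-chunks layout and the function name are assumptions for illustration; import would just hardlink in the other direction):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hard-link every chunk file into an export directory under a stable,
/// generation-agnostic name (here simply the chunk's file name), so rsync
/// sees paths that never change.
fn hardlink_export(repo_chunks: &Path, export_dir: &Path) -> io::Result<()> {
    fs::create_dir_all(export_dir)?;
    for entry in fs::read_dir(repo_chunks)? {
        let entry = entry?;
        let dst = export_dir.join(entry.file_name());
        if !dst.exists() {
            // A hard link copies no data: it is just another directory entry
            // pointing at the same inode.
            fs::hard_link(entry.path(), &dst)?;
        }
    }
    Ok(())
}
```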

pfernie commented 3 years ago

First of all, thanks for taking the time and effort to review and comment on these suggestions. In writing this up, a new issue occurred to me, which I opened separately as #191.

In regards to this thread: you're right, I had an incomplete view of the motivating use case. I had drawn the wrong conclusion that in practice there would only ever be two generations (old and cur) in progress. But in the scenario you describe, you can in fact end up with more than two generations in flight if both (or several) synchronized hosts start GC cycles independently and are then synchronized before the GCs are complete. In fact, it seems you could have the "same" generation occurring on multiple hosts, with the random suffix distinguishing them; in practice, the lexicographically "last" one ends up being considered the newest. On top of that, the issue of not being able to handle concurrent modification makes the use of a statefile or similar mutable state a problem.

The hardlink solution may be the most straightforward, but I had been exploring these designs because I prefer the idea of being able to dumbly sync in all cases (be that under the syncthing or rsync "style"), rather than having to introduce additional import/export steps (and having to think about whether goofing up those steps might cause consistency problems).

I have some thoughts on adapting the current scheme; they would probably involve altering the store structure a bit, so maybe still not desirable relative to an option like hardlink. But in thinking about it, #191 occurred to me, and it is probably better to address that first if it is a problem before further discussing alternative mechanisms.

dpc commented 3 years ago

But in the scenario you describe, you can in fact end up with more than two generations in flight if both (or several) synchronized hosts start GC cycles independently and are then synchronized before the GCs are complete. In fact, it seems you could have the "same" generation occurring on multiple hosts, with the random suffix distinguishing them; in practice, the lexicographically "last" one ends up being considered the newest.

Yes. That's exactly what happens and how it works. I think we're in sync now. :)

geek-merlin commented 3 years ago

Sorry for being off so long. This issue is still important to me, but "the house was on fire"...

Thanks a lot, @dpc, for taking the time to explain your thoughts and use case. When I find time, I will add that to the "use cases" issue. To be clear: I did not start that issue to mandate covering all use cases (though it might turn out that way). In fact, I wanted to get an overview of what people may want (and what I want) and refer to that, so we can clarify hidden assumptions.

geek-merlin commented 3 years ago

That said, I still hope we can find a GC strategy that works for all possible scenarios.

Just to get everyone on the same page: @dpc, we talked about duplicacy's Lock-Free Deduplication in another issue and you said it's similar to the current strategy. @pfernie: do you know and grok that? My hope is that we find inspiration in it that advances this.

dpc commented 3 years ago

I doubt there's going to be one strategy for all use cases, but my bet would be that 2, at most 3, would be all that is ever needed, and they would be fairly invisible to the rest of the system.