
riak_kv_sweeper with alternate backend capabilities [JIRA: RIAK-3404] #1649

martinsumner opened this issue 7 years ago

martinsumner commented 7 years ago

The leveled backend is designed to make scanning across partitions to find subsets of key metadata much more efficient, due to the separation of keys/metadata from values. For example: scans for Keys/Clocks to support hashtree rebuilds, scans to assess the total number of keys, or scans to sum object sizes (the object size is kept as metadata).

However, moving hashtree rebuilds to riak_kv_sweeper seems to bake in the use of fold_objects to find this information (although the rebuild itself may not depend on the whole object, other participant functions may).

I've added a leveled issue with some consideration of how to address this: how should riak_kv_sweeper/leveled be changed so that there is no efficiency loss from folding across values when information about the values is not required?

https://github.com/martinsumner/leveled/issues/59

I'd appreciate some feedback from Riak developers on what they consider the best answer to be. Also, I'm not really aware of the plans/intentions for riak_kv_sweeper, so more information on how it is expected to be used would be helpful to inform any decision.

Currently I'm likely to proceed with the fold_heads approach, storing a proxy to the value within the riak_object, partly for expediency, so that I can get to a volume test faster to validate the potential advantages of this approach to backend design.
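To make this concrete, here is a rough sketch of the proxy idea. The record and function names are purely illustrative, not actual leveled code:

```erlang
%% Illustrative sketch only.  The Ledger value already holds a
%% stripped riak_object binary, so a head fold can run in key order
%% without touching the Journal; a small proxy term stands in for
%% the values that have been removed from the contents.
-record(value_proxy,
        {sqn  :: non_neg_integer(),    %% Journal sequence number
         size :: non_neg_integer()}).  %% size of the original object

head_fold_fun(UserFun) ->
    fun({Bucket, Key}, {HeadBin, #value_proxy{} = _Proxy}, Acc) ->
            %% The callback sees the same {B, K, ObjBin} shape that a
            %% fold_objects callback would, just with the values absent.
            UserFun(Bucket, Key, HeadBin, Acc)
    end.
```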

nickelization commented 7 years ago

Hi Martin,

This is a good question. So in some instances, the sweeper may combine multiple sweep callbacks into a single fold, if two callbacks are scheduled to run around the same time. In this case, if even one of the sweep callbacks requires the object data, then we're going to have to do a fold_objects call anyway, so having a separate fold_metadata call (or whatever we might choose to name it) won't be able to save us anything.

However, I think we could add a new option flag to sweeps where we specify that we only need metadata. The sweeper could then look at whether all the callbacks are metadata-only, and if none of the sweep callbacks it's about to run need the full object, it could do a metadata fold. (In the case of a mixed fold where some of the callbacks do need the full objects, we could still extract just the metadata to feed to the metadata-only callbacks so that we don't need to write two versions of the callback for the different fold types.)
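Something like the following, perhaps (purely a sketch; the record and field names are hypothetical, not the real sweeper code):

```erlang
%% Hypothetical shape of the option flag.  Each participant declares
%% whether it can work from metadata alone; a single full-object
%% participant in a combined sweep forces fold_objects for everyone.
-record(sweep_participant, {module        :: module(),
                            sweep_fun     :: fun(),
                            metadata_only :: boolean()}).

choose_fold_type(Participants) ->
    AllMetadataOnly =
        lists:all(fun(#sweep_participant{metadata_only = MDOnly}) ->
                          MDOnly
                  end, Participants),
    case AllMetadataOnly of
        true  -> fold_metadata;  %% or whatever we choose to name it
        false -> fold_objects
    end.
```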

This seems like it would give us the best of both worlds, though I am curious how leveled would perform on a full fold_objects sweep, since it stores the object data separately from the metadata. (It's been a while since the architecture review, and my memory is a bit rusty, so I can't remember how this particular operation would work.) If a full object fold performs poorly under leveled, we could always suggest avoiding sweeper plugins that require that, but it would be a bummer if we couldn't use features like object TTL or tombstone reaping without falling back to leveldb.

In regards to your question about future plans for the sweeper code: it should be included in the next major release of Riak (though I actually left Basho a couple weeks ago, so don't take my word as authority; plans could have changed). It's not meant to be a documented API with a stable interface, though: it's really just meant to be internal to Riak, so modifications like the one I suggested above should be easy.

Thanks, Nick

martinsumner commented 7 years ago

Thanks Nick.

fold_objects would perform relatively poorly if it were to be done in leveled, as in the current implementation the fetching of the values would be a non-sequential read. That can be changed, but this would mean that fold_objects would not return in key order - and I'm not sure if anything depends on that.

The interesting thing about the other jobs sweeper might need fold_objects for is that they may already be native features of leveled. For example, leveled already supports per-object TTL, reclaiming space through the existing compaction process (so this wouldn't need a sweep). Although I've not yet proven it in implementation, I believe I have a native approach to tombstone reaping as well.
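For anyone reading along, the shape of a per-object TTL put in leveled is roughly as below (simplified: the expiry is an absolute timestamp, and expired objects are reclaimed by the normal journal compaction rather than by a sweep):

```erlang
%% Simplified illustration of a per-object TTL put against the
%% leveled bookie; error handling and option plumbing omitted.
put_with_expiry(Bookie, Bucket, Key, Object, IndexSpecs, Tag, ExpiryTS) ->
    leveled_bookie:book_tempput(Bookie, Bucket, Key, Object,
                                IndexSpecs, Tag, ExpiryTS).
```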

Perhaps a participant function should have the option to provide a list of backend capabilities that make it redundant, and a Mod:capabilities check can be made when the fold is about to run, skipping the fold if the backend advertises any of those capabilities. So sweeps would run on leveldb backends, but not on leveled backends, where it is currently easier just to make the feature part of the native design.
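A sketch of the check (the record and field names are hypothetical, but Mod:capabilities/1 is the existing riak_kv backend callback, returning {ok, [atom()]}):

```erlang
%% Sketch only.  Each participant lists the backend capabilities
%% that would make its sweep redundant (e.g. [ttl]).
-record(participant, {sweep_fun    :: fun(),
                      redundant_if :: [atom()]}).

should_run(#participant{redundant_if = Native}, Mod, ModState) ->
    {ok, Caps} = Mod:capabilities(ModState),
    %% Skip the fold if the backend natively provides any capability
    %% that this participant's sweep would otherwise deliver.
    not lists:any(fun(Cap) -> lists:member(Cap, Caps) end, Native).
```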

I have an implementation of hashtree rebuilds with fold_heads and leveled ready to test, so I will add more details once this is tested.

Thanks

Martin

P.S. Sorry to hear you've left Basho, good luck wherever you're going, and a special big thank-you for taking the time to answer anyway.

nickelization commented 7 years ago

Thanks! I may not be a paid contributor anymore, but it's still an OSS project, and I'd still like to see it succeed :). Definitely also interested in leveled, so I'm happy to field questions like this too.

I'm not sure if anything depends on fold_objects returning values in key order, but it seems like it shouldn't, because we do support non-sorted backends (e.g. Bitcask). I think most places I've seen us depending on the sort order of keys have been in code that directly uses leveldb, rather than going through the riak_kv backend interface (e.g. the AAE hashtree code). More research is probably needed to confirm my hunch though.

We have done implementations of TTL in the backend before, both in Bitcask and LevelDB, but there are some inherent limitations to this approach, which is part of why we created a separate implementation in the sweeper. The trouble is, if values expire directly in the backend without Riak having any higher awareness of it, you can end up in all sorts of tricky edge cases where a value expires on one replica, but then gets read repaired back to life before it expires off of other replicas. Then when you start adding in multi data center replication and the like, the problem just gets worse.

Having TTL implementations in the backend is still useful, since some applications can accept these sorts of edge cases, and it can be much more performant to do this work in the backend (especially since, like you mention, you can do it efficiently as part of the existing compaction process). But, that said, I think the sweeper implementation (or something like it) is going to be necessary for the foreseeable future if we want to be able to provide a reliable TTL mechanism without the weird, unpredictable behavior that backend TTL schemes tend to trigger.

martinsumner commented 7 years ago

OK, I did try to do some reasoning about reappearing objects as I was implementing the TTL, but it may well have been naive and incomplete.

If we assume a kv_sweeper implementation is necessary, I'm still not sure it needs to be fold_objects in leveled. The leveled keystore contains keys and metadata, and for Riak objects the metadata is a full Riak object binary with the #r_content values stripped out - only the content values are removed.

So leveled has two stores: the journal, and the ledger. The journal contains the full object (using CDB files) with a Key based on a combination of the Object Key and the sequence number. The ledger is an LSM-tree and maps Keys to a value which is some metadata specific to leveled (including the sequence number, and the size of the original object) plus the original Object binary stripped of the values, but still including all the vclock information and metadata for each sibling.

So we can efficiently fold_heads (across the Ledger in key order), returning Riak keys and value-less objects, and much less efficiently fold_objects (across the Ledger, fetching each value one-by-one from the Journal). With fold_heads you get returned a {B, K, ObjBin} just as in fold_objects, and the ObjBin will behave exactly the same as the original object after being processed through riak_object:from_binary, as long as you only try to get clock/metadata/sibling-count style information and don't try to get at the value.
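To illustrate what a consumer sees (a sketch, but riak_object:from_binary/3 and the metadata accessors are the real riak_kv API):

```erlang
%% Sketch of a fold_heads consumer.  Everything below is safe on a
%% value-less object; only riak_object:get_value(s) would need the
%% Journal.
HeadFoldFun =
    fun(Bucket, Key, ObjBin, Acc) ->
            Obj = riak_object:from_binary(Bucket, Key, ObjBin),
            Clock = riak_object:vclock(Obj),
            MDs = riak_object:get_metadatas(Obj),
            %% e.g. hash {Key, Clock} for a hashtree rebuild, or test
            %% each sibling's metadata for <<"X-Riak-Deleted">> when
            %% reaping tombstones.
            [{Bucket, Key, Clock, length(MDs)} | Acc]
    end.
```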

Tombstone reaping and object expiry still seem to me like things that can be managed from the value-less riak object - surely they don't care about anything other than object metadata? So only fold_heads is required?

nickelization commented 7 years ago

Ah okay yeah, that's a good point. If the last-modified tags can be accessed in the metadata returned from fold_heads, then those sweeper plugins should work fine without needing fold_objects. So really, out of all the sweeper plugins we've implemented so far, I think fold_heads should suffice for all of them. Awesome :)