basho / riak_kv

Riak Key/Value Store
Apache License 2.0

Mass deletion, Automated deletion, Reaping tombstones #1725

Open martinsumner opened 5 years ago

martinsumner commented 5 years ago

In Riak, deletion is difficult.

With the default delete setting, a delete replaces the object in the backend with a new object that is, from Riak's perspective, a tombstone. A timer is then set, and when it fires the tombstone is reaped (in essence, deleted from the backend, which in turn might use a backend tombstone of its own to defer the physical deletion). The temporary tombstone is a replicable object, so other clusters can be informed of the deletion and perform the same deletion to reach a consistent state.

There have traditionally been two problems here:

These problems can be avoided by setting delete_mode to keep (which is considered the safe way to run Riak, but is not the default), which makes the tombstones permanent. However, making the tombstones permanent has a cost in terms of uncollected garbage. That cost is an impact on disk space consumption, but also on operations which depend on object folds (e.g. key-listing full-sync, handoffs, or AAE tree rebuilds). This uncollected garbage is an issue both in the primary store and in any AAE store (tombstones are objects and exist as such in AAE).
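For reference, delete_mode is typically set in the riak_kv section of advanced.config. A minimal sketch follows; the values shown are illustrative and the exact defaults should be checked against the release in use:

```erlang
%% advanced.config sketch: controlling tombstone handling in riak_kv.
[
 {riak_kv, [
   %% delete_mode can be:
   %%   keep      - tombstones are kept permanently (the "safe" mode discussed above)
   %%   immediate - reap the tombstone as soon as the delete completes
   %%   N         - an integer number of milliseconds to wait before reaping
   %%               (the default behaviour uses a short timer of this form)
   {delete_mode, keep}
 ]}
].
```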

Further information on deletes:

So deletion of individual objects is hard; what about mass deletion? Can we delete without having the application discover the objects that need deletion and delete them one at a time using batch jobs?

There have traditionally been two mechanisms for mass deletion in Riak:

There are some problems with these approaches though:

There are some new tools available for solving these problems:

There is a general set of superficial nice-to-haves, features that now seem to be possible to implement:

There are some general problems though with making such improvements:

Change here is going to be a difficult balancing act. I think we need to consider change in two contexts:

martinsumner commented 5 years ago

Proposal 1.

The first solution to the problem of mass deletion is to extend the aae_fold find_keys query so that it can be requested to return [{Key :: riak_object:key(), IsTomb :: boolean()}] tuples.

This would allow keys in a bucket that have not been modified since a given date to be discovered for deletion, giving an application a backend-independent way of maintaining an effective TTL without requiring a multi-backend configuration. Also, as each object would then be deleted via Riak on TTL expiry, AAE consistency is already handled.
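As a rough illustration of how an application might use this, a sketch of such a sweep from a remote_console session follows. The extended query (the tomb_flag option and the {Key, IsTomb} return format) is the proposal, not an existing API, and the riak_client call shapes here should be treated as assumptions:

```erlang
%% Hypothetical sketch: an application-managed TTL sweep, assuming the proposed
%% find_keys extension that returns {Key, IsTomb} tuples.
{ok, C} = riak:local_client(),
Bucket = <<"sessions">>,
Cutoff = os:system_time(second) - (30 * 86400),  %% expire keys untouched for ~30 days

%% Proposed query: keys in Bucket last modified before Cutoff, each flagged as
%% tombstone or not (query shape and return format are assumptions).
{ok, Keys} = riak_client:aae_fold({find_keys, Bucket, all, {date, 0, Cutoff}, tomb_flag}, C),

%% Delete only the keys that are not already tombstones.
[riak_client:delete(Bucket, K, C) || {K, false} <- Keys].
```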

This requires the following changes:

This proposal could be extended by adding a TTL to the bucket properties, which would be:

The extension to the proposal would then give a Riak-managed per-bucket TTL (as opposed to an application-managed one). Results would still not be guaranteed to be consistent with the TTL, as:

Migration from backend TTL to the new TTL mechanism would be straightforward, though it would depend on enabling Tictac AAE and managing the extra cost of doing so. It would then be possible to deprecate the use of backend TTL in clusters using AAE.

This would not be a more efficient answer than backend TTL, but it would be a more predictable one (especially where backend TTL is applied alongside anti-entropy).

martinsumner commented 5 years ago

Proposal 2.

The second proposal is to have a bucket property that can set a TTL for keep tombstones. That is to say we can ask for tombstones to expire eventually.

This wouldn't use a direct timer method as in the existing delete, but would allow a long timer to be set (e.g. 30 days), far beyond the point at which we would expect any tombstone not to have been fully replicated by one mechanism or another.
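For illustration only, if this were exposed as an ordinary bucket property it might look something like the sketch below; the property name tombstone_ttl and its unit are placeholders, not an agreed interface:

```erlang
%% Hypothetical sketch: a per-bucket TTL for "keep" tombstones, expressed as a
%% bucket property. The property name (tombstone_ttl) and unit (seconds) are
%% placeholders for whatever the implementation would actually choose.
riak_core_bucket:set_bucket(<<"sessions">>,
                            [{tombstone_ttl, 30 * 86400}]).  %% reap tombstones after ~30 days
```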

The implementation of this could be:

It might be that neatness and efficiency are not such a big requirement here. Keeping tombstones is not a massive overhead, and perhaps it would be enough simply to allow an administrator to schedule a "reap sweep" for a period of inactivity. It might be nice to have a lot of stuff happen automatically and magically under the hood - but the reality is that, at first, it only needs to be better than the solution we have now. As long as we can eventually reap tombstones if an excess of tombstones becomes a problem, we have advanced the current state.

martinsumner commented 5 years ago

For those with an interest in the problem from a theoretical perspective - this is an interesting piece of research and well worth a read.

This is not implementable for Riak (too radical a change). It isn't a perfect answer (it assumes that persisted data is not lost from disk - which is not something we like to assume in Riak). It is, though, a very interesting change in overall approach.

martinsumner commented 4 years ago

The current draft branch for 2.9.1 includes:

riak_kv_reaper; riak_kv_eraser; new aae_folds to trigger reaps and erases.

This allows an application to run aae_folds to either reap a set of tombstones, or delete a range of keys that have not been modified since a given date. The former is intended to provide a mechanism to reduce the long-term cost of running a cluster in the keep mode. The latter is proposed as an alternative to backend TTL, in that it allows the expiry of objects to be managed without creating a discrepancy with AAE stores. This is similar to the original Basho 2.2.5 proposal for riak_kv_sweeper - only in our case efficiency is gained through folding over only heads using the TictacAAE store, rather than performing multiple fold actions per fold.
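As an illustration, from a remote_console session the new folds might be invoked along the lines of the sketch below. The query tuples reflect my reading of the draft branch and should be treated as assumptions; check the branch itself for the authoritative shapes:

```erlang
%% Sketch, assuming reap_tombs / erase_keys queries of the form
%% {Query, Bucket, KeyRange, SegmentFilter, ModifiedRange, ChangeMethod}.
{ok, C} = riak:local_client(),
Cutoff = os:system_time(second) - (30 * 86400),

%% Reap tombstones in a bucket that have not been modified for ~30 days,
%% handing the work to riak_kv_reaper.
riak_client:aae_fold({reap_tombs, <<"sessions">>, all, all, {date, 0, Cutoff}, local}, C),

%% Erase (delete via riak_kv_eraser) keys not modified since the cutoff, as an
%% AAE-consistent alternative to a backend TTL.
riak_client:aae_fold({erase_keys, <<"sessions">>, all, all, {date, 0, Cutoff}, local}, C).
```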

The feature will work on any store with TictacAAE enabled. In order to reap tombstones, tictacaae_storeheads must be set to enabled when running in parallel mode.
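For completeness, a riak.conf sketch of the relevant settings (the exact option names and values should be checked against the cuttlefish schema for the release in question):

```
## riak.conf sketch: enable Tictac AAE and, when running a parallel AAE store,
## keep object heads in the AAE store so that tombstones can be reaped.
tictacaae_active = active
tictacaae_storeheads = enabled
```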

martinsumner commented 3 years ago

https://github.com/basho/riak_kv/pull/1749