basho / riak_kv

Riak Key/Value Store
Apache License 2.0

Mass deletion, Automated deletion, Reaping tombstones #1725

Open martinsumner opened 5 years ago

martinsumner commented 5 years ago

In Riak, deletion is difficult.

With the default delete setting, a delete replaces the object in the backend with a new object that is, from Riak's perspective, a tombstone. A timer is then set, and when it fires the tombstone is reaped (in essence, deleted from the backend, which in turn might use a backend tombstone of its own to defer the physical deletion). The temporary tombstone is a replicable object, so other clusters can be informed of the deletion and perform the same deletion to reach a consistent state.

There have traditionally been two problems here:

These problems can be avoided by setting delete_mode to keep (which is considered the safe way to run Riak, but is not the default), which makes the tombstones permanent. However, making the tombstones permanent has a cost in terms of uncollected garbage. That cost is an impact on disk space consumption, but also on operations which depend on object folds (e.g. key-listing full-sync, handoffs, or AAE tree rebuilds). This uncollected garbage is an issue both in the primary store and in any AAE store (tombstones are objects and exist as such in AAE).
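For reference, delete_mode is typically set in the riak_kv section of advanced.config. A minimal sketch follows; the values shown are illustrative and the exact defaults should be checked against the release in use:

```erlang
%% advanced.config sketch: controlling tombstone handling in riak_kv.
[
 {riak_kv, [
   %% delete_mode can be:
   %%   keep      - tombstones are kept permanently (the "safe" mode discussed above)
   %%   immediate - reap the tombstone as soon as the delete completes
   %%   N         - an integer number of milliseconds to wait before reaping
   %%               (the default behaviour uses a short timer of this form)
   {delete_mode, keep}
 ]}
].
```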

Further information on deletes:

So deletion of individual objects is hard; what about mass deletion? Can we delete without having the application discover the objects that need deletion and delete them one at a time using batch jobs?

There have traditionally been two mechanisms for mass deletion in Riak:

There are some problems with these approaches though:

There are some new tools available for solving these problems:

There is a general set of superficial nice-to-haves, features that now seem to be possible to implement:

There are some general problems though with making such improvements:

Change here is going to be a difficult balancing act. I think we need to consider change in two contexts:

martinsumner commented 5 years ago

Proposal 1.

The first solution to the problem of mass deletion is to extend the aae_fold find_keys query so that it can be requested to return [{Key :: riak_object:key(), IsTomb :: boolean()}] tuples.

This would allow keys in a bucket that have not been modified since a given date to be discovered for deletion, giving an application a backend-independent way of maintaining an effective TTL without requiring a multi-backend configuration. Also, as each object would then be deleted via Riak on TTL expiry, AAE consistency is already handled.
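As a rough illustration of how an application might use this, a sketch of such a sweep from a remote_console session follows. The extended query (the tomb_flag option and the {Key, IsTomb} return format) is the proposal, not an existing API, and the riak_client call shapes here should be treated as assumptions:

```erlang
%% Hypothetical sketch: an application-managed TTL sweep, assuming the proposed
%% find_keys extension that returns {Key, IsTomb} tuples.
{ok, C} = riak:local_client(),
Bucket = <<"sessions">>,
Cutoff = os:system_time(second) - (30 * 86400),  %% expire keys untouched for ~30 days

%% Proposed query: keys in Bucket last modified before Cutoff, each flagged as
%% tombstone or not (query shape and return format are assumptions).
{ok, Keys} = riak_client:aae_fold({find_keys, Bucket, all, {date, 0, Cutoff}, tomb_flag}, C),

%% Delete only the keys that are not already tombstones.
[riak_client:delete(Bucket, K, C) || {K, false} <- Keys].
```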

This requires the following changes:

This proposal could be extended by adding a TTL to the bucket properties, which would be:

The extension to the proposal would then give a Riak-managed per-bucket TTL (as opposed to an application-managed one). Results would still not be guaranteed to be consistent with the TTL, as:

Migration from backend TTL to the new TTL mechanism would be straightforward, though it would depend on enabling Tictac AAE and managing the extra cost of doing so. It would then be possible to deprecate the use of backend TTL in clusters using AAE.

This would not be a more efficient answer than backend TTL, but it would be a more predictable one (especially where backend TTL is applied alongside anti-entropy).

martinsumner commented 5 years ago

Proposal 2.

The second proposal is to have a bucket property that can set a TTL for keep tombstones. That is to say we can ask for tombstones to expire eventually.

This wouldn't use a direct timer method as in the existing delete, but would allow a long timer to be set (e.g. 30 days), far beyond the point at which we would expect any tombstone not to have been fully replicated by one mechanism or another.
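For illustration only, if this were exposed as an ordinary bucket property it might look something like the sketch below; the property name tombstone_ttl and its unit are placeholders, not an agreed interface:

```erlang
%% Hypothetical sketch: a per-bucket TTL for "keep" tombstones, expressed as a
%% bucket property. The property name (tombstone_ttl) and unit (seconds) are
%% placeholders for whatever the implementation would actually choose.
riak_core_bucket:set_bucket(<<"sessions">>,
                            [{tombstone_ttl, 30 * 86400}]).  %% reap tombstones after ~30 days
```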

The implementation of this could be:

It might be that neatness and efficiency are not such a big requirement here. Keeping tombstones is not a massive overhead, and perhaps it would be enough simply to allow an administrator to schedule a "reap sweep" for a period of inactivity. It might be nice to have a lot of stuff happen automatically and magically under the hood - but the reality is that, at first, it only needs to be better than the solution we have now. As long as we can eventually reap tombstones if an excess of tombstones becomes a problem, we have advanced the current state.

martinsumner commented 5 years ago

For those with an interest in the problem from a theoretical perspective - this is an interesting piece of research and well worth a read.

This is not implementable for Riak (too radical a change). It isn't a perfect answer (it assumes that persisted data is not lost from disk - which is not something we like to assume in Riak). It is, though, a very interesting change in overall approach.

martinsumner commented 4 years ago

The current draft branch for 2.9.1 includes:

riak_kv_reaper; riak_kv_eraser; new aae_folds to trigger reaps and erases.

This allows an application to run aae_folds to either reap a set of tombstones, or delete a range of keys that have not been modified since a given date. The former is intended to provide a mechanism to reduce the long-term cost of running a cluster in the keep mode. The latter is proposed as an alternative to backend TTL, in that it allows the expiry of objects to be managed without creating a discrepancy with AAE stores. This is similar to the original Basho 2.2.5 proposal for riak_kv_sweeper - only in our case efficiency is gained through folding over only heads using the TictacAAE store, rather than performing multiple fold actions per fold.
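As an illustration, from a remote_console session the new folds might be invoked along the lines of the sketch below. The query tuples reflect my reading of the draft branch and should be treated as assumptions; check the branch itself for the authoritative shapes:

```erlang
%% Sketch, assuming reap_tombs / erase_keys queries of the form
%% {Query, Bucket, KeyRange, SegmentFilter, ModifiedRange, ChangeMethod}.
{ok, C} = riak:local_client(),
Cutoff = os:system_time(second) - (30 * 86400),

%% Reap tombstones in a bucket that have not been modified for ~30 days,
%% handing the work to riak_kv_reaper.
riak_client:aae_fold({reap_tombs, <<"sessions">>, all, all, {date, 0, Cutoff}, local}, C),

%% Erase (delete via riak_kv_eraser) keys not modified since the cutoff, as an
%% AAE-consistent alternative to a backend TTL.
riak_client:aae_fold({erase_keys, <<"sessions">>, all, all, {date, 0, Cutoff}, local}, C).
```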

The feature will work on any store with TictacAAE enabled. In order to reap tombstones, tictacaae_storeheads must be set to enabled when running in parallel mode.
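For completeness, a riak.conf sketch of the relevant settings (the exact option names and values should be checked against the cuttlefish schema for the release in question):

```
## riak.conf sketch: enable Tictac AAE and, when running a parallel AAE store,
## keep object heads in the AAE store so that tombstones can be reaped.
tictacaae_active = active
tictacaae_storeheads = enabled
```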

martinsumner commented 3 years ago

https://github.com/basho/riak_kv/pull/1749