basho / riak_kv

Riak Key/Value Store
Apache License 2.0

Deletion and full_sync #1869

Open martinsumner opened 1 year ago

martinsumner commented 1 year ago

When deleting, the riak_kv_delete process creates a tombstone and pushes it across the preflist. It then fetches the tombstone using riak_kv_get_fsm, because riak_kv_get_fsm is the process which prompts for tombstones to be reaped, and it does so only if all vnodes return a tombstone.

The reaping uses the riak_kv_vnode:del/3 function, which confirms that the locally stored object is a tombstone. Then, if the delete_mode requires either an immediate or a delayed reap, the reap is prompted as appropriate.

The riak_kv_get_fsm has a safety check: if not all primaries are currently up, the reap is not prompted. This prevents a down or partitioned primary that still holds an old object from resurrecting that object when the vnode reconnects.
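The decision described so far can be summarised as a rough sketch. This is a simplified Python model, not the actual Erlang in riak_kv: the function name `reap_action`, its arguments, and the return values are hypothetical, and it assumes delete_mode is one of keep, immediate, or a delay in milliseconds.

```python
def reap_action(replies_are_tombstones, primaries_up, delete_mode):
    """Return "none", "immediate" or "delayed" for a fetched key.

    Hypothetical model of the checks described above:
    replies_are_tombstones -- one bool per vnode reply
    primaries_up           -- one bool per primary in the preflist
    delete_mode            -- "keep", "immediate", or a delay in ms (assumed)
    """
    # Safety check: never reap unless every primary is currently up,
    # otherwise a returning primary could resurrect an old object.
    if not all(primaries_up):
        return "none"
    # Reaping is only prompted when all vnodes returned a tombstone.
    if not all(replies_are_tombstones):
        return "none"
    # delete_mode of keep means tombstones are never reaped.
    if delete_mode == "keep":
        return "none"
    if delete_mode == "immediate":
        return "immediate"
    # Any other value is treated here as a delay before the reap.
    return "delayed"
```

For example, `reap_action([True, True, True], [True, False, True], 3000)` gives `"none"` in this model, because one primary is down.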

When replicating, this causes a natural delta to form between clusters whenever a node is down in one of them. For every object that is deleted, if the delete_mode is not keep and a failed node is within the perfect preflist, the cluster with the failure will retain a tombstone, whereas the cluster without the failure will have no tombstone.
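To make the delta concrete, here is a toy Python model of two clusters receiving the same delete. The function `state_after_delete` and the values it returns are hypothetical simplifications for illustration; they ignore timing, handoff, and everything else a real cluster does.

```python
def state_after_delete(primaries_up, delete_mode):
    """Return what a cluster holds for a key once a delete has settled.

    Hypothetical model: the tombstone is reaped only when delete_mode
    is not "keep" and every primary in the preflist was up.
    """
    reaped = delete_mode != "keep" and all(primaries_up)
    return "absent" if reaped else "tombstone"


# Same delete applied to both clusters, with an assumed delay of 3000 ms.
cluster_a = state_after_delete([True, False, True], 3000)  # one primary down
cluster_b = state_after_delete([True, True, True], 3000)   # all primaries up

# The clusters now disagree about the key, so full-sync sees a delta.
delta = cluster_a != cluster_b
```

In this model cluster A ends with a tombstone and cluster B with nothing, which is the discrepancy full-sync then has to reconcile. With a delete_mode of keep, both clusters would retain the tombstone and no delta would form.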

This should be safe (i.e. a fresh write on the cluster where the tombstone has been reaped should form a sibling and not be dominated by the tombstone). The delta should also be resolved through full-sync. However, resolution through full-sync may be time-consuming, and more interesting sync issues may be masked (and have their resolution delayed) by this tombstone discrepancy.