basho / riak

Riak is a decentralized datastore from Basho Technologies.
http://docs.basho.com
Apache License 2.0

AAE and Expiry #981

Open · Bob-The-Marauder opened this issue 5 years ago

Bob-The-Marauder commented 5 years ago

At the moment, AAE does not handle objects that expire. As such, even though the data has gone, the entry in the AAE hashtree remains until the hashtree is cleared and rebuilt. If you have a lot of items that expire, this can make AAE clearing take a very long time.

Quick example: I store user sessions in one backend and they expire after an hour. AAE runs once monthly.

Assuming that AAE just completed:

- After 1 hour, I have 1 hour's worth of data and 1 hour's worth of AAE data. All good.
- After 2 hours, I have 1 hour's worth of data and 2 hours' worth of AAE data. OK but not ideal.
- After 3 hours, I have 1 hour's worth of data and 3 hours' worth of AAE data. OK but not ideal.
- After 24 hours, I have 1 hour's worth of data and 24 hours' worth of AAE data. Not very happy.
- After 48 hours, I have 1 hour's worth of data and 48 hours' worth of AAE data. Annoyed now.
- After 30 days, I have 1 hour's worth of data and 720 hours' worth of AAE data. Somewhat insane.

Now AAE gets to clear the hashtree, but during the hashtree clearing process, should more than 90 hashtree tokens be used, the node becomes locked until clearing completes, i.e. incoming writes have to wait and start to time out.

As hashtree clearing deletes one item at a time from the hashtree, the above hashtree bloat becomes a significant time waster. Even assuming an average of only 500 sessions per hour, that is 360,000 records to be cleared where there should only be 500. At an estimated 1 ms to clear each item from the hashtree, that alone accounts for roughly 360 seconds, and we observe 100th-percentile node PUT FSMs taking in the region of 5-6 minutes to complete in the worst cases.
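
Spelling out the arithmetic behind that estimate (using the 500 sessions/hour and 1 ms per delete figures above), from an Erlang shell:

```erlang
%% Back-of-envelope check on the figures quoted above.
StaleEntries = 500 * 720.           %% 360000 stale hashtree entries after 30 days
ClearTimeMs  = StaleEntries * 1.    %% ~360000 ms at 1 ms per delete
ClearTimeMin = ClearTimeMs / 60000. %% ~6.0 minutes, matching the observed 5-6 min
```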

To address the above, I have two suggestions:

  1. Adjust expiry to be a regular delete, i.e. when called, it deletes the entry from the AAE hashtree at the same time it deletes the data to be expired.
  2. Adjust the way that hashtree clearing works so that it deletes all data in one go, rm -rf <folder> style, rather than the current iteration over each item (see the sketch below).
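
To illustrate what the second suggestion means in practice, here is a rough sketch, assuming the per-partition AAE hashtree lives in its own LevelDB directory (the path below is hypothetical):

```erlang
%% Rough sketch only: drop the whole on-disk hashtree in one operation
%% instead of deleting its entries one at a time. Path is hypothetical.
Path = "/var/lib/riak/anti_entropy/<partition>",
ok = eleveldb:destroy(Path, []).
```
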
martinsumner commented 5 years ago

In the circumstances described, it is not clear why the process of "clearing trees" prior to the rebuilding of trees should take 5-6 minutes. I can't find evidence that it is clearing one item at a time; however, the logs do clearly indicate that a 5-6 minute delay is involved.

Within this 5-6 minute window, the following occurs to "clear" the trees:

Given that the 5 to 6 minute delay is a reality, and that attempting to modify/improve the eleveldb functions seems risky, something needs to be done about coping with that delay. The kv_index_hashtree process will be busy for the period, so hitting the aae tokens limit is hard to avoid.

Most riak customers would not experience a problem here, as concurrent rebuilds are blocked by configuration, and also since 2.9.0 the vnode_proxy polling would steer writes away from being coordinated by the blocked vnode.

However, if AAE is run where some buckets have a TTL, spreading out the AAE rebuilds causes problems - as the AAE tree has no way of expiring objects other than through rebuilds. So if rebuilds happen at different times, then the trees will remain inconsistent and constantly lead to false repairs.

There are three problems:

1. Why, for some customers, is clearing the trees prior to rebuild taking so long?
2. Can active anti-entropy be managed so that expiring objects can be supported without the need to coordinate rebuilds?
3. If rebuilds are coordinated, can scripts prompting the rebuild manage the async token mechanism so that locking during the rebuild process does not halt the vnode?

martinsumner commented 5 years ago

To tackle (3), there are two proposals:

A - allow for scripts to reset the current token bucket size to a new integer, by adding a new API call for the riak_kv_vnode, and an API call on the kv_entropy_manager to roll over all active primary local vnodes making this change. Hence scripts could lift the bucket size prior to any change, and reduce it back afterwards. Note it is not enough to change the riak_kv.anti_entropy_max_async value, as this only alters the reset value when the token bucket next expires.

B - Apply a non-infinite timeout to the non-async insert/delete calls on kv_index_hashtree (e.g. https://github.com/basho/riak_kv/blob/2.1.4/src/riak_kv_index_hashtree.erl#L125). This would still delay the vnode if it is ahead of the hashtree process, and would also cause some updates to AAE to be missed. However, this appears to be a better compromise than locking the vnode until the kv_index_hashtree process is unlocked (see the sketch below).
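
As a rough illustration of (B), the change amounts to swapping the infinity timeout on the blocking call for a finite one and tolerating the occasional missed update. A minimal sketch, not the actual riak_kv_index_hashtree code; the message shape and the 60000 ms figure are placeholders:

```erlang
%% Sketch of proposal B: bound the synchronous hashtree call instead of
%% blocking the vnode indefinitely. 60000 ms is an arbitrary placeholder.
insert_sync(Tree, Items, Opts) ->
    try
        gen_server:call(Tree, {insert, Items, Opts}, 60000)
    catch
        exit:{timeout, _} ->
            %% The hashtree update is lost; AAE will pick it up at the
            %% next exchange/rebuild rather than stalling the vnode now.
            {error, timeout}
    end.
```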

Both changes are fairly minimal from a code perspective, but are hard to generate tests for. Currently I don't think there are any tests of this scenario, so we would either have to rely on code review, or spend some time to think out a test.

martinsumner commented 5 years ago

With regards to the longer-term answer, there needs to be a way of excluding buckets from AAE (either legacy AAE or cached Tictac AAE). This will allow administrators to run AAE for non-expiring buckets only.

Expiring buckets would need to fall back to read repair on legacy AAE (which, given their temporary nature, may be acceptable). For Tictac AAE, expiring buckets could be handled separately using per-bucket AAE (which would not have the coordination issue).

martinsumner commented 5 years ago

My preference for the short term answer is currently (B). The intention of the aae tokens on vnode PUTs was to slow down a vnode that may be getting ahead of an AAE tree, but would a finite slowdown not be more appropriate?

Currently a blocked kv_index_hashtree causes a blocked vnode, even when the vnode is under light load (and the kv_index_hashtree mailbox could be drained rapidly once the blockage is removed).

Anecdotally aae token depletion is commonly a problem, or at least is commonly perceived to be a problem.

martinsumner commented 5 years ago

Note that if a timeout is applied to the call, the reply will eventually be received as a new message by the riak_kv vnode, and will need to be handled without crashing the riak_kv vnode.
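
To make that concrete: when a gen_server:call times out and the exit is caught, the reply can still turn up later as a plain {Ref, Reply} message in the caller's mailbox. A minimal sketch of the kind of clause needed, shown in plain gen_server style (the riak_core_vnode callback shape differs slightly):

```erlang
%% Sketch: quietly drop a late reply from a timed-out call to the
%% kv_index_hashtree process so it does not crash the calling process.
handle_info({Ref, _LateReply}, State) when is_reference(Ref) ->
    {noreply, State};
handle_info(_Other, State) ->
    %% ...existing handle_info clauses would go here...
    {noreply, State}.
```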

martinsumner commented 5 years ago

One of the problems with developing an alternative is that the current mechanism is probably ideal for the situation that it was designed to tackle. The expectation was that there may be situations (such as when using the in-memory backend in pure PUT loads) where the vnode backend could be slightly faster than the AAE backend.

In this case, where there is a small but regular delta between the AAE PUT time and the vnode PUT time, frequently syncing using a token mechanism seems like a good fit. A more complex method (such as monitoring the queue size) may make performance less regular and vnode delays more erratic, whereas the current token mechanism means that when delays occur they are no more than anti_entropy_max_async * delta.

The mechanism doesn't handle long and infrequent lock-ups in the kv_index_hashtree, but any method that would is likely to be sub-optimal for the frequent delta scenario.
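
For anyone not familiar with it, the shape of the existing mechanism is roughly as follows. This is a simplified sketch of the idea only, not the riak_kv_vnode implementation, and the message shapes are illustrative:

```erlang
-module(aae_token_sketch).
-export([update/3]).

%% Simplified sketch of the aae token idea: most hashtree updates are
%% cast asynchronously, but once the token bucket is exhausted the vnode
%% makes one blocking call, so it can never get more than MaxAsync
%% updates (i.e. roughly MaxAsync * delta in time) ahead of the tree.
update(TreePid, Update, {TokensLeft, MaxAsync}) when TokensLeft > 0 ->
    gen_server:cast(TreePid, {insert, Update}),
    {TokensLeft - 1, MaxAsync};
update(TreePid, Update, {_Exhausted, MaxAsync}) ->
    ok = gen_server:call(TreePid, {insert, Update}, infinity),
    {MaxAsync, MaxAsync}.   %% refill the bucket after the sync point
```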

martinsumner commented 5 years ago

More information on original implementation:

https://github.com/basho/riak_kv/commit/aceac76814140e65c9ed6a7f42e6a2b6d4190c8f

martinsumner commented 5 years ago

It should be noted that in 2.9.0, this was simply abandoned for Tictac AAE. The token bucket works as-is for the original kv_index_hashtree solution, but there is no token/queue management at all for kv_index_tictactree.

martinsumner commented 5 years ago

Looking into a longer term solution, it may actually be a relatively small amount of additional effort compared to implementing workarounds. The starting point of this problem is a riak implementation where some buckets (through alternate backend mappings) have a TTL and so should NOT be in the AAE store.

Because they are in the AAE store, there is a cycle of events that increase both the size of the AAE stores (and the cost of clearing) and require coordination of clearing (hence increasing the impact of clearing).

A new bucket property could be added - aae_exclude

This bucket property would default to false. However, on a bucket (or bucket type) it may be set to true. This property would be checked in two circumstances:

This would allow for the operator to exclude TTL-based buckets from AAE if they require. This would also exclude these buckets from full-sync replication with Tictac AAE (although not from per-bucket full-sync replication, so the option would still exist to replicate). Perhaps confusingly, the objects would still be included in key-listing full-sync.
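
If the property were added, an operator could then flag the expiring buckets with something like the following from a riak console (illustrative only: aae_exclude is the proposed property, not one that exists today, and user_sessions is a made-up bucket name):

```erlang
%% Proposed usage only; aae_exclude does not exist yet.
riak_core_bucket:set_bucket(<<"user_sessions">>, [{aae_exclude, true}]).
```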

Bob-The-Marauder commented 4 years ago

Some interesting test results:

5 servers, each with: dual-core Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz, 8GB RAM, SSD storage (Amazon T2 Large)

Riak KV 2.1.4 EE with an n_val of 3 and 100,000,000 keys spread equally across 1,000,000 buckets. If AAE trees are either expired in bulk, or expiry is set so rapid that rebuilds happen constantly (as used in these tests), we find the following:

AAE simultaneous rebuilds set as 2


 Server | Average Clearing Time | Average Building Time
------- | --------------------- | ----------------------
Alpha   |     00:09:35.623      |     00:47:54.067
Beta    |     00:09:15.381      |     00:50:39.654
Charlie |     00:10:05.127      |     00:49:43.554
Delta   |     00:08:05.421      |     00:41:59.357
Echo    |     00:09:32.614      |     00:48:00.049
Average |     00:09:18.833      |     00:47:39.336

AAE simultaneous rebuilds set as 1


 Server | Average Clearing Time | Average Building Time
------- | --------------------- | ----------------------
Alpha   |     00:06:05.283      |     00:29:05.762
Beta    |     00:06:06.507      |     00:32:13.877
Charlie |     00:06:08.078      |     00:31:09.151
Delta   |     00:05:48.752      |     00:28:27.017
Echo    |     00:06:15.801      |     00:29:46.228
Average |     00:06:04.884      |     00:30:08.407

Times are formatted as hours:minutes:seconds.milliseconds

Even though there were no obvious resource contention issues on the servers, approximately 150 hashtree clear/build processes completed for each of the above tests, with the data collected over a 24 hour period. Even more interesting is that the times with AAE simultaneous rebuilds set as 2 were the same as those for 1 until the rebuild start and finish times began to overlap, after which the times increased until they stabilised at the figures above. To compensate for this, the above tests were run after the servers had been running for over 24 hours each.

martinsumner commented 4 years ago

This is merged in for the p4 release of 2.9.0. It will now be possible to view and change the current size of the aae hashtree_token pool across all online primaries, so that the size of the pool can be temporarily lifted.

There are two functions added to riak_kv_util to achieve this:

https://github.com/basho/riak_kv/blob/55fc228e2e09cd15786a054f64fffc45eb3211c7/src/riak_kv_util.erl#L201-L238
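
I won't repeat the function names here (they are in the linked lines), but the intended usage pattern from an operator script is along the following lines; the function names below are hypothetical stand-ins for the real riak_kv_util exports:

```erlang
%% Hypothetical function names for illustration only; the real exports
%% are in the riak_kv_util lines linked above. The pattern is: lift the
%% hashtree_token pool before prompting rebuilds, restore it afterwards.
Old = riak_kv_util:report_hashtree_tokens(),      %% hypothetical name
riak_kv_util:reset_hashtree_tokens(500000),       %% hypothetical name: lift the pool
%% ... prompt the co-ordinated rebuilds here ...
riak_kv_util:reset_hashtree_tokens(Old).          %% hypothetical name: restore
```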

martinsumner commented 4 years ago

https://github.com/basho/riak/tree/riak-2.9.1 also now includes initial work on adding an eraser and a reaper.

The intention is that, when TTL is used for garbage collection of old objects, there is an alternative AAE-friendly process for doing that garbage collection, which would avoid the need for co-ordinated AAE rebuilds.

For the 2.9.2 release, the intention is to add some scheduling support within Riak for automating the deletion of objects on expiry.