basho / riak

Riak is a decentralized datastore from Basho Technologies.
http://docs.basho.com
Apache License 2.0

Feature request: Time window for AAE rebuild operations à la Bitcask [JIRA: RIAK-2684] #856

Open binarytemple-external opened 8 years ago

binarytemple-external commented 8 years ago

Overview:

In a latency-sensitive environment, an issue can arise when AAE tree rebuilds are triggered during business hours.

The current mechanism lacks the granularity to confine rebuilds to a quieter period of the day.

As a fallback measure in addition to the normal operation of AAE on-disk hash trees, Riak periodically clears and regenerates all hash trees stored on disk to ensure that hash trees correspond to the key/value data stored in Riak. This enables Riak to detect silent data corruption resulting from disk failure or faulty hardware. The anti_entropy.tree.expiry setting enables you to determine how often that takes place. The default is once a week (1w). You can set up this process to run once a day (1d), twice a day (12h), once a month (4w), and so on.

Suggested solution:

To ensure predictable latency during business hours, AAE tree rebuilds could support the same time-window mechanism as Bitcask merge operations:

bitcask.merge.policy: Lets you specify when during the day merge operations are allowed to be triggered. Valid options are: always, meaning no restrictions; never, meaning that merging will never be attempted; and window, specifying the hours during which merging is permitted, where bitcask.merge.window.start and bitcask.merge.window.end are integers between 0 and 23. If merging has a significant impact on performance of your cluster, or your cluster has quiet periods in which little storage activity occurs, you may want to change this setting from the default.
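For reference, the Bitcask settings quoted above might look like this in riak.conf. This is only a sketch of the existing mechanism that the request wants AAE to mirror; the hours chosen are an illustrative assumption, not a recommendation.

```
## Allow Bitcask merges only inside a time window (hours 1 to 5 are an example)
bitcask.merge.policy = window
bitcask.merge.window.start = 1
bitcask.merge.window.end = 5
```

An equivalent for AAE tree rebuilds would presumably need a comparable policy/start/end trio, but no such settings are described in this thread.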

nerophon commented 8 years ago

Seems sensible to me.

Basho-JIRA commented 8 years ago

Consider for 2.3. Needs some design and thought. 2.2 is on a faster path.

_[posted via JIRA by Patricia Brewer]_

haraldmosh commented 7 years ago

Anything happening with this? We're seeing it in production. Under heavy load it seems to cause large queues, which eventually cause Riak to complain that it's overloaded.

binarytemple commented 7 years ago

@haraldmosh, it doesn't seem to be a priority at the moment; it has been in the queue for a long time. I recommend contacting Basho sales if you need it as a supported feature.

binarytemple commented 7 years ago

One kludge could be to run a cron job that executes something like the following (originally from @engelsanchez) on one of your Riak nodes:

To dynamically disable AAE from the Riak console, you can run this command:

  riak_core_util:rpc_every_member_ann(riak_kv_entropy_manager, disable, [], 60000).

and re-enable it with the similar command:

  riak_core_util:rpc_every_member_ann(riak_kv_entropy_manager, enable, [], 60000).

The last argument (60000) is just a timeout for the RPC operation, in milliseconds. I hope this saves you some extra load on your clusters.
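As a rough sketch of how such a cron job might decide which of the two commands to send, the Python below picks the enable or disable expression based on the current hour. The window semantics (inclusive hours, wrapping past midnight when start is greater than end) and the names `in_window` and `aae_expression` are assumptions made for illustration. How the expression is delivered to the Riak console is left out, and note that this toggles AAE as a whole rather than only tree rebuilds.

```python
from datetime import datetime

# Erlang expressions quoted from the comment above; 60000 is the RPC timeout.
DISABLE_AAE = (
    "riak_core_util:rpc_every_member_ann("
    "riak_kv_entropy_manager, disable, [], 60000)."
)
ENABLE_AAE = (
    "riak_core_util:rpc_every_member_ann("
    "riak_kv_entropy_manager, enable, [], 60000)."
)


def in_window(hour, start, end):
    """Return True if `hour` lies in the inclusive window [start, end].

    All values are hours of the day (0-23). A window whose start is greater
    than its end is assumed to wrap past midnight, e.g. 22 to 5.
    """
    for value in (hour, start, end):
        if not 0 <= value <= 23:
            raise ValueError(f"hour out of range 0-23: {value}")
    if start <= end:
        return start <= hour <= end
    return hour >= start or hour <= end


def aae_expression(hour, start, end):
    """Pick the expression to run: enable AAE inside the window, disable outside."""
    return ENABLE_AAE if in_window(hour, start, end) else DISABLE_AAE


if __name__ == "__main__":
    # Hypothetical quiet window from hour 1 to hour 5; adjust to your cluster.
    print(aae_expression(datetime.now().hour, 1, 5))
```

Run hourly from cron, the script prints the expression that matches the current hour, so AAE is only active during the chosen quiet period.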