basho / riak_kv

Riak Key/Value Store
Apache License 2.0

Soft threshold checks for coordinators #1661

Closed martinsumner closed 4 years ago

martinsumner commented 6 years ago

Background

For most operations, Riak does a good job of working around the slow node problem. By sending n requests for a GET and being able to return a response after r replies, low variance in client response times can still be achieved even when nodes are impaired by hardware failure (e.g. loss of a disk in a RAID array) or software interruption (e.g. AAE tree rebuilds, page cache pollution).

Riak does not necessarily deliver a low median response time, but it does deliver a relatively low maximum/high-percentile response time.

This property does not necessarily hold, though, when an operation requires a coordinator, as PUT operations do. When a coordinator is selected for a PUT, if the coordinating vnode is on a slow node, or has a large mailbox queue for vnode-specific reasons, the PUT will run at the pace of that vnode. In testing at the NHS, significant increases in mean PUT response times have been seen under certain high-pressure scenarios, driven by a sharp increase in slow outliers as vnode mailbox queues rise and those vnodes take their equal share of chances to be the PUT coordinator.

These circumstances can be triggered by big events (such as the failures and expensive processes previously referred to), but also by relatively hard to detect events (larger than usual compaction activity in a vnode, a node acting as coordinator for an exceptionally large index query etc). Running a cluster with disk utilisation at around 85% will lead to frequent occurrences of bursts of very slow PUTs due to short-term fluctuations between nodes in the vnode mailbox queue size, with bursts where queue sizes across a node exceed 100 messages.

Although only PUT currently has a coordinator, there is a proposal from @martincox to use coordinators for GETs as well. This is to allow Riak to avoid unnecessarily doing n GETs, and instead do 1 GET and n-1 HEAD requests (in most scenarios) - reducing both the overall vnode load and network bandwidth used, especially when values are large. To implement this though, a vnode must be nominated by the GET FSM to be the one responsible for the GET and not the HEAD, and the response to the client will be at least as slow as this single vnode (as with the coordinated PUT).

It would be preferable if this efficiency change could be made without running the risk of seeing response time volatility as individual vnodes experience increased queue lengths, for example due to additional background workload on that node. Previously it had been suggested that this could be avoided by using n-HEADs followed by 1-GET (https://github.com/martinsumner/leveled/blob/master/docs/FUTURE.md#n-heads-1-get-or-1-get-n-1-heads), but this requires an additional backend message, and does nothing to directly help the PUT scenario (other than through indirectly reducing the risk of larger mailbox queues by diverting more expensive GET operations from vnodes with backlogs).

Proposal

It is proposed that the general problem of selecting 1 out of n vnodes for extra work (i.e. selecting a coordinator) could be addressed by making soft threshold mailbox queue checks to the vnode proxy. These checks will be "soft" in comparison to the current "hard" overload checks made at the vnode proxy. The checks will allow the choice of coordinator to be validated before it is used, at a very low cost, with no additional backend work required.

Mailbox Monitoring

Currently all messages to vnodes pass through the riak_core_vnode_proxy (https://github.com/basho/riak_core/blob/2.1.9/src/riak_core_vnode_proxy.erl).

This proxy monitors the mailbox queue size (very loosely) in order to detect the overload state (https://github.com/basho/riak_core/blob/2.1.9/src/riak_core_vnode_proxy.erl#L277-L288). This state effectively fails all messages received once the mailbox has reached the default threshold of 10000 messages. [Note that handle_overload_command is not part of the required vnode behaviour, but is implemented in riak_kv_vnode to return an error.] The default threshold here is probably two orders of magnitude too large for this purpose, and the consequent action is unhelpful.

The proposal is to monitor the mailbox size more accurately within the vnode_proxy, so that the proxy can handle a soft_overload ping message, responding with an approximation of the current mailbox size. This could be achieved by using the same polling algorithm with a lower request interval (https://github.com/basho/riak_core/blob/2.1.9/src/riak_core_vnode_proxy.erl#L240), without significantly reducing the interval for the expensive/accurate check (https://github.com/basho/riak_core/blob/2.1.9/src/riak_core_vnode_proxy.erl#L254). A request interval of 50 is perhaps a good starting point.

The vnode_proxy could then receive a new queue_size message, which would return the current queue estimate, or false if the current queue estimate is less than twice the poll interval (100, if a request interval of 50 is used by default). This queue estimate could then be used when selecting coordinators.
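To make the mechanics concrete, here is an illustrative sketch of the bookkeeping described above. It is Python rather than the Erlang of riak_core_vnode_proxy, and names such as `VnodeProxy`, `REQUEST_INTERVAL` and `SOFT_THRESHOLD` are invented for the example; only the shape of the logic (a cheap estimate held in the proxy state, and a queue_size answer of either false or the estimate) comes from the proposal.

```python
REQUEST_INTERVAL = 50                    # proposed lower poll interval
SOFT_THRESHOLD = 2 * REQUEST_INTERVAL    # 100: below this, answer "false"

class VnodeProxy:
    """Models a proxy keeping a loose mailbox-size estimate for one vnode."""

    def __init__(self):
        self.sent_since_ping = 0    # messages forwarded since the last ping
        self.estimate = 0           # last known queue-size estimate

    def forward(self):
        # Called for every message forwarded to the vnode. Every
        # REQUEST_INTERVAL messages, the real proxy would ping the vnode
        # to learn how many forwarded messages are still unconsumed.
        self.sent_since_ping += 1
        if self.sent_since_ping >= REQUEST_INTERVAL:
            self.sent_since_ping = 0

    def on_ping_reply(self, unanswered):
        # `unanswered` = messages sent to the vnode but not yet consumed.
        self.estimate = unanswered

    def queue_size(self):
        # The proposed soft check: answered from proxy state alone,
        # with no work forwarded to the backend.
        if self.estimate < SOFT_THRESHOLD:
            return False            # queue is fine; use this coordinator
        return self.estimate        # caller may compare against other vnodes
```

The key property is that `queue_size` never touches the vnode itself, so a backlogged vnode can still answer it instantly via its proxy.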

Selecting A Coordinator

The vnode_proxy queue_size message can offer an immediate response from the proxy LoopState, without forwarding to the backend and so without impacting any backlog there. So when a coordinator is chosen by a PUT or GET FSM (with, presumably, local vnodes in the preflist chosen first as at present), a queue_size message may be sent first to see if the current mailbox queue size is below the threshold (i.e. false is returned). If the queue size is above the threshold, other vnodes may be polled for queue_size and the shortest queue chosen.

This queue_size request could be made for every coordinated request. Alternatively, a cache of queue size responses for each node could be kept in a named ETS table, and a poll only used if the cached entry for the chosen vnode has expired. Given that the check is so low cost, perhaps sending one message for every coordinator nomination is acceptable.

FSMs waiting for a queue_size response should do so with an aggressive timeout (e.g. 1s), and assume the queue size is currently infinite if the timeout is hit without a response. If no vnodes can return a queue_size within the timeout, then this should be treated as an overload error.
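The selection rule in the last two paragraphs can be sketched as follows. This is a hypothetical Python model of FSM-side logic, not riak_kv code: `choose_coordinator`, `poll` and `OverloadError` are invented names, and `poll(vnode, timeout)` stands in for sending queue_size to a vnode_proxy and waiting for the reply.

```python
class OverloadError(Exception):
    """Raised when no vnode answers a queue_size poll in time."""

def choose_coordinator(preflist, poll, timeout=1.0):
    """Pick a coordinator, preferring the first (presumably local) vnode.

    `poll(vnode, timeout)` returns False when the vnode's queue is below
    the soft threshold, an integer estimate otherwise, or raises
    TimeoutError if no reply arrives in time.
    """
    estimates = {}
    for vnode in preflist:
        try:
            size = poll(vnode, timeout)
        except TimeoutError:
            continue                 # treat the queue as infinite; skip vnode
        if size is False:
            return vnode             # below the soft threshold: good enough
        estimates[vnode] = size
    if estimates:
        # Every responder was above threshold: take the shortest known queue.
        return min(estimates, key=estimates.get)
    raise OverloadError("no vnode answered queue_size within the timeout")
```

Note that the first vnode to come back below threshold wins outright; queue lengths are only compared when every responding vnode is above the threshold.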

In the current PUT FSM implementation it would probably be preferable to make the queue_size check before making the forwarding decision: i.e. currently the PUT FSM chooses not to forward if a vnode in the preflist is local, and forwards to the first vnode in the preflist otherwise.

martinsumner commented 6 years ago

@russelldb - does this explain it well enough?

russelldb commented 6 years ago

it does @martinsumner, thanks. IMO no need to cache in ETS unless there is an observed need to cache in ETS (i.e. we should do the simple work and benchmark/profile before adding a cache.)

jadeallenx commented 6 years ago

Sorry to be that guy - I'm just wondering: if you fired off a request to some random node, maybe you should send another request 5 milliseconds later to other (possibly different) nodes. This would be like the techniques described in The Tail at Scale.

martinsumner commented 6 years ago

As in a hedged request for the queue_size message, or something more fundamentally different? For the queue_size message I think that's an interesting idea, and it seems more helpful than having to rely so much on the aggressive timeout.
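A minimal sketch of what a hedged queue_size request could look like, in the style suggested above: ask the preferred vnode first, and if no reply arrives within a short hedge delay (the 5 ms figure comes from the comment above), fan out to the remaining vnodes and accept whichever reply comes back first. Python threads and a queue stand in for Erlang message passing here; `hedged_queue_size` and `poll` are invented names for illustration.

```python
import queue
import threading
import time

def hedged_queue_size(vnodes, poll, hedge_delay=0.005, timeout=1.0):
    """Poll vnodes[0] for its queue size; if it hasn't answered within
    hedge_delay, also poll the rest, returning the first reply received."""
    replies = queue.Queue()

    def ask(vnode):
        replies.put((vnode, poll(vnode)))     # poll() may be arbitrarily slow

    threading.Thread(target=ask, args=(vnodes[0],), daemon=True).start()
    try:
        return replies.get(timeout=hedge_delay)   # fast path: primary answered
    except queue.Empty:
        pass
    for vnode in vnodes[1:]:                      # hedge: ask everyone else
        threading.Thread(target=ask, args=(vnode,), daemon=True).start()
    return replies.get(timeout=timeout)           # first reply wins
```

Compared with relying only on the 1s timeout, a slow primary costs at most the hedge delay before the other replicas get a chance to answer.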

That paper is on my reading list for tomorrow now :-)

jadeallenx commented 6 years ago

Yes, exactly. The hedged request method showed substantial improvements in median and mode, and reduced tail latency quite a bit.

martinsumner commented 6 years ago

There are some hints to the potential benefits of this from the following volume test write up:

https://github.com/martinsumner/leveled/blob/master/docs/VOLUME.md#riak-cluster-test---phase-6---nhs-hardware-compare

Working around individual mailbox queues prior to the overload state being entered is important to maintaining stable response times.

martinsumner commented 6 years ago

Been thinking tonight about which vnode to initially choose as the GET coordinator.

Initially I had thought:

The issue though is cache efficiency. If we assume requests are balanced equally across nodes, then requests for a given object will have their coordinator balanced evenly across the three vnodes. However, we're doing nothing to help the cache. If the same vnode were chosen as coordinator each time (subject to the soft overload check), then the page cache could potentially be 3x as efficient for reads (as we would not be wasting pages in the cache on the nodes which hold the other vnodes).

A cache miss is more expensive than a network round trip iff it results in a disk head movement on a hard disk drive (which in reality will be every read that misses, unless we're pulling objects sequentially).

Perhaps so as not to hamper those using SSDs, the algorithm should be:

So if there's an 8-node cluster, 75% of GET requests will result in the head of the preflist being chosen as the GET coordinator (subject to the soft overload check), with that percentage increasing as the cluster grows.
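For what it's worth, the 75% figure is consistent with one possible reading of the algorithm (the bullet list itself appears to have been lost from this comment): prefer a preflist vnode local to the coordinating node when there is one, and otherwise fall back to the head of the preflist. Under that assumption, the head is chosen whenever the coordinating node hosts the head itself or hosts no preflist vnode at all. The arithmetic, with `head_chosen_fraction` as a hypothetical helper name:

```python
from fractions import Fraction

def head_chosen_fraction(nodes, n_val=3):
    """Fraction of GETs coordinated by the preflist head, assuming requests
    are spread evenly across nodes and each of the n_val preflist vnodes
    lives on a distinct node."""
    local_is_head = Fraction(1, nodes)          # node hosts the head itself
    no_local = Fraction(nodes - n_val, nodes)   # node hosts no preflist vnode
    return local_is_head + no_local

print(head_chosen_fraction(8))    # 3/4, i.e. 75% on an 8-node cluster
print(head_chosen_fraction(16))   # 7/8: grows as the cluster grows
```

This matches both the 75% figure for 8 nodes and the claim that the percentage increases with cluster size; if the elided algorithm was something else, the helper above would need adjusting.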

In the volume tests referred to above, performance drops sharply as disk seeks start occurring on HDDs - so a dramatic decrease in disk seeks through more cache intelligent routing could offer significant benefits.

[Note - this whole cache efficiency point is only relevant to Riak + leveled at present, as in no other backend can the fetch of the whole object from disk be avoided on the nodes other than the GET coordinator. Where a HEAD request has been added (e.g. to Bitcask), it is a backend GET and then a strip to HEAD, for network efficiency only.]

russelldb commented 6 years ago

In the spirit of doing the least amount of work possible to decide if this is a worthwhile approach, how does this sound:

Then run the volume tests. Too minimal?

martinsumner commented 5 years ago

For PUTs this is now added to 2.9.

Whether this should be extended to GETs is still an outstanding question.

martinsumner commented 4 years ago

The GET question will now be resolved in #1692. Implemented with PUTs, as well as a switch to disable the use of the proxy poll: https://github.com/basho/riak_kv/commit/24ec5d26459c59fcdd75f30ae92a6b3bc83c729e