For now, Pegasus only supports reading from the primary replica. This proposal aims to add support for reading from secondaries, which should help when the primary is unstable (e.g. during rebalancing or hotspot writing) while the secondaries can still serve reads with lower latency. Reading from secondaries therefore improves availability.
Backup request is also known as "hedged request", which was first introduced in "The Tail at Scale" (Dean and Barroso, 2013). BRPC also implements this mechanism.
Backup request is designed to be driven entirely by the client: when the client wants to read from a secondary, it can do so immediately, without registering with the meta server or the replica server and without any additional communication with the server side.
The client sends a read request to a secondary with the option `is_backup_request` set in the RPC header. When the replica handling the request sees that this option is true, it proceeds with the DB read; otherwise, it responds with `ERR_INVALID_STATE`.
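To make the read path concrete, here is a minimal model of it in Java. All names below (`ReadRequest`, `Replica`, `BackupRequestClient`, `serveRead`, and so on) are illustrative assumptions for this sketch, not the actual rdsn/Pegasus types: the client marks the request as a backup request in the header before sending it to a secondary, and a non-primary replica serves the read only when that flag is set.

```java
// Illustrative model of the backup-request read path (hypothetical names,
// not the real rdsn/Pegasus classes).
enum ReplicaRole { PRIMARY, SECONDARY }
enum ErrorCode { ERR_OK, ERR_INVALID_STATE }

class ReadRequest {
    final String hashKey;
    final String sortKey;
    boolean isBackupRequest; // carried in the RPC header

    ReadRequest(String hashKey, String sortKey) {
        this.hashKey = hashKey;
        this.sortKey = sortKey;
    }
}

class ReadResponse {
    final ErrorCode error;
    final byte[] value;

    ReadResponse(ErrorCode error, byte[] value) {
        this.error = error;
        this.value = value;
    }
}

class Replica {
    private final ReplicaRole role;

    Replica(ReplicaRole role) {
        this.role = role;
    }

    ReadResponse serveRead(ReadRequest req) {
        // A secondary only serves reads explicitly marked as backup requests;
        // any other read is rejected so that ordinary reads still go to the primary.
        if (role != ReplicaRole.PRIMARY && !req.isBackupRequest) {
            return new ReadResponse(ErrorCode.ERR_INVALID_STATE, null);
        }
        return new ReadResponse(ErrorCode.ERR_OK, readFromLocalDb(req));
    }

    private byte[] readFromLocalDb(ReadRequest req) {
        // Placeholder for the local RocksDB lookup on this replica.
        return new byte[0];
    }
}

class BackupRequestClient {
    // Setting the flag in the request header is the only thing the client has to do;
    // no registration with the meta server is required.
    ReadResponse readFromSecondary(Replica secondary, String hashKey, String sortKey) {
        ReadRequest req = new ReadRequest(hashKey, sortKey);
        req.isBackupRequest = true;
        return secondary.serveRead(req);
    }
}
```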
[x] Support the `is_backup_request` option in the RPC header and allow the secondary replica to handle read requests: https://github.com/XiaoMi/rdsn/pull/255. (Because the previous RPC protocol was not extendable with one more field, we redesigned the RPC protocol and changed the request parsing.)
[x] Java client adaptation: https://github.com/XiaoMi/pegasus-java-client/pull/93
[x] Performance Testing
Here are the relevant paragraphs describing "hedged requests" from "The Tail at Scale":
Hedged requests. A simple way to curb latency variability is to issue the same request to multiple replicas and use the results from whichever replica responds first. We term such requests “hedged requests” because a client first sends one request to the replica believed to be the most appropriate, but then falls back on sending a secondary request after some brief delay. The client cancels remaining outstanding requests once the first result is received. Although naive implementations of this technique typically add unacceptable additional load, many variations exist that give most of the latency-reduction effects while increasing load only modestly.
One such approach is to defer sending a secondary request until the first request has been outstanding for more than the 95th-percentile expected latency for this class of requests. [...] An alternative is to not choose but rather enqueue copies of a request in multiple servers simultaneously and allow the servers to communicate updates on the status of these copies to each other. We call requests where servers perform cross-server status updates "tied requests." The simplest form of a tied request has the client send the request to two different servers, each tagged with the identity of the other server ("tied"). When a request begins execution, it sends a cancellation message to its counterpart. The corresponding request, if still enqueued in the other server, can be aborted immediately or deprioritized substantially.
There is a brief window of one average network message delay where both servers may start executing the request while the cancellation messages are both in flight to the other server. A common case where this situation can occur is if both server queues are completely empty. It is useful therefore for the client to introduce a small delay of two times the average network message delay (1ms or less in modern data-center networks) between sending the first request and sending the second request.
Google’s implementation of this technique in the context of its cluster-level distributed file system is effective at reducing both median and tail latencies. Table 2 lists the times for servicing a small read request from a BigTable where the data is not cached in memory but must be read from the underlying file system; each file chunk has three replicas on distinct machines. The table includes read latencies observed with and without tied requests for two scenarios: The first is a cluster in which the benchmark is running in isolation, in which case latency variability is mostly from self-interference and regular cluster-management activities. In it, sending a tied request that does cross-server cancellation to another file system replica following 1ms reduces median latency by 16% and is increasingly effective along the tail of the latency distribution, achieving nearly 40% reduction at the 99.9th-percentile latency. The second scenario is like the first except there is also a large, concurrent sorting job running on the same cluster contending for the same disk resources in the shared file system. Although overall latencies are somewhat higher due to higher utilization, similar reductions in the latency profile are achieved with the tied-request technique discussed earlier. The latency profile with tied requests while running a concurrent large sorting job is nearly identical to the latency profile of a mostly idle cluster without tied requests. Tied requests allow the workloads to be consolidated into a single cluster, resulting in dramatic computing cost reductions. In both Table 2 scenarios, the overhead of tied requests in disk utilization is less than 1%, indicating the cancellation strategy is effective at eliminating redundant reads.
An alternative to the tied-request and hedged-request schemes is to probe remote queues first, then submit the request to the least-loaded server. It can be beneficial but is less effective than submitting work to two queues simultaneously for three main reasons: load levels can change between probe and request time; request service times can be difficult to estimate due to underlying system and hardware variability, and clients can create temporary hot spots by all clients picking the same (least-loaded) server at the same time. The Distributed Shortest-Positioning Time First system uses another variation in which the request is sent to one server and forwarded to replicas only if the initial server does not have it in its cache and uses cross-server cancellations.
Worth noting is this technique is not restricted to simple replication but is also applicable in more-complex coding schemes (such as Reed-Solomon) where a primary request is sent to the machine with the desired data block, and, if no response is received following a brief delay, a collection of requests is issued to a subset of the remaining replication group sufficient to reconstruct the desired data, with the whole ensemble forming a set of tied requests.
Note, too, the class of techniques described here is effective only when the phenomena that causes variability does not tend to simultaneously affect multiple request replicas. We expect such uncorrelated phenomena are rather common in large-scale systems.
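As a rough illustration of the hedged-request pattern quoted above, the sketch below uses plain java.util.concurrent primitives; it is not the Pegasus client implementation, and the names are made up for this example. The read is issued to the primary immediately, a backup read to a secondary is scheduled after a brief delay, whichever response arrives first is returned, and the deferred backup is cancelled if the primary answers in time.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Sketch of the hedged-request pattern; illustrative only, not the Pegasus client API.
public class HedgedRead {
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);

    /**
     * Issues readFromPrimary immediately. If it is still outstanding after
     * backupDelayMs, a backup read is sent to a secondary, and whichever
     * response arrives first is returned.
     */
    public CompletableFuture<byte[]> read(Supplier<byte[]> readFromPrimary,
                                          Supplier<byte[]> readFromSecondary,
                                          long backupDelayMs) {
        CompletableFuture<byte[]> primary =
                CompletableFuture.supplyAsync(readFromPrimary, scheduler);

        // Deferring the backup request keeps the extra load modest: it is only
        // sent if the primary read has not answered within the delay.
        CompletableFuture<byte[]> backup = new CompletableFuture<>();
        ScheduledFuture<?> trigger = scheduler.schedule(() -> {
            CompletableFuture.supplyAsync(readFromSecondary, scheduler)
                    .whenComplete((value, err) -> {
                        if (err != null) backup.completeExceptionally(err);
                        else backup.complete(value);
                    });
        }, backupDelayMs, TimeUnit.MILLISECONDS);

        // First response wins; if the primary answers in time, the scheduled
        // backup request is cancelled before it is ever sent.
        return primary.applyToEither(backup, value -> value)
                .whenComplete((value, err) -> trigger.cancel(false));
    }
}
```

The delay controls the trade-off between extra load on the secondaries and how much of the latency tail is cut off; the "delay Nms" values in the tests below are this parameter.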
Performance test

set/get operations:

| test case | backup request enabled | read/write proportion | qps | read avg | read p95 | read p99 | read p999 | read p9999 | write avg | write p95 | write p99 | write p999 | write p9999 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3-clients 15-threads | no | 1 : 3 | 7076 | 880.65 | 428.0 | 727.0 | 138495.0 | 988671.0 | 2495.07 | 6319.0 | 9023.0 | 36319.0 | 531455.0 |
| 3-clients 15-threads | yes, delay 138ms | 1 : 3 | 6987 | 1010.14 | 403.0 | 7747.0 | 138751.0 | 153599.0 | 2476.10 | 6859.0 | 9119.0 | 13759.0 | 185855.0 |
| 3-clients 100-threads | no | 1 : 0 | 140607 | 707.99 | 1474.0 | 2731.0 | 5511.0 | 167551.0 | - | - | - | - | - |
| 3-clients 100-threads | yes, delay 5ms | 1 : 0 | 77429 | 1288.01 | 2935.0 | 3487.0 | 6323.0 | 71743.0 | - | - | - | - | - |
| 3-clients 30-threads | no | 30 : 1 | 87198 | 306.96 | 513.0 | 805.0 | 4863.0 | 28271.0 | 1369.47 | 2661.0 | 5795.0 | 22319.0 | 51359.0 |
| 3-clients 30-threads | yes, delay 5ms | 30 : 1 | 88541 | 298.22 | 493.0 | 711.0 | 4483.0 | 18479.0 | 1467.61 | 3263.0 | 6411.0 | 17439.0 | 50975.0 |
Multi-get/batch-set operations:

| test case | backup request enabled | read/write proportion | qps | read avg | read p95 | read p99 | read p999 | read p9999 | write avg | write p95 | write p99 | write p999 | write p9999 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3-clients 7-threads | no | 20 : 1 | 24113 | 200.38 | 277.0 | 410.0 | 2317.0 | 21647.0 | 2034.19 | 4283.0 | 6427.0 | 18271.0 | 62687.0 |
| 3-clients 7-threads | yes, delay 2ms | 20 : 1 | 23756 | 197.49 | 268.0 | 351.0 | 2173.0 | 5759.0 | 2187.20 | 4531.0 | 6551.0 | 21551.0 | 63999.0 |
| 3-clients 15-threads | no | 20 : 1 | 30980 | 236.75 | 348.0 | 526.0 | 3535.0 | 25695.0 | 5361.38 | 14087.0 | 20223.0 | 40639.0 | 90815.0 |
| 3-clients 15-threads | yes, delay 3ms | 20 : 1 | 30483 | 244.12 | 386.0 | 540.0 | 3105.0 | 13287.0 | 5377.99 | 14119.0 | 19535.0 | 31311.0 | 103103.0 |
In these tests, the backup requests are sent to the secondary replicas.