For now, Pegasus only supports reading from the primary replica. This proposal aims to add support for reading from secondaries, which should help when the primary is unstable (e.g. during rebalancing or hotspot writing) while the secondaries can still serve reads with lower latency. Reading from secondaries therefore improves availability.
Backup request is also known as "hedged request", which was first introduced in "The Tail at Scale" (Dean and Barroso, 2013). BRPC also implements this mechanism.
Backup request is designed to be driven entirely by the client: when the client wants to read from a secondary, it can do so immediately, without registering with the meta server or the replica server and without any additional communication with the server side.
The client sends a read request to a secondary with the option `is_backup_request` set in the RPC header. When the replica handling the request sees that this option is true, it proceeds with the DB read; otherwise, it responds with `ERR_INVALID_STATE`.
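To make the read path concrete, here is a minimal model of it in Java. All names below (`ReadRequest`, `Replica`, `BackupRequestClient`, `serveRead`, and so on) are illustrative assumptions for this sketch, not the actual rdsn/Pegasus types: the client marks the request as a backup request in the header before sending it to a secondary, and a non-primary replica serves the read only when that flag is set.

```java
// Illustrative model of the backup-request read path (hypothetical names,
// not the real rdsn/Pegasus classes).
enum ReplicaRole { PRIMARY, SECONDARY }
enum ErrorCode { ERR_OK, ERR_INVALID_STATE }

class ReadRequest {
    final String hashKey;
    final String sortKey;
    boolean isBackupRequest; // carried in the RPC header

    ReadRequest(String hashKey, String sortKey) {
        this.hashKey = hashKey;
        this.sortKey = sortKey;
    }
}

class ReadResponse {
    final ErrorCode error;
    final byte[] value;

    ReadResponse(ErrorCode error, byte[] value) {
        this.error = error;
        this.value = value;
    }
}

class Replica {
    private final ReplicaRole role;

    Replica(ReplicaRole role) {
        this.role = role;
    }

    ReadResponse serveRead(ReadRequest req) {
        // A secondary only serves reads explicitly marked as backup requests;
        // any other read is rejected so that ordinary reads still go to the primary.
        if (role != ReplicaRole.PRIMARY && !req.isBackupRequest) {
            return new ReadResponse(ErrorCode.ERR_INVALID_STATE, null);
        }
        return new ReadResponse(ErrorCode.ERR_OK, readFromLocalDb(req));
    }

    private byte[] readFromLocalDb(ReadRequest req) {
        // Placeholder for the local RocksDB lookup on this replica.
        return new byte[0];
    }
}

class BackupRequestClient {
    // Setting the flag in the request header is the only thing the client has to do;
    // no registration with the meta server is required.
    ReadResponse readFromSecondary(Replica secondary, String hashKey, String sortKey) {
        ReadRequest req = new ReadRequest(hashKey, sortKey);
        req.isBackupRequest = true;
        return secondary.serveRead(req);
    }
}
```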
[x] Support the `is_backup_request` option in the RPC header and allow the secondary replica to handle read requests: https://github.com/XiaoMi/rdsn/pull/255. (Because the previous RPC protocol was not extendable with one more field, we redesigned the RPC protocol and changed the request parsing.)
[x] Java client adaptation: https://github.com/XiaoMi/pegasus-java-client/pull/93
[x] Performance Testing
Here are the relevant paragraphs describing "hedged requests" from "The Tail at Scale":
Hedged requests. A simple way to curb latency variability is to issue the same request to multiple replicas and use the results from whichever replica responds first. We term such requests “hedged requests” because a client first sends one request to the replica believed to be the most appropriate, but then falls back on sending a secondary request after some brief delay. The client cancels remaining outstanding requests once the first result is received. Although naive implementations of this technique typically add unacceptable additional load, many variations exist that give most of the latency-reduction effects while increasing load only modestly.
One such approach is to defer sending a secondary request until the first request has been outstanding for more than the 95th-percentile expected latency for this class of requests. [...] An alternative is to not choose but rather enqueue copies of a request in multiple servers simultaneously and allow the servers to communicate updates on the status of these copies to each other. We call requests where servers perform cross-server status updates "tied requests." The simplest form of a tied request has the client send the request to two different servers, each tagged with the identity of the other server ("tied"). When a request begins execution, it sends a cancellation message to its counterpart. The corresponding request, if still enqueued in the other server, can be aborted immediately or deprioritized substantially.
There is a brief window of one average network message delay where both servers may start executing the request while the cancellation messages are both in flight to the other server. A common case where this situation can occur is if both server queues are completely empty. It is useful therefore for the client to introduce a small delay of two times the average network message delay (1ms or less in modern data-center networks) between sending the first request and sending the second request.
Google’s implementation of this technique in the context of its cluster-level distributed file system is effective at reducing both median and tail latencies. Table 2 lists the times for servicing a small read request from a BigTable where the data is not cached in memory but must be read from the underlying file system; each file chunk has three replicas on distinct machines. The table includes read latencies observed with and without tied requests for two scenarios: The first is a cluster in which the benchmark is running in isolation, in which case latency variability is mostly from self-interference and regular cluster-management activities. In it, sending a tied request that does cross-server cancellation to another file system replica following 1ms reduces median latency by 16% and is increasingly effective along the tail of the latency distribution, achieving nearly 40% reduction at the 99.9th-percentile latency. The second scenario is like the first except there is also a large, concurrent sorting job running on the same cluster contending for the same disk resources in the shared file system. Although overall latencies are somewhat higher due to higher utilization, similar reductions in the latency profile are achieved with the tied-request technique discussed earlier. The latency profile with tied requests while running a concurrent large sorting job is nearly identical to the latency profile of a mostly idle cluster without tied requests. Tied requests allow the workloads to be consolidated into a single cluster, resulting in dramatic computing cost reductions. In both Table 2 scenarios, the overhead of tied requests in disk utilization is less than 1%, indicating the cancellation strategy is effective at eliminating redundant reads.
An alternative to the tied-request and hedged-request schemes is to probe remote queues first, then submit the request to the least-loaded server. It can be beneficial but is less effective than submitting work to two queues simultaneously for three main reasons: load levels can change between probe and request time; request service times can be difficult to estimate due to underlying system and hardware variability, and clients can create temporary hot spots by all clients picking the same (least-loaded) server at the same time. The Distributed Shortest-Positioning Time First system uses another variation in which the request is sent to one server and forwarded to replicas only if the initial server does not have it in its cache and uses cross-server cancellations.
Worth noting is this technique is not restricted to simple replication but is also applicable in more-complex coding schemes (such as Reed-Solomon) where a primary request is sent to the machine with the desired data block, and, if no response is received following a brief delay, a collection of requests is issued to a subset of the remaining replication group sufficient to reconstruct the desired data, with the whole ensemble forming a set of tied requests.
Note, too, the class of techniques described here is effective only when the phenomena that causes variability does not tend to simultaneously affect multiple request replicas. We expect such uncorrelated phenomena are rather common in large-scale systems.
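As a rough illustration of the hedged-request pattern quoted above, the sketch below uses plain java.util.concurrent primitives; it is not the Pegasus client implementation, and the names are made up for this example. The read is issued to the primary immediately, a backup read to a secondary is scheduled after a brief delay, whichever response arrives first is returned, and the deferred backup is cancelled if the primary answers in time.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Sketch of the hedged-request pattern; illustrative only, not the Pegasus client API.
public class HedgedRead {
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);

    /**
     * Issues readFromPrimary immediately. If it is still outstanding after
     * backupDelayMs, a backup read is sent to a secondary, and whichever
     * response arrives first is returned.
     */
    public CompletableFuture<byte[]> read(Supplier<byte[]> readFromPrimary,
                                          Supplier<byte[]> readFromSecondary,
                                          long backupDelayMs) {
        CompletableFuture<byte[]> primary =
                CompletableFuture.supplyAsync(readFromPrimary, scheduler);

        // Deferring the backup request keeps the extra load modest: it is only
        // sent if the primary read has not answered within the delay.
        CompletableFuture<byte[]> backup = new CompletableFuture<>();
        ScheduledFuture<?> trigger = scheduler.schedule(() -> {
            CompletableFuture.supplyAsync(readFromSecondary, scheduler)
                    .whenComplete((value, err) -> {
                        if (err != null) backup.completeExceptionally(err);
                        else backup.complete(value);
                    });
        }, backupDelayMs, TimeUnit.MILLISECONDS);

        // First response wins; if the primary answers in time, the scheduled
        // backup request is cancelled before it is ever sent.
        return primary.applyToEither(backup, value -> value)
                .whenComplete((value, err) -> trigger.cancel(false));
    }
}
```

The delay controls the trade-off between extra load on the secondaries and how much of the latency tail is cut off; the "delay Nms" values in the tests below are this parameter.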
Performance test

set/get operations:

| test case | backup request enabled | read/write proportion | qps | read avg | read p95 | read p99 | read p999 | read p9999 | write avg | write p95 | write p99 | write p999 | write p9999 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3-clients 15-threads | no | 1 : 3 | 7076 | 880.65 | 428.0 | 727.0 | 138495.0 | 988671.0 | 2495.07 | 6319.0 | 9023.0 | 36319.0 | 531455.0 |
| 3-clients 15-threads | yes, delay 138ms | 1 : 3 | 6987 | 1010.14 | 403.0 | 7747.0 | 138751.0 | 153599.0 | 2476.10 | 6859.0 | 9119.0 | 13759.0 | 185855.0 |
| 3-clients 100-threads | no | 1 : 0 | 140607 | 707.99 | 1474.0 | 2731.0 | 5511.0 | 167551.0 | - | - | - | - | - |
| 3-clients 100-threads | yes, delay 5ms | 1 : 0 | 77429 | 1288.01 | 2935.0 | 3487.0 | 6323.0 | 71743.0 | - | - | - | - | - |
| 3-clients 30-threads | no | 30 : 1 | 87198 | 306.96 | 513.0 | 805.0 | 4863.0 | 28271.0 | 1369.47 | 2661.0 | 5795.0 | 22319.0 | 51359.0 |
| 3-clients 30-threads | yes, delay 5ms | 30 : 1 | 88541 | 298.22 | 493.0 | 711.0 | 4483.0 | 18479.0 | 1467.61 | 3263.0 | 6411.0 | 17439.0 | 50975.0 |
Multi-get/batch-set operations:

| test case | backup request enabled | read/write proportion | qps | read avg | read p95 | read p99 | read p999 | read p9999 | write avg | write p95 | write p99 | write p999 | write p9999 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3-clients 7-threads | no | 20 : 1 | 24113 | 200.38 | 277.0 | 410.0 | 2317.0 | 21647.0 | 2034.19 | 4283.0 | 6427.0 | 18271.0 | 62687.0 |
| 3-clients 7-threads | yes, delay 2ms | 20 : 1 | 23756 | 197.49 | 268.0 | 351.0 | 2173.0 | 5759.0 | 2187.20 | 4531.0 | 6551.0 | 21551.0 | 63999.0 |
| 3-clients 15-threads | no | 20 : 1 | 30980 | 236.75 | 348.0 | 526.0 | 3535.0 | 25695.0 | 5361.38 | 14087.0 | 20223.0 | 40639.0 | 90815.0 |
| 3-clients 15-threads | yes, delay 3ms | 20 : 1 | 30483 | 244.12 | 386.0 | 540.0 | 3105.0 | 13287.0 | 5377.99 | 14119.0 | 19535.0 | 31311.0 | 103103.0 |
In these tests, the backup requests are sent to the secondary replicas.