Azure / azure-cosmosdb-java

Java Async SDK for SQL API of Azure Cosmos DB
MIT License

Add load-balanced channel support to RntbdTransportClient #117

Closed David-Noble-at-work closed 5 years ago

David-Noble-at-work commented 5 years ago

RntbdTransportClient service endpoints are no longer restricted to a single channel. They now allocate, acquire, and release channels from a pool. Channels are created on demand and released to the pool immediately after acquisition. This implementation provides most of the core manageability, performance, and scalability features.
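To make the allocate/acquire/release cycle concrete, here is a minimal sketch of an on-demand pool. It is an illustration only, not the code in this PR: `SketchChannelPool`, its `maxChannels` cap, and the choice to return `null` when the cap is reached are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

/** Hypothetical illustration of an on-demand pool; not the SDK's actual implementation. */
final class SketchChannelPool<C> {
    private final Deque<C> idle = new ArrayDeque<>(); // channels waiting to be reused
    private final Supplier<C> factory;                // creates a new channel on demand
    private final int maxChannels;                    // assumed cap on channels per endpoint
    private int allocated;                            // channels created so far

    SketchChannelPool(Supplier<C> factory, int maxChannels) {
        if (maxChannels < 1) {
            throw new IllegalArgumentException("maxChannels must be >= 1");
        }
        this.factory = factory;
        this.maxChannels = maxChannels;
    }

    /** Reuses an idle channel, allocates a new one if under the cap, else returns null. */
    synchronized C acquire() {
        C channel = idle.pollFirst();
        if (channel != null) {
            return channel;
        }
        if (allocated < maxChannels) {
            C created = factory.get();
            allocated++;
            return created;
        }
        return null; // assumption: a real pool would queue the caller instead
    }

    /** Returns a channel to the pool so that later callers can reuse it. */
    synchronized void release(C channel) {
        idle.addLast(channel);
    }

    synchronized int allocated() {
        return allocated;
    }

    synchronized int idleCount() {
        return idle.size();
    }
}
```

A caller would pair each `acquire()` with a `release()` once the channel can be shared again. A production pool would also need health checks and a pending-acquisition queue, which this sketch omits.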

One feature remains to be done: health check requests (#119). After that, I will turn my attention to performance, reliability, and scalability work.

Issues addressed in this PR:

Class diagram (a guide to the code)

Code reviewers unfamiliar with the RntbdTransportClient might find this diagram useful as a guide.

[Class diagram: RntbdTransportClient]

Read latency benchmark

Short story: Direct TCP outperformed Direct HTTPS by about 9-10%:

Protocol   RPS     Mean latency (ms)   StdDev (ms)
TCP        6,172   1.57                0.16
HTTPS      5,658   1.75                0.44
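The 9-10% figure can be checked from the summary numbers with a relative-difference calculation. The helper below is a small sketch written for this purpose; `RelativeDifference` and `percentOver` are hypothetical names, not part of the benchmark tooling.

```java
/** Hypothetical helper for checking the summary arithmetic; not part of the benchmark. */
final class RelativeDifference {
    private RelativeDifference() {
    }

    /** Percent by which value differs from baseline: 100 * (value - baseline) / baseline. */
    static double percentOver(double value, double baseline) {
        return 100.0 * (value - baseline) / baseline;
    }
}
```

With the read numbers, `percentOver(6172, 5658)` is roughly 9.08 (TCP throughput over HTTPS) and `percentOver(1.57, 1.75)` is roughly -10.29 (TCP mean latency relative to HTTPS), which brackets the 9-10% summary.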

Parameters:

Direct TCP

-- Meters ----------------------------------------------------------------------
#Successful Operations
             count = 1000000
         mean rate = 6171.68 events/second
     1-minute rate = 6100.06 events/second
     5-minute rate = 5448.40 events/second
    15-minute rate = 5136.31 events/second
#Unsuccessful Operations
             count = 0
         mean rate = 0.00 events/second
     1-minute rate = 0.00 events/second
     5-minute rate = 0.00 events/second
    15-minute rate = 0.00 events/second

-- Timers ----------------------------------------------------------------------
Latency
             count = 1000000
         mean rate = 6171.72 calls/second
     1-minute rate = 6100.25 calls/second
     5-minute rate = 5449.72 calls/second
    15-minute rate = 5138.17 calls/second
               min = 1.27 milliseconds
               max = 3.00 milliseconds
              mean = 1.57 milliseconds
            stddev = 0.16 milliseconds
            median = 1.54 milliseconds
              75% <= 1.63 milliseconds
              95% <= 1.82 milliseconds
              98% <= 1.93 milliseconds
              99% <= 2.27 milliseconds
            99.9% <= 3.00 milliseconds

Direct HTTPS

-- Meters ----------------------------------------------------------------------
#Successful Operations
             count = 1000000
         mean rate = 5658.76 events/second
     1-minute rate = 5609.83 events/second
     5-minute rate = 4936.33 events/second
    15-minute rate = 4588.23 events/second
#Unsuccessful Operations
             count = 0
         mean rate = 0.00 events/second
     1-minute rate = 0.00 events/second
     5-minute rate = 0.00 events/second
    15-minute rate = 0.00 events/second

-- Timers ----------------------------------------------------------------------
Latency
             count = 1000000
         mean rate = 5658.78 calls/second
     1-minute rate = 5609.89 calls/second
     5-minute rate = 4937.12 calls/second
    15-minute rate = 4589.39 calls/second
               min = 1.42 milliseconds
               max = 10.04 milliseconds
              mean = 1.75 milliseconds
            stddev = 0.44 milliseconds
            median = 1.68 milliseconds
              75% <= 1.79 milliseconds
              95% <= 2.07 milliseconds
              98% <= 2.40 milliseconds
              99% <= 3.17 milliseconds
            99.9% <= 10.04 milliseconds

Write latency benchmark

Short story: Direct TCP slightly underperformed Direct HTTPS, by 1-2%. The 0.08 ms difference in mean latency is well within one standard deviation of either run.

Protocol   RPS     Mean latency (ms)   StdDev (ms)
TCP        1,860   5.35                1.11
HTTPS      1,880   5.27                0.70
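The "within a single standard deviation" remark can be expressed as a simple comparison of the gap between the two means against the smaller of the two standard deviations. This is a rough sanity check under that assumption, not a statistical significance test, and `WithinOneStdDev` is a hypothetical name.

```java
/** Hypothetical sanity check; a rough comparison, not a significance test. */
final class WithinOneStdDev {
    private WithinOneStdDev() {
    }

    /** True when the gap between the means is no larger than the smaller standard deviation. */
    static boolean check(double meanA, double stdDevA, double meanB, double stdDevB) {
        double gap = Math.abs(meanA - meanB);
        return gap <= Math.min(stdDevA, stdDevB);
    }
}
```

For the write numbers, `check(5.35, 1.11, 5.27, 0.70)` compares a gap of about 0.08 ms against 0.70 ms and returns `true`.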

Parameters:

concurrency=10 consistency=Eventual operation_count=1000000

Direct TCP

-- Meters ----------------------------------------------------------------------
#Successful Operations
             count = 1000000
         mean rate = 1859.63 events/second
     1-minute rate = 1865.44 events/second
     5-minute rate = 1818.98 events/second
    15-minute rate = 1721.30 events/second
#Unsuccessful Operations
             count = 0
         mean rate = 0.00 events/second
     1-minute rate = 0.00 events/second
     5-minute rate = 0.00 events/second
    15-minute rate = 0.00 events/second

-- Timers ----------------------------------------------------------------------
Latency
             count = 1000000
         mean rate = 1859.64 calls/second
     1-minute rate = 1865.43 calls/second
     5-minute rate = 1819.05 calls/second
    15-minute rate = 1721.52 calls/second
               min = 3.84 milliseconds
               max = 18.26 milliseconds
              mean = 5.35 milliseconds
            stddev = 1.11 milliseconds
            median = 5.10 milliseconds
              75% <= 5.52 milliseconds
              95% <= 7.70 milliseconds
              98% <= 9.22 milliseconds
              99% <= 9.70 milliseconds
            99.9% <= 13.04 milliseconds

Direct HTTPS

-- Meters ----------------------------------------------------------------------
#Successful Operations
             count = 1000000
         mean rate = 1880.48 events/second
     1-minute rate = 1885.75 events/second
     5-minute rate = 1811.79 events/second
    15-minute rate = 1652.15 events/second
#Unsuccessful Operations
             count = 0
         mean rate = 0.00 events/second
     1-minute rate = 0.00 events/second
     5-minute rate = 0.00 events/second
    15-minute rate = 0.00 events/second

-- Timers ----------------------------------------------------------------------
Latency
             count = 1000000
         mean rate = 1880.49 calls/second
     1-minute rate = 1885.73 calls/second
     5-minute rate = 1811.82 calls/second
    15-minute rate = 1652.26 calls/second
               min = 4.09 milliseconds
               max = 14.04 milliseconds
              mean = 5.27 milliseconds
            stddev = 0.70 milliseconds
            median = 5.20 milliseconds
              75% <= 5.58 milliseconds
              95% <= 6.11 milliseconds
              98% <= 6.56 milliseconds
              99% <= 7.68 milliseconds
            99.9% <= 12.69 milliseconds