aerospike / aerospike-client-go

Aerospike Client Go
Apache License 2.0

Any ideas to warmup connections? #256

Closed un000 closed 5 years ago

un000 commented 5 years ago

When starting a service, we get many errors like:

last error: No available connections to the node. Connection Pool was empty, and limited to certain number of connections.

We push about 40k QPS per instance into Aerospike, so the connection pool is cold at start-up, and opening new connections takes a long time. Increasing the queue length helps a bit.

It would be good to be able to warm up the connection queue at start-up via some option, along the lines of the sketch below.
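Something like this is what we have in mind (a hypothetical sketch; the WarmUp call and the count of 128 are placeholders for whatever the API ends up looking like):

    package main

    import (
        "log"

        aerospike "github.com/aerospike/aerospike-client-go"
    )

    func main() {
        policy := aerospike.NewClientPolicy()
        policy.ConnectionQueueSize = 1024

        // "aerospike-host" is a placeholder for one of our seed nodes.
        client, err := aerospike.NewClientWithPolicy(policy, "aerospike-host", 3000)
        if err != nil {
            log.Fatal(err)
        }
        defer client.Close()

        // Hypothetical call: pre-open a number of connections per node before
        // the service starts taking traffic, so the pool is not empty when the
        // first burst of requests arrives.
        if warmed, err := client.WarmUp(128); err != nil {
            log.Printf("warm-up stopped after %d connections: %v", warmed, err)
        }
    }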

khaf commented 5 years ago

Coming in the next release.

un000 commented 5 years ago

Also, we can't update to 2.0 because of timeout errors:

03:49:11 [50154]: {"level":"info","ts":"2019-04-04T03:49:11.206-0500","caller":"aerospike-logger/aerospike_logger.go:18","msg":"Error trying to making a connection to the node BB971F70942A844 ******.125:3000: dial tcp ******:3000: i/o timeout", "source":"aerospike"}

This happens even with increased timeouts (from 25ms to 50ms) on the new version, using the following config:

    clientPolicy := aerospike.NewClientPolicy()
    clientPolicy.IdleTimeout = 15*time.Second
    clientPolicy.Timeout = 50*time.Millisecond
    clientPolicy.ConnectionQueueSize = 1024

    readPolicy := aerospike.NewPolicy()
    readPolicy.TotalTimeout = 50*time.Millisecond
    readPolicy.SocketTimeout = 50*time.Millisecond
    readPolicy.MaxRetries = 1

    writePolicy := aerospike.NewWritePolicy(0, uint32(a.Ttl.Seconds()))
    writePolicy.TotalTimeout = 50*time.Millisecond
    writePolicy.SocketTimeout = 50*time.Millisecond

We're at 30k RPS per client instance for now.

Admin> show latency
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~read Latency~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
               Node                 Time   Ops/Sec   >1Ms   >8Ms   >64Ms
                  .                 Span         .      .      .       .
***:3000   09:01:28->09:01:38   33538.5   1.35    0.0     0.0
***:3000   09:01:26->09:01:36   34697.4   1.39    0.0     0.0
***:3000   09:01:26->09:01:36   35405.3   1.47    0.0     0.0
***:3000    09:01:32->09:01:42   35304.6    1.4    0.0     0.0
***:3000    09:01:32->09:01:42   33464.2   1.42    0.0     0.0
***:3000    09:01:24->09:01:34   35163.5   1.38    0.0     0.0
Number of rows: 6

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~write Latency~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
               Node                 Time   Ops/Sec   >1Ms   >8Ms   >64Ms
                  .                 Span         .      .      .       .
***:3000   09:01:28->09:01:38   10352.1   2.21   0.03     0.0
***:3000   09:01:26->09:01:36   10607.5   2.53   0.19    0.02
***:3000   09:01:26->09:01:36   10931.2   2.96   0.01     0.0
***:3000    09:01:32->09:01:42   10756.0   2.42    0.0     0.0
***:3000    09:01:32->09:01:42   10076.3   1.85    0.0     0.0
***:3000    09:01:24->09:01:34   10771.5   2.23   0.03    0.01
Number of rows: 6

Is 50ms too low to establish a connection for the first time?

khaf commented 5 years ago

Yes, 50ms can be too low for establishing a new connection (controlled via ClientPolicy.Timeout), since the client connects to many nodes and reads their partition info at the same time, and it only takes one node to time out.

I believe your ClientPolicy.Timeout and ClientPolicy.IdleTimeout values are way too aggressive. Those values should be at least 1 second, if not longer.

un000 commented 5 years ago

Well, the errors went away after setting 2-second timeouts. Read latency got lower on 2.0: [screenshot]

un000 commented 5 years ago

New errors: EOF and broken pipe. [screenshot]

khaf commented 5 years ago

Is having so many timeouts normal?! 150,000 per minute?

khaf commented 5 years ago

I also need a bit more detail regarding the timeouts. Which timeouts do you mean? ClientPolicy.Timeout controls tends and cluster formation. Policy.Timeout and WritePolicy.Timeout control specific types of transactions. Which of these did you change?

un000 commented 5 years ago

Is having so many timeouts normal?! 150,000 per minute?

At least it works :D But no, it's not normal.

New config:

    clientPolicy := aerospike.NewClientPolicy()
    clientPolicy.IdleTimeout = 45*time.Second
    clientPolicy.Timeout = 25*time.Millisecond
    clientPolicy.ConnectionQueueSize = 3072

    readPolicy := aerospike.NewPolicy()
    readPolicy.TotalTimeout = 25*time.Millisecond
    readPolicy.SocketTimeout = 25*time.Millisecond
    readPolicy.MaxRetries = 1

    writePolicy := aerospike.NewWritePolicy(0, uint32(a.Ttl.Seconds()))
    writePolicy.TotalTimeout = 25*time.Millisecond
    writePolicy.SocketTimeout = 25*time.Millisecond

Maybe the problem is that the client discards a connection whenever a single error happens?

khaf commented 5 years ago

Do the following values fit into your SLA?

    clientPolicy := aerospike.NewClientPolicy()
    clientPolicy.IdleTimeout = 15*time.Second
    clientPolicy.Timeout = 5 * time.Second // This value is NOT used in any transactions
    clientPolicy.ConnectionQueueSize = 256 // 3072 is WAY too many. You should be fine with as few as 64, unless you use a lot of Batch or Scan/Query

    readPolicy := aerospike.NewPolicy()
    readPolicy.TotalTimeout = 50*time.Millisecond
    readPolicy.SocketTimeout = 50*time.Millisecond
    readPolicy.MaxRetries = 1

    writePolicy := aerospike.NewWritePolicy(0, uint32(a.Ttl.Seconds()))
    writePolicy.TotalTimeout = 50*time.Millisecond
    writePolicy.SocketTimeout = 25*time.Millisecond

150K timeouts per minute is absolutely huge. You should focus on finding the cause of a situation like that.

un000 commented 5 years ago

Traces look good: [screenshots]

khaf commented 5 years ago

One thing that comes to mind: this may be caused by the client dropping nodes due to a very low ClientPolicy.Timeout value. Set it to 5 seconds and see whether the number of timeouts changes significantly.

un000 commented 5 years ago

@khaf ty, yes, they do fit. I will try your config tomorrow.

We increased the number of connections because of too many NO_AVAILABLE_CONNECTIONS_TO_NODE errors.

khaf commented 5 years ago

Just another quick question: Were you seeing the same number of timeouts with the old client version?

un000 commented 5 years ago

I introduced a new metric to count them. I can run the older version to compare them in an A/B test. We have been seeing those errors either way.

khaf commented 5 years ago

Yes, I think that would be a good idea. I also see that node.getConnectionWithHint latency is very high, which probably means you have a connection churn problem: getConnection keeps creating new connections to the node for most transactions.
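For reference, one way to watch that churn from the application side (a minimal sketch, assuming the aggregated-stats helper in v2, a client already in scope, and the usual log/json/time imports; the JSON blobs later in this thread come from the same stats):

    // Periodically dump the client's aggregated cluster stats; if
    // "closed-connections" and "connections-attempts" keep growing under a
    // steady load, connections are being churned.
    go func() {
        for range time.Tick(30 * time.Second) {
            stats, err := client.Stats()
            if err != nil {
                log.Printf("stats error: %v", err)
                continue
            }
            out, _ := json.MarshalIndent(stats, "", "  ")
            log.Printf("aerospike stats:\n%s", out)
        }
    }()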

un000 commented 5 years ago

Something went wrong with this config: [screenshot]

khaf commented 5 years ago

Yes, it does not make sense. I need to think a bit about this.

un000 commented 5 years ago

In a synthetic benchmark there is no difference between the versions. I think the problem is with the network.

khaf commented 5 years ago

Can you also try the benchmarks with v1.39.0? No difference there as well?

un000 commented 5 years ago

One fat record, roughly the size of a block.

v1.39.0 (total_timeout is the timeout for the BasePolicy)
$ /tmp/benito --threads=30 --requests=50000 --client_timeout=10s --max_retries=1 --socket_timeout=50ms --total_timeout=100ms
2019/04/04 16:57:40 Benching [aerospike]...
2019/04/04 16:57:43 rps: 735, done 2199...
2019/04/04 16:57:46 rps: 729, done 4407...
2019/04/04 16:57:49 rps: 759, done 6643...
2019/04/04 16:57:52 rps: 716, done 8826...
2019/04/04 16:57:55 rps: 758, done 11071...
2019/04/04 16:57:58 rps: 733, done 13259...
2019/04/04 16:58:01 rps: 721, done 15459...
2019/04/04 16:58:04 rps: 720, done 17581...
2019/04/04 16:58:07 rps: 740, done 19809...
2019/04/04 16:58:10 rps: 758, done 22049...
2019/04/04 16:58:13 rps: 748, done 24291...
2019/04/04 16:58:16 rps: 781, done 26612...
2019/04/04 16:58:19 rps: 742, done 28881...
2019/04/04 16:58:22 rps: 743, done 31099...
2019/04/04 16:58:25 rps: 722, done 33269...
2019/04/04 16:58:28 rps: 745, done 35515...
2019/04/04 16:58:31 rps: 730, done 37702...
2019/04/04 16:58:34 rps: 738, done 39946...
2019/04/04 16:58:37 rps: 745, done 42189...
2019/04/04 16:58:40 rps: 742, done 44424...
2019/04/04 16:58:43 rps: 756, done 46700...
2019/04/04 16:58:46 rps: 730, done 48929...
2019/04/04 16:58:48 done 50030.
2019/04/04 16:58:48 timeout errors 711
2019/04/04 16:58:48 other errors 0
2019/04/04 16:58:48 0.500000th: 40.409219ms
2019/04/04 16:58:48 0.750000th: 42.895522ms
2019/04/04 16:58:48 0.900000th: 45.213426ms
2019/04/04 16:58:48 0.950000th: 47.138989ms
2019/04/04 16:58:48 0.990000th: 50.090118ms
2019/04/04 16:58:48 0.999000th: 50.608631ms
2019/04/04 16:58:48 1.000000th: 79.415351ms
v2.0.0
$ /tmp/benito --threads=30 --requests=50000 --client_timeout=10s --max_retries=1 --socket_timeout=50ms --total_timeout=100ms
2019/04/04 16:56:20 Benching [aerospike]...
2019/04/04 16:56:23 rps: 797, done 2429...
2019/04/04 16:56:26 rps: 767, done 4732...
2019/04/04 16:56:29 rps: 775, done 7056...
2019/04/04 16:56:32 rps: 787, done 9390...
2019/04/04 16:56:35 rps: 784, done 11763...
2019/04/04 16:56:38 rps: 792, done 14142...
2019/04/04 16:56:41 rps: 797, done 16540...
2019/04/04 16:56:44 rps: 801, done 18936...
2019/04/04 16:56:47 rps: 777, done 21295...
2019/04/04 16:56:50 rps: 787, done 23644...
2019/04/04 16:56:53 rps: 772, done 26008...
2019/04/04 16:56:56 rps: 737, done 28270...
2019/04/04 16:56:59 rps: 781, done 30593...
2019/04/04 16:57:02 rps: 747, done 32850...
2019/04/04 16:57:05 rps: 747, done 35100...
2019/04/04 16:57:08 rps: 722, done 37331...
2019/04/04 16:57:11 rps: 738, done 39520...
2019/04/04 16:57:14 rps: 736, done 41745...
2019/04/04 16:57:17 rps: 750, done 43975...
2019/04/04 16:57:20 rps: 764, done 46265...
2019/04/04 16:57:23 rps: 774, done 48599...
2019/04/04 16:57:25 done 50030.
2019/04/04 16:57:25 timeout errors 122
2019/04/04 16:57:25 other errors 0
2019/04/04 16:57:25 0.500000th: 38.637913ms
2019/04/04 16:57:25 0.750000th: 41.114454ms
2019/04/04 16:57:25 0.900000th: 43.551327ms
2019/04/04 16:57:25 0.950000th: 45.420969ms
2019/04/04 16:57:25 0.990000th: 51.050058ms
2019/04/04 16:57:25 0.999000th: 84.91755ms
2019/04/04 16:57:25 1.000000th: 100.124861ms

Timeout errors vary between runs, but the order of magnitude is similar.

The benchmark is here: https://github.com/un000/benito

khaf commented 5 years ago

Did you make any progress on this issue? I held off a release yesterday to see if we can get to the bottom of this.

un000 commented 5 years ago

So, the problem shows up with fat records (~128 KB, about the block size): timeouts happen with them, but not with records under 1 KB. We need a way to adjust the timeouts for that case.
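What I mean is something like this (a rough sketch, assuming an existing client and the usual imports; the namespace/set/key names and timeout values are hypothetical): use a more generous per-call read policy only for the keys known to hold fat records, instead of raising the defaults for every request.

    fatReadPolicy := aerospike.NewPolicy()
    fatReadPolicy.SocketTimeout = 150 * time.Millisecond // hypothetical value for ~128 KB records
    fatReadPolicy.TotalTimeout = 300 * time.Millisecond
    fatReadPolicy.MaxRetries = 1

    // "ns", "fat_set" and "fat_record_id" are placeholders.
    key, err := aerospike.NewKey("ns", "fat_set", "fat_record_id")
    if err != nil {
        log.Fatal(err)
    }
    record, err := client.Get(fatReadPolicy, key)
    if err != nil {
        log.Printf("fat read failed: %v", err)
    }
    _ = record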

un000 commented 5 years ago

The results are not good; see below.

v2.0.0

small record: [screenshot]

  "cluster-aggregated-stats": {
    "closed-connections": 0,
    "connections-attempts": 14,
    "connections-failed": 0,
    "connections-pool-empty": 0,
    "connections-successful": 14,
    "node-added-count": 10,
    "node-removed-count": 4,
    "open-connections": 14,
    "partition-map-updates": 6,
    "tends-failed": 0,
    "tends-successful": 142,
    "tends-total": 142
  },
  "open-connections": 14

fat record: [screenshot]

  "cluster-aggregated-stats": {
    "closed-connections": 5,
    "connections-attempts": 41,
    "connections-failed": 5,
    "connections-pool-empty": 0,
    "connections-successful": 41,
    "node-added-count": 7,
    "node-removed-count": 1,
    "open-connections": 36,
    "partition-map-updates": 6,
    "tends-failed": 0,
    "tends-successful": 198,
    "tends-total": 198
  },
  "open-connections": 36

[screenshot]


v1.39.0

small record: [screenshot]

  "cluster-aggregated-stats": {
    "closed-connections": 0,
    "connections-attempts": 11,
    "connections-failed": 0,
    "connections-pool-empty": 0,
    "connections-successful": 11,
    "node-added-count": 10,
    "node-removed-count": 4,
    "open-connections": 11,
    "partition-map-updates": 6,
    "tends-failed": 0,
    "tends-successful": 141,
    "tends-total": 141
  },
  "open-connections": 11

fat record: [screenshot]

  "cluster-aggregated-stats": {
    "closed-connections": 3,
    "connections-attempts": 39,
    "connections-failed": 3,
    "connections-pool-empty": 0,
    "connections-successful": 39,
    "node-added-count": 10,
    "node-removed-count": 4,
    "open-connections": 36,
    "partition-map-updates": 6,
    "tends-failed": 0,
    "tends-successful": 141,
    "tends-total": 141
  },
  "open-connections": 36

[screenshot]

khaf commented 5 years ago

This is great! Thank you for a very thorough investigation. What seems odd to me is the performance spread for the 'fat' records on the older version. v2 is more consistent, while the older version seems snappier overall. What machine are you testing on? Is the client or the server in the cloud or on a VM? Can we write off this spread as normal test variation, or is it consistent over multiple runs?

un000 commented 5 years ago

I think the main difference is that SocketTimeout behaves differently between versions.

I tested it on bare metal; the client and server have the same configuration and sit in the same local network, with Aerospike configured for persistence to SSD:

Model name:            Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
CPU(s):                32

Interfaces are paired in a 2Gbit bond:

BONDING_OPTS="mode=802.3ad miimon=100 xmit_hash_policy=layer3+4"

The spread is consistent between runs (the numbers of closed-connections and connections-failed are nearly the same).

khaf commented 5 years ago

I think the main difference is that SocketTimeout behaves differently between versions.

This may be the case, depending on how the runtime sets the timeouts. Before v2, we set the timeout once and the client tried to do everything within that boundary. In v2, it is a rolling timeout for each read or write, so that the client minimizes the chance of a timeout. But there seems to be a side effect in real-world use, apparently because setting the timeout on the connection counts as an I/O op and a goroutine yield. This does not show up in my test setups, but it could be the culprit.
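Conceptually the difference is something like this (an illustrative sketch on a plain net.Conn with the usual net/time imports; this is not the client's actual code):

    // Illustrative only: one up-front deadline vs. a rolling per-operation
    // deadline for a request/response round trip.
    func roundTrip(conn net.Conn, req, buf []byte, socketTimeout, totalTimeout time.Duration, rolling bool) error {
        if !rolling {
            // pre-v2 style: a single absolute deadline for the whole transaction
            if err := conn.SetDeadline(time.Now().Add(totalTimeout)); err != nil {
                return err
            }
            if _, err := conn.Write(req); err != nil {
                return err
            }
            _, err := conn.Read(buf)
            return err
        }
        // v2 style: refresh the deadline before every read and write to reduce
        // spurious timeouts; each SetDeadline call is extra work on the hot path.
        if err := conn.SetDeadline(time.Now().Add(socketTimeout)); err != nil {
            return err
        }
        if _, err := conn.Write(req); err != nil {
            return err
        }
        if err := conn.SetDeadline(time.Now().Add(socketTimeout)); err != nil {
            return err
        }
        _, err := conn.Read(buf)
        return err
    }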

Would you be open to testing this hypothesis by running locally modified client source code? I can share a gist.

khaf commented 5 years ago

Here you go: https://gist.github.com/khaf/6f6bf72575e8decdd5b0d58ec93e0424

Just replace the contents of connection.go in v2 with the above gist.

un000 commented 5 years ago

Before: [screenshot]

After: [screenshot]

Also, without the patch the RPS was higher (900 RPS vs 600 RPS max).

Damn, the timings are inconsistent with each other.

khaf commented 5 years ago

Are you sure about the before and after? The 'after' looks like the 'before' from the prior report.

un000 commented 5 years ago

I've run the bench from 2 hosts in parallel.

Before the patch, at half of the peak RPS: [screenshot]

$ /tmp/benito --threads=30 --requests=100000 --client_timeout=10s --max_retries=1 --socket_timeout=50ms --total_timeout=100ms --rps=400 --key=4e
2019/04/05 15:04:32 Benching [aerospike]...
2019/04/05 15:04:35 rps: 402, done 1197...
2019/04/05 15:04:38 rps: 395, done 2395...
2019/04/05 15:04:41 rps: 397, done 3593...
2019/04/05 15:04:44 rps: 398, done 4794...
2019/04/05 15:04:47 rps: 398, done 5991...
2019/04/05 15:04:50 rps: 399, done 7190...
2019/04/05 15:04:53 rps: 398, done 8389...
2019/04/05 15:04:56 rps: 396, done 9586...
2019/04/05 15:04:59 rps: 396, done 10782...
2019/04/05 15:05:02 rps: 398, done 11982...
2019/04/05 15:05:05 rps: 396, done 13180...
2019/04/05 15:05:08 rps: 398, done 14380...
2019/04/05 15:05:11 rps: 398, done 15574...
2019/04/05 15:05:14 rps: 398, done 16770...
2019/04/05 15:05:17 rps: 397, done 17970...
2019/04/05 15:05:20 rps: 398, done 19169...
2019/04/05 15:05:23 rps: 396, done 20369...
2019/04/05 15:05:26 rps: 397, done 21569...
2019/04/05 15:05:29 rps: 397, done 22767...
2019/04/05 15:05:32 rps: 397, done 23966...
2019/04/05 15:05:35 rps: 399, done 25166...
2019/04/05 15:05:38 rps: 397, done 26363...
2019/04/05 15:05:41 rps: 398, done 27561...
2019/04/05 15:05:44 rps: 397, done 28760...
2019/04/05 15:05:47 rps: 397, done 29960...
2019/04/05 15:05:50 rps: 398, done 31158...
2019/04/05 15:05:53 rps: 397, done 32359...
2019/04/05 15:05:56 rps: 397, done 33556...
2019/04/05 15:05:59 rps: 397, done 34756...
2019/04/05 15:06:02 rps: 398, done 35957...
2019/04/05 15:06:05 rps: 398, done 37157...
2019/04/05 15:06:08 rps: 396, done 38352...
2019/04/05 15:06:11 rps: 396, done 39549...
2019/04/05 15:06:14 rps: 399, done 40749...
2019/04/05 15:06:17 rps: 395, done 41945...
2019/04/05 15:06:20 rps: 396, done 43143...
2019/04/05 15:06:23 rps: 399, done 44341...
2019/04/05 15:06:26 rps: 398, done 45539...
2019/04/05 15:06:29 rps: 398, done 46737...
2019/04/05 15:06:32 rps: 396, done 47933...
2019/04/05 15:06:35 rps: 398, done 49132...
2019/04/05 15:06:38 rps: 398, done 50330...
2019/04/05 15:06:41 rps: 398, done 51527...
2019/04/05 15:06:44 rps: 398, done 52727...
2019/04/05 15:06:47 rps: 398, done 53928...
2019/04/05 15:06:50 rps: 398, done 55128...
2019/04/05 15:06:53 rps: 398, done 56326...
2019/04/05 15:06:56 rps: 398, done 57525...
2019/04/05 15:06:59 rps: 398, done 58722...
2019/04/05 15:07:02 rps: 397, done 59920...
2019/04/05 15:07:05 rps: 398, done 61118...
2019/04/05 15:07:08 rps: 396, done 62316...
2019/04/05 15:07:11 rps: 398, done 63514...
2019/04/05 15:07:14 rps: 398, done 64714...
2019/04/05 15:07:17 rps: 398, done 65914...
2019/04/05 15:07:20 rps: 398, done 67112...
2019/04/05 15:07:23 rps: 397, done 68311...
2019/04/05 15:07:26 rps: 397, done 69509...
2019/04/05 15:07:29 rps: 398, done 70708...
2019/04/05 15:07:32 rps: 396, done 71905...
2019/04/05 15:07:35 rps: 398, done 73103...
2019/04/05 15:07:38 rps: 398, done 74302...
2019/04/05 15:07:41 rps: 399, done 75502...
2019/04/05 15:07:44 rps: 396, done 76698...
2019/04/05 15:07:47 rps: 396, done 77895...
2019/04/05 15:07:50 rps: 398, done 79094...
2019/04/05 15:07:53 rps: 396, done 80292...
2019/04/05 15:07:56 rps: 398, done 81492...
2019/04/05 15:07:59 rps: 398, done 82690...
2019/04/05 15:08:02 rps: 396, done 83885...
2019/04/05 15:08:05 rps: 398, done 85085...
2019/04/05 15:08:08 rps: 398, done 86285...
2019/04/05 15:08:11 rps: 398, done 87483...
2019/04/05 15:08:14 rps: 398, done 88683...
2019/04/05 15:08:17 rps: 396, done 89883...
2019/04/05 15:08:20 rps: 396, done 91081...
2019/04/05 15:08:23 rps: 399, done 92279...
2019/04/05 15:08:26 rps: 399, done 93479...
2019/04/05 15:08:29 rps: 397, done 94678...
2019/04/05 15:08:32 rps: 397, done 95879...
2019/04/05 15:08:35 rps: 397, done 97077...
2019/04/05 15:08:38 rps: 398, done 98275...
2019/04/05 15:08:41 rps: 398, done 99474...
2019/04/05 15:08:42 done 100000.
2019/04/05 15:08:42 timeout errors 0
2019/04/05 15:08:42 other errors 0
2019/04/05 15:08:42 0.500000th: 1.934276ms
2019/04/05 15:08:42 0.750000th: 2.516135ms
2019/04/05 15:08:42 0.900000th: 4.253012ms
2019/04/05 15:08:42 0.950000th: 5.731489ms
2019/04/05 15:08:42 0.990000th: 13.340903ms
2019/04/05 15:08:42 0.999000th: 25.990546ms
2019/04/05 15:08:42 1.000000th: 26.372503ms
2019/04/05 15:08:42
{
  "cluster-aggregated-stats": {
    "closed-connections": 25,
    "connections-attempts": 37,
    "connections-failed": 0,
    "connections-pool-empty": 0,
    "connections-successful": 37,
    "node-added-count": 11,
    "node-removed-count": 5,
    "open-connections": 12,
    "partition-map-updates": 6,
    "tends-failed": 0,
    "tends-successful": 1502,
    "tends-total": 1502
  },
  "open-connections": 12
}

After the patch: [screenshot]

$ /tmp/benito --threads=30 --requests=100000 --client_timeout=10s --max_retries=1 --socket_timeout=50ms --total_timeout=100ms --rps=400 --key=4e
2019/04/05 15:04:33 Benching [aerospike]...
2019/04/05 15:04:36 rps: 404, done 1199...
2019/04/05 15:04:39 rps: 398, done 2398...
2019/04/05 15:04:42 rps: 398, done 3595...
2019/04/05 15:04:45 rps: 398, done 4795...
2019/04/05 15:04:48 rps: 398, done 5994...
2019/04/05 15:04:51 rps: 398, done 7192...
2019/04/05 15:04:54 rps: 398, done 8392...
2019/04/05 15:04:57 rps: 399, done 9591...
2019/04/05 15:05:00 rps: 397, done 10791...
2019/04/05 15:05:03 rps: 395, done 11991...
2019/04/05 15:05:06 rps: 398, done 13187...
2019/04/05 15:05:09 rps: 397, done 14387...
2019/04/05 15:05:12 rps: 398, done 15584...
2019/04/05 15:05:15 rps: 397, done 16783...
2019/04/05 15:05:18 rps: 397, done 17979...
2019/04/05 15:05:21 rps: 398, done 19177...
2019/04/05 15:05:24 rps: 395, done 20377...
2019/04/05 15:05:27 rps: 397, done 21574...
2019/04/05 15:05:30 rps: 399, done 22774...
2019/04/05 15:05:33 rps: 397, done 23971...
2019/04/05 15:05:36 rps: 397, done 25169...
2019/04/05 15:05:39 rps: 398, done 26369...
2019/04/05 15:05:42 rps: 395, done 27568...
2019/04/05 15:05:45 rps: 398, done 28767...
2019/04/05 15:05:48 rps: 397, done 29966...
2019/04/05 15:05:51 rps: 396, done 31163...
2019/04/05 15:05:54 rps: 396, done 32363...
2019/04/05 15:05:57 rps: 399, done 33559...
2019/04/05 15:06:00 rps: 398, done 34759...
2019/04/05 15:06:03 rps: 397, done 35957...
2019/04/05 15:06:06 rps: 397, done 37157...
2019/04/05 15:06:09 rps: 398, done 38356...
2019/04/05 15:06:12 rps: 398, done 39556...
2019/04/05 15:06:15 rps: 398, done 40755...
2019/04/05 15:06:18 rps: 397, done 41955...
2019/04/05 15:06:21 rps: 398, done 43154...
2019/04/05 15:06:24 rps: 398, done 44353...
2019/04/05 15:06:27 rps: 396, done 45549...
2019/04/05 15:06:30 rps: 398, done 46748...
2019/04/05 15:06:33 rps: 397, done 47945...
2019/04/05 15:06:36 rps: 398, done 49142...
2019/04/05 15:06:39 rps: 396, done 50342...
2019/04/05 15:06:42 rps: 395, done 51540...
2019/04/05 15:06:45 rps: 397, done 52740...
2019/04/05 15:06:48 rps: 398, done 53940...
2019/04/05 15:06:51 rps: 398, done 55138...
2019/04/05 15:06:54 rps: 396, done 56337...
2019/04/05 15:06:57 rps: 397, done 57536...
2019/04/05 15:07:00 rps: 397, done 58736...
2019/04/05 15:07:03 rps: 397, done 59932...
2019/04/05 15:07:06 rps: 397, done 61131...
2019/04/05 15:07:09 rps: 396, done 62328...
2019/04/05 15:07:12 rps: 397, done 63527...
2019/04/05 15:07:15 rps: 397, done 64726...
2019/04/05 15:07:18 rps: 398, done 65926...
2019/04/05 15:07:21 rps: 397, done 67124...
2019/04/05 15:07:24 rps: 397, done 68325...
2019/04/05 15:07:27 rps: 397, done 69523...
2019/04/05 15:07:30 rps: 394, done 70720...
2019/04/05 15:07:33 rps: 397, done 71919...
2019/04/05 15:07:36 rps: 398, done 73118...
2019/04/05 15:07:39 rps: 397, done 74317...
2019/04/05 15:07:42 rps: 396, done 75513...
2019/04/05 15:07:45 rps: 395, done 76711...
2019/04/05 15:07:48 rps: 396, done 77910...
2019/04/05 15:07:51 rps: 396, done 79106...
2019/04/05 15:07:54 rps: 395, done 80305...
2019/04/05 15:07:57 rps: 397, done 81504...
2019/04/05 15:08:00 rps: 397, done 82703...
2019/04/05 15:08:03 rps: 397, done 83900...
2019/04/05 15:08:06 rps: 395, done 85099...
2019/04/05 15:08:09 rps: 398, done 86299...
2019/04/05 15:08:12 rps: 397, done 87495...
2019/04/05 15:08:15 rps: 397, done 88695...
2019/04/05 15:08:18 rps: 398, done 89894...
2019/04/05 15:08:21 rps: 398, done 91094...
2019/04/05 15:08:24 rps: 395, done 92292...
2019/04/05 15:08:27 rps: 397, done 93491...
2019/04/05 15:08:30 rps: 397, done 94691...
2019/04/05 15:08:33 rps: 397, done 95890...
2019/04/05 15:08:36 rps: 399, done 97089...
2019/04/05 15:08:39 rps: 396, done 98289...
2019/04/05 15:08:42 rps: 396, done 99486...
2019/04/05 15:08:43 done 100000.
2019/04/05 15:08:43 timeout errors 0
2019/04/05 15:08:43 other errors 0
2019/04/05 15:08:43 0.500000th: 2.073755ms
2019/04/05 15:08:43 0.750000th: 2.993537ms
2019/04/05 15:08:43 0.900000th: 5.120867ms
2019/04/05 15:08:43 0.950000th: 7.392001ms
2019/04/05 15:08:43 0.990000th: 14.158231ms
2019/04/05 15:08:43 0.999000th: 33.97195ms
2019/04/05 15:08:43 1.000000th: 37.474278ms
2019/04/05 15:08:43
{
  "cluster-aggregated-stats": {
    "closed-connections": 19,
    "connections-attempts": 32,
    "connections-failed": 0,
    "connections-pool-empty": 0,
    "connections-successful": 32,
    "node-added-count": 9,
    "node-removed-count": 3,
    "open-connections": 13,
    "partition-map-updates": 6,
    "tends-failed": 0,
    "tends-successful": 1506,
    "tends-total": 1506
  },
  "open-connections": 13
}

un000 commented 5 years ago

By the same methodology, the new version is better!

v2.0.0: [screenshot]

  "cluster-aggregated-stats": {
    "closed-connections": 16,
    "connections-attempts": 27,
    "connections-failed": 0,
    "connections-pool-empty": 0,
    "connections-successful": 27,
    "node-added-count": 10,
    "node-removed-count": 4,
    "open-connections": 11,
    "partition-map-updates": 6,
    "tends-failed": 0,
    "tends-successful": 1503,
    "tends-total": 1503
  },
  "open-connections": 11

v1.39.0: [screenshot]

  "cluster-aggregated-stats": {
    "closed-connections": 17,
    "connections-attempts": 29,
    "connections-failed": 0,
    "connections-pool-empty": 0,
    "connections-successful": 29,
    "node-added-count": 11,
    "node-removed-count": 5,
    "open-connections": 12,
    "partition-map-updates": 6,
    "tends-failed": 0,
    "tends-successful": 1503,
    "tends-total": 1503
  },
  "open-connections": 12

un000 commented 5 years ago

Thank you for your help. I think the problem is not in the new release; we need to deal with big records at a high RPS.

Connection pre-warming will be a great feature. 👍

un000 commented 5 years ago

Read latencies got better: [screenshot]

Write latencies didn't change at all.

Thank you!

khaf commented 5 years ago

Great! Do you still see the io.EOF errors? Did the warm-up feature help?

un000 commented 5 years ago

Well, I don't see increased error rates on deploy, so the warm-up helped a lot.

EOF errors aggregated across all servers (per minute): [screenshot]

Other errors (per minute): [screenshot]