Coming in the next release.
Also, we can't update to 2.0 because of timeout errors:
03:49:11 [50154]: {"level":"info","ts":"2019-04-04T03:49:11.206-0500","caller":"aerospike-logger/aerospike_logger.go:18","msg":"Error trying to making a connection to the node BB971F70942A844 ******.125:3000: dial tcp ******:3000: i/o timeout", "source":"aerospike"}
With timeouts increased from 25 ms to 50 ms on the new version, using the following config:
clientPolicy := aerospike.NewClientPolicy()
clientPolicy.IdleTimeout = 15*time.Second
clientPolicy.Timeout = 50*time.Millisecond
clientPolicy.ConnectionQueueSize = 1024
readPolicy := aerospike.NewPolicy()
readPolicy.TotalTimeout = 50*time.Millisecond
readPolicy.SocketTimeout = 50*time.Millisecond
readPolicy.MaxRetries = 1
writePolicy := aerospike.NewWritePolicy(0, uint32(a.Ttl.Seconds()))
writePolicy.TotalTimeout = 50*time.Millisecond
writePolicy.SocketTimeout = 50*time.Millisecond
30k rps per client instance for now
Admin> show latency
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~read Latency~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Node Time Ops/Sec >1Ms >8Ms >64Ms
. Span . . . .
***:3000 09:01:28->09:01:38 33538.5 1.35 0.0 0.0
***:3000 09:01:26->09:01:36 34697.4 1.39 0.0 0.0
***:3000 09:01:26->09:01:36 35405.3 1.47 0.0 0.0
***:3000 09:01:32->09:01:42 35304.6 1.4 0.0 0.0
***:3000 09:01:32->09:01:42 33464.2 1.42 0.0 0.0
***:3000 09:01:24->09:01:34 35163.5 1.38 0.0 0.0
Number of rows: 6
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~write Latency~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Node Time Ops/Sec >1Ms >8Ms >64Ms
. Span . . . .
***:3000 09:01:28->09:01:38 10352.1 2.21 0.03 0.0
***:3000 09:01:26->09:01:36 10607.5 2.53 0.19 0.02
***:3000 09:01:26->09:01:36 10931.2 2.96 0.01 0.0
***:3000 09:01:32->09:01:42 10756.0 2.42 0.0 0.0
***:3000 09:01:32->09:01:42 10076.3 1.85 0.0 0.0
***:3000 09:01:24->09:01:34 10771.5 2.23 0.03 0.01
Number of rows: 6
Is 50ms too low to establish a connection for the first time?
Yes, 50 ms can be too low for establishing a new connection (controlled via ClientPolicy.Timeout), since the client connects to many nodes and reads their partition info at the same time, and it only takes one node to time out.
I believe your ClientPolicy.Timeout and ClientPolicy.IdleTimeout values are way too aggressive. Those values should be at least 1 second, if not longer.
Well, the errors went away after setting 2 s timeouts. Read latency is lower on 2.0.
New errors: EOF and Broken Pipe
Is having so many timeouts normal?! 150,000 per minute?
I also need a bit more specifics regarding the timeouts. Which timeouts do you mean? ClientPolicy.Timeout controls tends and cluster formation. Policy.Timeout and WritePolicy.Timeout control specific transaction types. Which of these were changed?
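For reference, a minimal sketch of where each of these knobs lives, using the v2 field names from the configs quoted in this thread; the values are illustrative only, not recommendations:

package main

import (
	"time"

	aerospike "github.com/aerospike/aerospike-client-go"
)

func examplePolicies() (*aerospike.ClientPolicy, *aerospike.BasePolicy, *aerospike.WritePolicy) {
	// Cluster-level policy: Timeout only affects tends / node connections,
	// never individual reads or writes.
	clientPolicy := aerospike.NewClientPolicy()
	clientPolicy.Timeout = 5 * time.Second

	// Per-read policy: TotalTimeout is the overall budget for one transaction
	// (retries included); SocketTimeout is the per-attempt network deadline.
	readPolicy := aerospike.NewPolicy()
	readPolicy.TotalTimeout = 50 * time.Millisecond
	readPolicy.SocketTimeout = 50 * time.Millisecond

	// Per-write policy: same idea, configured independently of reads.
	writePolicy := aerospike.NewWritePolicy(0, 0)
	writePolicy.TotalTimeout = 50 * time.Millisecond
	writePolicy.SocketTimeout = 50 * time.Millisecond

	return clientPolicy, readPolicy, writePolicy
}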
Is having so many timeouts normal?! 150,000 per minute? At least it works :D But not normal.
New config:
clientPolicy := aerospike.NewClientPolicy()
clientPolicy.IdleTimeout = 45*time.Second
clientPolicy.Timeout = 25*time.Millisecond
clientPolicy.ConnectionQueueSize = 3072
readPolicy := aerospike.NewPolicy()
readPolicy.TotalTimeout = 25*time.Millisecond
readPolicy.SocketTimeout = 25*time.Millisecond
readPolicy.MaxRetries = 1
writePolicy := aerospike.NewWritePolicy(0, uint32(a.Ttl.Seconds()))
writePolicy.TotalTimeout = 25*time.Millisecond
writePolicy.SocketTimeout = 25*time.Millisecond
Maybe the problem is that the client discards the connection when a single error happens?
Do the following values fit into your SLA?
clientPolicy := aerospike.NewClientPolicy()
clientPolicy.IdleTimeout = 15*time.Second
clientPolicy.Timeout = 5 * time.Second // This value is NOT used in any transactions
clientPolicy.ConnectionQueueSize = 256 // 3072 is WAY too many. You should be fine with as few as 64, unless you use a lot of Batch or Scan/Query
readPolicy := aerospike.NewPolicy()
readPolicy.TotalTimeout = 50*time.Millisecond
readPolicy.SocketTimeout = 50*time.Millisecond
readPolicy.MaxRetries = 1
writePolicy := aerospike.NewWritePolicy(0, uint32(a.Ttl.Seconds()))
writePolicy.TotalTimeout = 50*time.Millisecond
writePolicy.SocketTimeout = 25*time.Millisecond
150K timeouts per minute is absolutely huge. You should focus on finding the cause of a situation like that.
Traces look good
One thing that comes to my mind is that this may be due to the client dropping nodes because of a very low ClientPolicy.Timeout value. Set it to 5 seconds and see whether the number of timeouts changes significantly.
@khaf ty, yes, they do. I will try your config tomorrow.
We increased the number of connections because of too many NO_AVAILABLE_CONNECTIONS_TO_NODE errors.
Just another quick question: Were you seeing the same number of timeouts with the old client version?
I introduced a new metric to count them. I can try it with the older version to compare in an A/B test. We have been seeing those errors anyway.
Yes, I think that would be a good idea. I also see that node.getConnectionWithHint latency is very high, which probably means you have a connection churn problem that leads to getConnection creating a new connection to the node for most transactions.
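If it helps, a small helper along these lines could be used to snapshot the client-side counters and watch for churn over time. This assumes the same stats call that produced the cluster-aggregated-stats dumps quoted later in this thread; treat it as a sketch, not an official troubleshooting step:

package main

import (
	"encoding/json"
	"log"
	"time"

	aerospike "github.com/aerospike/aerospike-client-go"
)

// logConnStats periodically dumps the client's aggregated statistics so that
// counters such as closed-connections and connections-attempts can be compared
// between runs. Assumes Client.Stats() is available in the client version used.
func logConnStats(client *aerospike.Client, every time.Duration) {
	for range time.Tick(every) {
		stats, err := client.Stats()
		if err != nil {
			log.Println("stats error:", err)
			continue
		}
		b, _ := json.MarshalIndent(stats, "", "  ")
		log.Printf("client stats:\n%s", b)
	}
}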
Something went wrong with this config.
Yes, it does not make sense. I need to think a bit about this.
According to a synthetic benchmark, there is no difference between the versions. I think the problem is with the network.
Can you also try the benchmarks with v1.39.0? No difference there as well?
1 Fat record ~ blocksize
$ /tmp/benito --threads=30 --requests=50000 --client_timeout=10s --max_retries=1 --socket_timeout=50ms --total_timeout=100ms
2019/04/04 16:57:40 Benching [aerospike]...
2019/04/04 16:57:43 rps: 735, done 2199...
2019/04/04 16:57:46 rps: 729, done 4407...
2019/04/04 16:57:49 rps: 759, done 6643...
2019/04/04 16:57:52 rps: 716, done 8826...
2019/04/04 16:57:55 rps: 758, done 11071...
2019/04/04 16:57:58 rps: 733, done 13259...
2019/04/04 16:58:01 rps: 721, done 15459...
2019/04/04 16:58:04 rps: 720, done 17581...
2019/04/04 16:58:07 rps: 740, done 19809...
2019/04/04 16:58:10 rps: 758, done 22049...
2019/04/04 16:58:13 rps: 748, done 24291...
2019/04/04 16:58:16 rps: 781, done 26612...
2019/04/04 16:58:19 rps: 742, done 28881...
2019/04/04 16:58:22 rps: 743, done 31099...
2019/04/04 16:58:25 rps: 722, done 33269...
2019/04/04 16:58:28 rps: 745, done 35515...
2019/04/04 16:58:31 rps: 730, done 37702...
2019/04/04 16:58:34 rps: 738, done 39946...
2019/04/04 16:58:37 rps: 745, done 42189...
2019/04/04 16:58:40 rps: 742, done 44424...
2019/04/04 16:58:43 rps: 756, done 46700...
2019/04/04 16:58:46 rps: 730, done 48929...
2019/04/04 16:58:48 done 50030.
2019/04/04 16:58:48 timeout errors 711
2019/04/04 16:58:48 other errors 0
2019/04/04 16:58:48 0.500000th: 40.409219ms
2019/04/04 16:58:48 0.750000th: 42.895522ms
2019/04/04 16:58:48 0.900000th: 45.213426ms
2019/04/04 16:58:48 0.950000th: 47.138989ms
2019/04/04 16:58:48 0.990000th: 50.090118ms
2019/04/04 16:58:48 0.999000th: 50.608631ms
2019/04/04 16:58:48 1.000000th: 79.415351ms

$ /tmp/benito --threads=30 --requests=50000 --client_timeout=10s --max_retries=1 --socket_timeout=50ms --total_timeout=100ms
2019/04/04 16:56:20 Benching [aerospike]...
2019/04/04 16:56:23 rps: 797, done 2429...
2019/04/04 16:56:26 rps: 767, done 4732...
2019/04/04 16:56:29 rps: 775, done 7056...
2019/04/04 16:56:32 rps: 787, done 9390...
2019/04/04 16:56:35 rps: 784, done 11763...
2019/04/04 16:56:38 rps: 792, done 14142...
2019/04/04 16:56:41 rps: 797, done 16540...
2019/04/04 16:56:44 rps: 801, done 18936...
2019/04/04 16:56:47 rps: 777, done 21295...
2019/04/04 16:56:50 rps: 787, done 23644...
2019/04/04 16:56:53 rps: 772, done 26008...
2019/04/04 16:56:56 rps: 737, done 28270...
2019/04/04 16:56:59 rps: 781, done 30593...
2019/04/04 16:57:02 rps: 747, done 32850...
2019/04/04 16:57:05 rps: 747, done 35100...
2019/04/04 16:57:08 rps: 722, done 37331...
2019/04/04 16:57:11 rps: 738, done 39520...
2019/04/04 16:57:14 rps: 736, done 41745...
2019/04/04 16:57:17 rps: 750, done 43975...
2019/04/04 16:57:20 rps: 764, done 46265...
2019/04/04 16:57:23 rps: 774, done 48599...
2019/04/04 16:57:25 done 50030.
2019/04/04 16:57:25 timeout errors 122
2019/04/04 16:57:25 other errors 0
2019/04/04 16:57:25 0.500000th: 38.637913ms
2019/04/04 16:57:25 0.750000th: 41.114454ms
2019/04/04 16:57:25 0.900000th: 43.551327ms
2019/04/04 16:57:25 0.950000th: 45.420969ms
2019/04/04 16:57:25 0.990000th: 51.050058ms
2019/04/04 16:57:25 0.999000th: 84.91755ms
2019/04/04 16:57:25 1.000000th: 100.124861ms
Timeout errors vary between runs, but the order of magnitude is similar.
bench is here https://github.com/un000/benito
Did you make any progress on this issue? I held off a release yesterday to see if we can get to the bottom of this.
So, the problem is that timeouts happen with fat records (~128 KB in size); with records under 1 KB they don't. The timeouts need to be adjusted somehow for this case.
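One possible adjustment (purely a sketch with illustrative numbers, not something tested in this thread) is a dedicated, more generous read policy for the fat-record path, while the tight policy stays in place for small records:

package main

import (
	"log"
	"time"

	aerospike "github.com/aerospike/aerospike-client-go"
)

func readFatRecord(client *aerospike.Client) {
	// Separate policy for records around the block size (~128 KB); the 200 ms
	// budget is an illustrative guess, not a value taken from this thread.
	fatReadPolicy := aerospike.NewPolicy()
	fatReadPolicy.SocketTimeout = 200 * time.Millisecond
	fatReadPolicy.TotalTimeout = 200 * time.Millisecond
	fatReadPolicy.MaxRetries = 1

	// Hypothetical namespace/set/key, used only for illustration.
	key, err := aerospike.NewKey("test", "demo", "fat-record")
	if err != nil {
		log.Fatal(err)
	}
	if _, err := client.Get(fatReadPolicy, key); err != nil {
		log.Println("fat record read failed:", err)
	}
}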
Results are not good, see below.
v2.0.0
small record:
"cluster-aggregated-stats": {
"closed-connections": 0,
"connections-attempts": 14,
"connections-failed": 0,
"connections-pool-empty": 0,
"connections-successful": 14,
"node-added-count": 10,
"node-removed-count": 4,
"open-connections": 14,
"partition-map-updates": 6,
"tends-failed": 0,
"tends-successful": 142,
"tends-total": 142
},
"open-connections": 14
fat record:
"cluster-aggregated-stats": {
"closed-connections": 5,
"connections-attempts": 41,
"connections-failed": 5,
"connections-pool-empty": 0,
"connections-successful": 41,
"node-added-count": 7,
"node-removed-count": 1,
"open-connections": 36,
"partition-map-updates": 6,
"tends-failed": 0,
"tends-successful": 198,
"tends-total": 198
},
"open-connections": 36
v1.39.0
small record:
"cluster-aggregated-stats": {
"closed-connections": 0,
"connections-attempts": 11,
"connections-failed": 0,
"connections-pool-empty": 0,
"connections-successful": 11,
"node-added-count": 10,
"node-removed-count": 4,
"open-connections": 11,
"partition-map-updates": 6,
"tends-failed": 0,
"tends-successful": 141,
"tends-total": 141
},
"open-connections": 11
fat record:
"cluster-aggregated-stats": {
"closed-connections": 3,
"connections-attempts": 39,
"connections-failed": 3,
"connections-pool-empty": 0,
"connections-successful": 39,
"node-added-count": 10,
"node-removed-count": 4,
"open-connections": 36,
"partition-map-updates": 6,
"tends-failed": 0,
"tends-successful": 141,
"tends-total": 141
},
"open-connections": 36
This is great! Thank you for a very thorough investigation. What seems odd to me is the performance spread for the 'fat' records on the older version. The v2 is more consistent, while the older version seems snappier overall. What is the machine you are testing on? Are the client or the server in the cloud or on a VM? Can we write off this spread to normal test variation, or is it consistent over multiple runs?
I think the main difference is that the SocketTimeout's behaviour differs between versions.
I tested it on bare metal; the client and server have the same configuration and sit on the same local network, and AS is configured with persistence to SSD:
Model name: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
CPU(s): 32
Interfaces are paired with bond 2GB:
BONDING_OPTS="mode=802.3ad miimon=100 xmit_hash_policy=layer3+4"
The spread is consistent between runs (the numbers of closed-connections and connections-failed are nearly the same).
I think the main difference is that the SocketTimeout's behaviour differs between versions.
This may be the case, depending on how the runtime sets the timeouts. Before v2, we just set the timeout once and the client would try to do everything within that boundary. In v2, it's a rolling timeout for each read or write, so that the client minimizes the chances of a timeout. But it seems there is a side effect in real-world use, seemingly because setting the timeout on the connection counts as an I/O op and a goroutine yield. This does not show up in my test setups, but it could be the culprit.
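Roughly, the difference between the two strategies looks like this; it's a simplified sketch using the standard library's net.Conn, not the client's actual code:

package main

import (
	"io"
	"net"
	"time"
)

// Pre-v2 behaviour (simplified): one absolute deadline for the whole transaction.
func readWithSingleDeadline(conn net.Conn, buf []byte, total time.Duration) error {
	if err := conn.SetDeadline(time.Now().Add(total)); err != nil {
		return err
	}
	_, err := io.ReadFull(conn, buf)
	return err
}

// v2 behaviour (simplified): a rolling deadline, refreshed before every read.
// Each SetDeadline call adds a little work and a scheduling point, which can
// add up for fat records that take many reads to drain.
func readWithRollingDeadline(conn net.Conn, buf []byte, perOp time.Duration) error {
	for read := 0; read < len(buf); {
		if err := conn.SetDeadline(time.Now().Add(perOp)); err != nil {
			return err
		}
		n, err := conn.Read(buf[read:])
		if err != nil {
			return err
		}
		read += n
	}
	return nil
}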
Would you be open to testing this hypothesis by using modified client source code locally? I can share a gist.
here you go: https://gist.github.com/khaf/6f6bf72575e8decdd5b0d58ec93e0424
Just replace the contents of connection.go in v2 with the above gist.
before: [graph]
after: [graph]
Also, without the patch RPS was higher (900 RPS vs 600 RPS max).
Damn, the timings are inconsistent with each other.
You sure about the before and after? The after looks like before from the prior report.
I've run the bench from 2 hosts in parallel.
Before the patch, at half of peak RPS:
$ /tmp/benito --threads=30 --requests=100000 --client_timeout=10s --max_retries=1 --socket_timeout=50ms --total_timeout=100ms --rps=400 --key=4e
2019/04/05 15:04:32 Benching [aerospike]...
2019/04/05 15:04:35 rps: 402, done 1197...
2019/04/05 15:04:38 rps: 395, done 2395...
2019/04/05 15:04:41 rps: 397, done 3593...
2019/04/05 15:04:44 rps: 398, done 4794...
2019/04/05 15:04:47 rps: 398, done 5991...
2019/04/05 15:04:50 rps: 399, done 7190...
2019/04/05 15:04:53 rps: 398, done 8389...
2019/04/05 15:04:56 rps: 396, done 9586...
2019/04/05 15:04:59 rps: 396, done 10782...
2019/04/05 15:05:02 rps: 398, done 11982...
2019/04/05 15:05:05 rps: 396, done 13180...
2019/04/05 15:05:08 rps: 398, done 14380...
2019/04/05 15:05:11 rps: 398, done 15574...
2019/04/05 15:05:14 rps: 398, done 16770...
2019/04/05 15:05:17 rps: 397, done 17970...
2019/04/05 15:05:20 rps: 398, done 19169...
2019/04/05 15:05:23 rps: 396, done 20369...
2019/04/05 15:05:26 rps: 397, done 21569...
2019/04/05 15:05:29 rps: 397, done 22767...
2019/04/05 15:05:32 rps: 397, done 23966...
2019/04/05 15:05:35 rps: 399, done 25166...
2019/04/05 15:05:38 rps: 397, done 26363...
2019/04/05 15:05:41 rps: 398, done 27561...
2019/04/05 15:05:44 rps: 397, done 28760...
2019/04/05 15:05:47 rps: 397, done 29960...
2019/04/05 15:05:50 rps: 398, done 31158...
2019/04/05 15:05:53 rps: 397, done 32359...
2019/04/05 15:05:56 rps: 397, done 33556...
2019/04/05 15:05:59 rps: 397, done 34756...
2019/04/05 15:06:02 rps: 398, done 35957...
2019/04/05 15:06:05 rps: 398, done 37157...
2019/04/05 15:06:08 rps: 396, done 38352...
2019/04/05 15:06:11 rps: 396, done 39549...
2019/04/05 15:06:14 rps: 399, done 40749...
2019/04/05 15:06:17 rps: 395, done 41945...
2019/04/05 15:06:20 rps: 396, done 43143...
2019/04/05 15:06:23 rps: 399, done 44341...
2019/04/05 15:06:26 rps: 398, done 45539...
2019/04/05 15:06:29 rps: 398, done 46737...
2019/04/05 15:06:32 rps: 396, done 47933...
2019/04/05 15:06:35 rps: 398, done 49132...
2019/04/05 15:06:38 rps: 398, done 50330...
2019/04/05 15:06:41 rps: 398, done 51527...
2019/04/05 15:06:44 rps: 398, done 52727...
2019/04/05 15:06:47 rps: 398, done 53928...
2019/04/05 15:06:50 rps: 398, done 55128...
2019/04/05 15:06:53 rps: 398, done 56326...
2019/04/05 15:06:56 rps: 398, done 57525...
2019/04/05 15:06:59 rps: 398, done 58722...
2019/04/05 15:07:02 rps: 397, done 59920...
2019/04/05 15:07:05 rps: 398, done 61118...
2019/04/05 15:07:08 rps: 396, done 62316...
2019/04/05 15:07:11 rps: 398, done 63514...
2019/04/05 15:07:14 rps: 398, done 64714...
2019/04/05 15:07:17 rps: 398, done 65914...
2019/04/05 15:07:20 rps: 398, done 67112...
2019/04/05 15:07:23 rps: 397, done 68311...
2019/04/05 15:07:26 rps: 397, done 69509...
2019/04/05 15:07:29 rps: 398, done 70708...
2019/04/05 15:07:32 rps: 396, done 71905...
2019/04/05 15:07:35 rps: 398, done 73103...
2019/04/05 15:07:38 rps: 398, done 74302...
2019/04/05 15:07:41 rps: 399, done 75502...
2019/04/05 15:07:44 rps: 396, done 76698...
2019/04/05 15:07:47 rps: 396, done 77895...
2019/04/05 15:07:50 rps: 398, done 79094...
2019/04/05 15:07:53 rps: 396, done 80292...
2019/04/05 15:07:56 rps: 398, done 81492...
2019/04/05 15:07:59 rps: 398, done 82690...
2019/04/05 15:08:02 rps: 396, done 83885...
2019/04/05 15:08:05 rps: 398, done 85085...
2019/04/05 15:08:08 rps: 398, done 86285...
2019/04/05 15:08:11 rps: 398, done 87483...
2019/04/05 15:08:14 rps: 398, done 88683...
2019/04/05 15:08:17 rps: 396, done 89883...
2019/04/05 15:08:20 rps: 396, done 91081...
2019/04/05 15:08:23 rps: 399, done 92279...
2019/04/05 15:08:26 rps: 399, done 93479...
2019/04/05 15:08:29 rps: 397, done 94678...
2019/04/05 15:08:32 rps: 397, done 95879...
2019/04/05 15:08:35 rps: 397, done 97077...
2019/04/05 15:08:38 rps: 398, done 98275...
2019/04/05 15:08:41 rps: 398, done 99474...
2019/04/05 15:08:42 done 100000.
2019/04/05 15:08:42 timeout errors 0
2019/04/05 15:08:42 other errors 0
2019/04/05 15:08:42 0.500000th: 1.934276ms
2019/04/05 15:08:42 0.750000th: 2.516135ms
2019/04/05 15:08:42 0.900000th: 4.253012ms
2019/04/05 15:08:42 0.950000th: 5.731489ms
2019/04/05 15:08:42 0.990000th: 13.340903ms
2019/04/05 15:08:42 0.999000th: 25.990546ms
2019/04/05 15:08:42 1.000000th: 26.372503ms
2019/04/05 15:08:42
{
"cluster-aggregated-stats": {
"closed-connections": 25,
"connections-attempts": 37,
"connections-failed": 0,
"connections-pool-empty": 0,
"connections-successful": 37,
"node-added-count": 11,
"node-removed-count": 5,
"open-connections": 12,
"partition-map-updates": 6,
"tends-failed": 0,
"tends-successful": 1502,
"tends-total": 1502
},
"open-connections": 12
}
After the patch:
$ /tmp/benito --threads=30 --requests=100000 --client_timeout=10s --max_retries=1 --socket_timeout=50ms --total_timeout=100ms --rps=400 --key=4e
2019/04/05 15:04:33 Benching [aerospike]...
2019/04/05 15:04:36 rps: 404, done 1199...
2019/04/05 15:04:39 rps: 398, done 2398...
2019/04/05 15:04:42 rps: 398, done 3595...
2019/04/05 15:04:45 rps: 398, done 4795...
2019/04/05 15:04:48 rps: 398, done 5994...
2019/04/05 15:04:51 rps: 398, done 7192...
2019/04/05 15:04:54 rps: 398, done 8392...
2019/04/05 15:04:57 rps: 399, done 9591...
2019/04/05 15:05:00 rps: 397, done 10791...
2019/04/05 15:05:03 rps: 395, done 11991...
2019/04/05 15:05:06 rps: 398, done 13187...
2019/04/05 15:05:09 rps: 397, done 14387...
2019/04/05 15:05:12 rps: 398, done 15584...
2019/04/05 15:05:15 rps: 397, done 16783...
2019/04/05 15:05:18 rps: 397, done 17979...
2019/04/05 15:05:21 rps: 398, done 19177...
2019/04/05 15:05:24 rps: 395, done 20377...
2019/04/05 15:05:27 rps: 397, done 21574...
2019/04/05 15:05:30 rps: 399, done 22774...
2019/04/05 15:05:33 rps: 397, done 23971...
2019/04/05 15:05:36 rps: 397, done 25169...
2019/04/05 15:05:39 rps: 398, done 26369...
2019/04/05 15:05:42 rps: 395, done 27568...
2019/04/05 15:05:45 rps: 398, done 28767...
2019/04/05 15:05:48 rps: 397, done 29966...
2019/04/05 15:05:51 rps: 396, done 31163...
2019/04/05 15:05:54 rps: 396, done 32363...
2019/04/05 15:05:57 rps: 399, done 33559...
2019/04/05 15:06:00 rps: 398, done 34759...
2019/04/05 15:06:03 rps: 397, done 35957...
2019/04/05 15:06:06 rps: 397, done 37157...
2019/04/05 15:06:09 rps: 398, done 38356...
2019/04/05 15:06:12 rps: 398, done 39556...
2019/04/05 15:06:15 rps: 398, done 40755...
2019/04/05 15:06:18 rps: 397, done 41955...
2019/04/05 15:06:21 rps: 398, done 43154...
2019/04/05 15:06:24 rps: 398, done 44353...
2019/04/05 15:06:27 rps: 396, done 45549...
2019/04/05 15:06:30 rps: 398, done 46748...
2019/04/05 15:06:33 rps: 397, done 47945...
2019/04/05 15:06:36 rps: 398, done 49142...
2019/04/05 15:06:39 rps: 396, done 50342...
2019/04/05 15:06:42 rps: 395, done 51540...
2019/04/05 15:06:45 rps: 397, done 52740...
2019/04/05 15:06:48 rps: 398, done 53940...
2019/04/05 15:06:51 rps: 398, done 55138...
2019/04/05 15:06:54 rps: 396, done 56337...
2019/04/05 15:06:57 rps: 397, done 57536...
2019/04/05 15:07:00 rps: 397, done 58736...
2019/04/05 15:07:03 rps: 397, done 59932...
2019/04/05 15:07:06 rps: 397, done 61131...
2019/04/05 15:07:09 rps: 396, done 62328...
2019/04/05 15:07:12 rps: 397, done 63527...
2019/04/05 15:07:15 rps: 397, done 64726...
2019/04/05 15:07:18 rps: 398, done 65926...
2019/04/05 15:07:21 rps: 397, done 67124...
2019/04/05 15:07:24 rps: 397, done 68325...
2019/04/05 15:07:27 rps: 397, done 69523...
2019/04/05 15:07:30 rps: 394, done 70720...
2019/04/05 15:07:33 rps: 397, done 71919...
2019/04/05 15:07:36 rps: 398, done 73118...
2019/04/05 15:07:39 rps: 397, done 74317...
2019/04/05 15:07:42 rps: 396, done 75513...
2019/04/05 15:07:45 rps: 395, done 76711...
2019/04/05 15:07:48 rps: 396, done 77910...
2019/04/05 15:07:51 rps: 396, done 79106...
2019/04/05 15:07:54 rps: 395, done 80305...
2019/04/05 15:07:57 rps: 397, done 81504...
2019/04/05 15:08:00 rps: 397, done 82703...
2019/04/05 15:08:03 rps: 397, done 83900...
2019/04/05 15:08:06 rps: 395, done 85099...
2019/04/05 15:08:09 rps: 398, done 86299...
2019/04/05 15:08:12 rps: 397, done 87495...
2019/04/05 15:08:15 rps: 397, done 88695...
2019/04/05 15:08:18 rps: 398, done 89894...
2019/04/05 15:08:21 rps: 398, done 91094...
2019/04/05 15:08:24 rps: 395, done 92292...
2019/04/05 15:08:27 rps: 397, done 93491...
2019/04/05 15:08:30 rps: 397, done 94691...
2019/04/05 15:08:33 rps: 397, done 95890...
2019/04/05 15:08:36 rps: 399, done 97089...
2019/04/05 15:08:39 rps: 396, done 98289...
2019/04/05 15:08:42 rps: 396, done 99486...
2019/04/05 15:08:43 done 100000.
2019/04/05 15:08:43 timeout errors 0
2019/04/05 15:08:43 other errors 0
2019/04/05 15:08:43 0.500000th: 2.073755ms
2019/04/05 15:08:43 0.750000th: 2.993537ms
2019/04/05 15:08:43 0.900000th: 5.120867ms
2019/04/05 15:08:43 0.950000th: 7.392001ms
2019/04/05 15:08:43 0.990000th: 14.158231ms
2019/04/05 15:08:43 0.999000th: 33.97195ms
2019/04/05 15:08:43 1.000000th: 37.474278ms
2019/04/05 15:08:43
{
"cluster-aggregated-stats": {
"closed-connections": 19,
"connections-attempts": 32,
"connections-failed": 0,
"connections-pool-empty": 0,
"connections-successful": 32,
"node-added-count": 9,
"node-removed-count": 3,
"open-connections": 13,
"partition-map-updates": 6,
"tends-failed": 0,
"tends-successful": 1506,
"tends-total": 1506
},
"open-connections": 13
}
By the same methodology the new version is better!
v2.0.0:
"cluster-aggregated-stats": {
"closed-connections": 16,
"connections-attempts": 27,
"connections-failed": 0,
"connections-pool-empty": 0,
"connections-successful": 27,
"node-added-count": 10,
"node-removed-count": 4,
"open-connections": 11,
"partition-map-updates": 6,
"tends-failed": 0,
"tends-successful": 1503,
"tends-total": 1503
},
"open-connections": 11
v1.39.0:
"cluster-aggregated-stats": {
"closed-connections": 17,
"connections-attempts": 29,
"connections-failed": 0,
"connections-pool-empty": 0,
"connections-successful": 29,
"node-added-count": 11,
"node-removed-count": 5,
"open-connections": 12,
"partition-map-updates": 6,
"tends-failed": 0,
"tends-successful": 1503,
"tends-total": 1503
},
"open-connections": 12
Thank you for your help; I think the problem is not in the new release. We need to fight big records at high RPS.
Connections pre-warmup will be a great feature. 👍
Great! Do you still see the io.EOF errors? Did the Warm Up feature help?
Well, I don't see increased error rates on deploy, so warmup helped a lot.
EOF errors aggregated from all servers (rpm): [graph]
Other errors (rpm): [graph]
When starting a service, we got many errors like:
We have around 40k QPS per instance going into Aerospike, so connections are not warmed up at start. It takes a lot of time to establish new ones; increasing the queue length helps a bit.
So it would be good to warm up the connection queue at start via some option.
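For readers landing here later: the client grew a warm-up call for exactly this, so the pool can be filled before traffic arrives. A minimal sketch follows; the exact WarmUp signature and the connection count should be checked against the client version in use, and the host below is hypothetical:

package main

import (
	"log"
	"time"

	aerospike "github.com/aerospike/aerospike-client-go"
)

func main() {
	clientPolicy := aerospike.NewClientPolicy()
	clientPolicy.ConnectionQueueSize = 256
	clientPolicy.IdleTimeout = 15 * time.Second

	// Hypothetical seed node; replace with a real host.
	client, err := aerospike.NewClientWithPolicy(clientPolicy, "aerospike-host", 3000)
	if err != nil {
		log.Fatal(err)
	}

	// Pre-open connections on every node before serving traffic, so the first
	// requests don't pay the connection-establishment cost. WarmUp is assumed
	// to take a per-node connection count and return how many were created.
	warmed, err := client.WarmUp(128)
	if err != nil {
		log.Printf("warm-up finished with error: %v", err)
	}
	log.Printf("pre-opened %d connections", warmed)
}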