dAvagimyan closed this issue 2 years ago.
The server config and the ClientPolicy both use default values.
I am using go tool pprof.
Interesting. Are you running a lot of queries? What does your app do?
My application is a DMP. Aerospike holds about 300-400 million records (user profiles). There are about 4 instances of the application. There are a lot of profile updates and reads via the API, at 15-20k RPS.
The hot path you have shared in the profile shows that you are running a lot of queries. What are they for? If you are using the default client policy, your app uses a pool of 256 connections per node, which means each node should show 256 x 4 connections, and it does. Nothing out of the ordinary here. What is the reason your app crashes? Do you have any logs?
As soon as the total number of connections approaches a thousand (I believe all 256 connections were created), the application crashes with timeout errors.
Also, what is the environment you are running your app in? Does it allow that many connections open by a process? What is the amount of RAM used by the client? Each connection starts with a 64 KiB memory buffer and grows it as needed. If you run a lot of queries, or simple CRUD with big records, those buffers will grow, and on systems with limited RAM the OS may terminate the process.
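As a rough back-of-the-envelope sketch of the buffer footprint (using the 5-node cluster from this thread and the default 256-connection queue; the numbers are illustrative lower bounds only):

```go
package main

import "fmt"

// Rough lower bound on client-side connection buffer memory:
// every pooled connection starts with a 64 KiB buffer.
func bufferBaselineMiB(nodes, connsPerNode, bufKiB int) int {
	totalKiB := nodes * connsPerNode * bufKiB
	return totalKiB / 1024
}

func main() {
	// 5 nodes x 256 connections x 64 KiB = 80 MiB minimum,
	// before any of the buffers grow to hold big records.
	fmt.Println(bufferBaselineMiB(5, 256, 64)) // prints 80
}
```

The real footprint can be much larger, since each buffer grows to fit the biggest record it has carried.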
Here are some examples of requests.
We use Docker containers. There are no RAM limitations inside the containers; the total limit is about 6 GB for each application node.
Another issue may be the timeout values. If you use very long or too short timeouts for transactions, or run a lot of queries that take a long time to complete, the app may run out of available connections very fast and experience many timeouts while waiting for connections.
Do you think increasing the connection number from 256 to 512 could help?
I don't know, but changing it may give us an idea of what is going on. It seems that your queries are taking a long time to acquire connections from the pool, so there is starvation going on. You may want to rate limit your queries (or all your transactions) to see if that is an actual issue.
Could you tell me where that could be set?
Keep in mind that a query is really calling the Client.Query API. It takes a connection to each node in parallel, so one call uses N connections (where N is the number of nodes). The same goes for the Scan and Batch APIs.
I'll try to do it tomorrow and then I will write to you here. Thanks for the idea.
You can also use ClientPolicy.MaxErrorRate and ClientPolicy.ErrorRateWindow to release the pressure from the client when you encounter a lot of errors in a short period of time.
The problem is that at night the RPS drops to between 500 and 1000 requests per second, but the number of connections continues to grow.
Now I am running my applications with this client policy:
clientPolicy.IdleTimeout = time.Second * 55
clientPolicy.ConnectionQueueSize = 512
clientPolicy.MaxErrorRate = 2
I think these values are too high. My suggestions:
clientPolicy.IdleTimeout = 5 * time.Second
clientPolicy.ConnectionQueueSize = 512
clientPolicy.MaxErrorRate = 100
If you have such a high TPS, your connections will not need to remain idle that long, and the server will already close idle connections after 10 seconds (unless you have changed it in your config), so you need a lower value.
MaxErrorRate should be much higher; otherwise the client will not send any requests to the node for the remainder of the ErrorRateWindow (1 second by default).
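Put together, the suggested settings would look like this (a sketch, not from the thread verbatim; `aerospike` is assumed to be the imported github.com/aerospike/aerospike-client-go package, and ErrorRateWindow is left at its default):

```go
clientPolicy := aerospike.NewClientPolicy()
clientPolicy.IdleTimeout = 5 * time.Second // below the server's 10 s idle cutoff
clientPolicy.ConnectionQueueSize = 512     // pool size per node
clientPolicy.MaxErrorRate = 100            // only trip on a real error burst
// clientPolicy.ErrorRateWindow stays at its default (1 second).
```

Keeping IdleTimeout below the server's idle cutoff lets the client retire idle connections before the server drops them out from under it.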
Thank you. I will try now.
It's just strange that when the RPS falls, the connections do not have time to close and continue to grow.
It is strange indeed. We will have to investigate further to find out what's causing this situation. What are your read and write policy timeout values?
I use default values.
I can post you more different charts.
Graph with your settings. The number of connections increases more slowly.
> I use default values.
That may be the issue. The default timeouts are WAY too lax; they are there just as starting points. The correct way to find the optimal timeouts is to benchmark your connection speeds, but as a starting point, for simple read, write, batch and operate commands:
TotalTimeout: 350 * time.Millisecond,
SocketTimeout: 100 * time.Millisecond,
MaxRetries: 2,
SleepBetweenRetries: 10 * time.Millisecond,
SleepMultiplier: 1.3,
For Scan and Query commands:
TotalTimeout: 0,
SocketTimeout: 500 * time.Millisecond,
MaxRetries: 2,
SleepBetweenRetries: 50 * time.Millisecond,
SleepMultiplier: 1.2,
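Applied to the Go client, the starting points above would look roughly like this (a sketch; `aerospike` is assumed to be the imported client package):

```go
// Read/write/operate policy with the suggested starting timeouts.
policy := aerospike.NewPolicy()
policy.TotalTimeout = 350 * time.Millisecond
policy.SocketTimeout = 100 * time.Millisecond
policy.MaxRetries = 2
policy.SleepBetweenRetries = 10 * time.Millisecond
policy.SleepMultiplier = 1.3

// Query/scan policy: no total cap, but a per-socket timeout so a stuck
// node cannot hold a connection forever.
queryPolicy := aerospike.NewQueryPolicy()
queryPolicy.TotalTimeout = 0
queryPolicy.SocketTimeout = 500 * time.Millisecond
queryPolicy.MaxRetries = 2
queryPolicy.SleepBetweenRetries = 50 * time.Millisecond
queryPolicy.SleepMultiplier = 1.2
```

These policies can then be passed per call, or set as the client's defaults.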
Ok. I will try it tomorrow. The strangest thing is that I have no problems in other projects using Aerospike.
Launched the services with the new values. I will monitor them and report the result here, most likely on Monday.
The number of Aerospike client goroutines is growing every second.
The code you shared does not reflect the hot path you are showing here. The profile is from a Query. Can you share that code and let me know how many records it returns and how frequently it is run? Keep in mind that query/batch/scan commands use N (= cluster node count) goroutines and connections per run, since they run against each cluster node in parallel. The more of these you have, the more goroutines and connections in flight you will have.
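As a self-contained illustration of how abandoned result streams leak goroutines (this is not the Aerospike client's actual code; `leakyQuery` is a hypothetical stand-in for a query whose results are never consumed):

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// leakyQuery simulates a query whose result channel the caller never
// drains or closes: the producer goroutine blocks forever on its send.
func leakyQuery() <-chan int {
	ch := make(chan int) // unbuffered: each send waits for a receiver
	go func() {
		for i := 0; i < 10; i++ {
			ch <- i // blocks forever once the caller walks away
		}
		close(ch)
	}()
	return ch
}

// measureLeaks abandons n "queries" and reports how many goroutines
// were left behind.
func measureLeaks(n int) int {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		_ = leakyQuery() // result never consumed
	}
	time.Sleep(100 * time.Millisecond) // let the producers start and block
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("leaked goroutines:", measureLeaks(100))
}
```

This prints roughly one leaked goroutine per abandoned query. In the Aerospike Go client the equivalent fix is to always drain the Recordset's Results() channel or call Recordset.Close() when done.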
I found the issue. A query resource was not being closed there. It was my mistake. Thank you for the help.
No problem, glad to hear the issue was resolved. Can you write down what you changed to resolve the issue (including the bit of code) to help me learn from your experience and also improve the docs in the future?
Of course, though it's embarrassing to admit: it was simply my inattention.
Hello! I have a problem. I have an Aerospike server cluster consisting of 5 nodes. The number of active connections keeps growing, and my Go service falls over upon reaching a certain number of active connections.
Driver version: github.com/aerospike/aerospike-client-go v4.5.2+incompatible. Aerospike server version: 5.5.0.3.
Please tell me what could be the problem.