aerospike / aerospike-client-go

Growing active connections #358

Closed: dAvagimyan closed this issue 2 years ago

dAvagimyan commented 2 years ago

Hello! I have a problem. I have a cluster of Aerospike servers consisting of 5 nodes. The number of active connections keeps growing, and my Go service falls over once it reaches a certain number of active connections.

Driver version: github.com/aerospike/aerospike-client-go v4.5.2+incompatible
Aerospike server version: 5.5.0.3

Please tell me what could be the problem.

dAvagimyan commented 2 years ago

The server config and the ClientPolicy both use the defaults.

dAvagimyan commented 2 years ago

[screenshot]

dAvagimyan commented 2 years ago

[screenshot] I am using go tool pprof.

khaf commented 2 years ago

Interesting. Are you using a lot of queries? What does your app do?

dAvagimyan commented 2 years ago

My application is a DMP. Aerospike holds about 300-400 million records (user profiles). There are 4 instances of the application, with a lot of profile updates and reads via the API. RPS is 15-20k.

khaf commented 2 years ago

The hot path you have shared in the profile shows that you are running a lot of queries. What are they for? If you are using the default client policy, your app uses a pool of 256 connections per node, which means each node should show 256 x 4 connections (from your 4 app instances), and it does. Nothing out of the ordinary here. Why does your app crash? Do you have any logs?
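For reference, a minimal sketch of where that pool size lives (import path matches the driver version reported above; the seed host and port are placeholders, not from this issue):

    package main

    import (
        "log"

        aero "github.com/aerospike/aerospike-client-go"
    )

    func main() {
        // NewClientPolicy returns the defaults; ConnectionQueueSize is the
        // per-node connection pool size, 256 by default.
        policy := aero.NewClientPolicy()
        log.Printf("pool size per node: %d", policy.ConnectionQueueSize)

        // Placeholder seed host for this sketch.
        client, err := aero.NewClientWithPolicy(policy, "127.0.0.1", 3000)
        if err != nil {
            log.Fatal(err)
        }
        defer client.Close()
    }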

dAvagimyan commented 2 years ago

As soon as the total number of connections approaches a thousand (I believe all 256 connections have been created), the application crashes with timeout errors.

khaf commented 2 years ago

Also, what environment are you running your app in? Does it allow a process to hold that many open connections? How much RAM does the client use? Each connection starts with a 64 KiB memory buffer and grows it as needed. If you use a lot of queries, or simple CRUD with big records, those buffers will grow, and on systems with limited RAM the OS may terminate the process.

dAvagimyan commented 2 years ago

[screenshot] Here are some examples of requests.

dAvagimyan commented 2 years ago

We use Docker containers. There are no per-container RAM limits; the total limit is about 6 GB for each application node.

khaf commented 2 years ago

Another issue may be the timeout values. If you use very long or very short timeouts for transactions, or run a lot of queries that take a long time to complete, the app may run out of available connections very quickly and see a lot of timeouts while waiting for connections.

dAvagimyan commented 2 years ago

Do you think raising the connection count from 256 to 512 could help?

khaf commented 2 years ago

I don't know, but changing it may give us an idea of what is going on. It seems that your queries are taking a long time to acquire connections from the pool, so there is starvation going on. You may want to rate limit your queries (or all your transactions) to see if that is an actual issue.
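As an illustration of that suggestion, a sketch of rate limiting with a simple semaphore; the limit of 100 and the helper name are made up for the example, and it assumes the aero alias and client from the sketch above:

    // sem caps the number of in-flight transactions; 100 is illustrative.
    var sem = make(chan struct{}, 100)

    func rateLimitedGet(client *aero.Client, key *aero.Key) (*aero.Record, error) {
        sem <- struct{}{}        // acquire a slot; blocks when the cap is reached
        defer func() { <-sem }() // release the slot when the call finishes
        return client.Get(nil, key) // nil means the default read policy
    }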

dAvagimyan commented 2 years ago

Could you tell me where that can be set?

khaf commented 2 years ago

Keep in mind that a query really means calling the Client.Query API. It takes a connection to each node in parallel, so one call takes N connections (N being the number of nodes). The same goes for the Scan and Batch APIs.
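To make that concrete, a sketch of a single Client.Query call (the namespace, set, and bin names are hypothetical); this one call holds a connection per cluster node until the Recordset is drained and closed:

    stmt := aero.NewStatement("test", "profiles", "segments") // hypothetical names
    rs, err := client.Query(nil, stmt)                        // fans out to every node
    if err != nil {
        log.Fatal(err)
    }
    defer rs.Close() // releases the per-node connections and goroutines

    for res := range rs.Results() {
        if res.Err != nil {
            log.Println(res.Err)
            continue
        }
        _ = res.Record // process the record here
    }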

dAvagimyan commented 2 years ago

I'll try to do it tomorrow and then I will write back here. Thanks for the idea.

khaf commented 2 years ago

You can also use ClientPolicy.MaxErrorRate and ClientPolicy.ErrorRateWindow to relieve the pressure from the client when you encounter a lot of errors in a short period of time.
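A minimal sketch of setting those two fields (the values are illustrative; per the discussion below, the window is measured in cluster tend intervals, about one second each):

    policy := aero.NewClientPolicy()
    policy.MaxErrorRate = 100  // errors tolerated per node per window
    policy.ErrorRateWindow = 1 // window length, in tend intervals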

dAvagimyan commented 2 years ago

The problem is that at night the RPS drops to between 500 and 1000 requests per second, but the number of connections continues to grow. [screenshots]

dAvagimyan commented 2 years ago

Now I am running my applications with this policy config:

clientPolicy.IdleTimeout = time.Second * 55
clientPolicy.ConnectionQueueSize = 512
clientPolicy.MaxErrorRate = 2

khaf commented 2 years ago

I think these values are too high. My suggestions:

clientPolicy.IdleTimeout = 5 * time.Second 
clientPolicy.ConnectionQueueSize = 512
clientPolicy.MaxErrorRate = 100

If you have such high TPS, your connections will not need to remain idle that long, and the server will already close idle connections after 10 seconds (unless you have changed that in your config), so you need a lower value. MaxErrorRate should be much higher; otherwise the client will not send any requests to the node for the remainder of the ErrorRateWindow (1 second by default).

dAvagimyan commented 2 years ago

Thank you. I will try now.

dAvagimyan commented 2 years ago

It's just strange that when the RPS falls, the connections do not get a chance to close and instead continue to grow.

khaf commented 2 years ago

It is strange indeed. We will have to investigate further to find out what's causing this situation. What are your read and write policy timeout values?

dAvagimyan commented 2 years ago

I use default values.

dAvagimyan commented 2 years ago

I can post more charts if that would help.

dAvagimyan commented 2 years ago

[screenshot] Graph with your settings. The number of connections increases more slowly.

khaf commented 2 years ago

> I use default values.

That may be the issue. The default timeouts are WAY too lax; they are there just as starting points. The correct way to find the optimal timeouts is to benchmark your connection speeds, but as a starting point, for simple read, write, batch, and operate commands:

        TotalTimeout:        350 * time.Millisecond,
        SocketTimeout:       100 * time.Millisecond,
        MaxRetries:          2,
        SleepBetweenRetries: 10 * time.Millisecond,
        SleepMultiplier:     1.3,

For Scan and Query commands:

        TotalTimeout:        0,
        SocketTimeout:       500 * time.Millisecond,
        MaxRetries:          2,
        SleepBetweenRetries: 50 * time.Millisecond,
        SleepMultiplier:     1.2,
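A sketch of one way to apply these numbers, via the client's default policy objects (assuming the aero alias and client from the earlier sketch; DefaultPolicy covers reads, and writes would be set on client.DefaultWritePolicy the same way):

    // Single-record read defaults.
    client.DefaultPolicy.TotalTimeout = 350 * time.Millisecond
    client.DefaultPolicy.SocketTimeout = 100 * time.Millisecond
    client.DefaultPolicy.MaxRetries = 2
    client.DefaultPolicy.SleepBetweenRetries = 10 * time.Millisecond
    client.DefaultPolicy.SleepMultiplier = 1.3

    // Query defaults; scans would use client.DefaultScanPolicy analogously.
    client.DefaultQueryPolicy.TotalTimeout = 0 // no overall deadline
    client.DefaultQueryPolicy.SocketTimeout = 500 * time.Millisecond
    client.DefaultQueryPolicy.MaxRetries = 2
    client.DefaultQueryPolicy.SleepBetweenRetries = 50 * time.Millisecond
    client.DefaultQueryPolicy.SleepMultiplier = 1.2
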
dAvagimyan commented 2 years ago

OK, I will try it tomorrow. The strangest thing is that I have no such problems in other projects that use Aerospike.

dAvagimyan commented 2 years ago

I launched the services with the new values. I will keep watching and write back with the result, most likely on Monday.

dAvagimyan commented 2 years ago

[screenshot] The number of aerospike client goroutines grows every second.

khaf commented 2 years ago

The code you shared does not reflect the hot path you are showing here. The profile is from a Query. Can you share that code and let me know how many records it returns and how frequently it runs? Keep in mind that query/batch/scan commands use N (N = number of cluster nodes) goroutines and connections per run, since they run against each cluster node in parallel. The more of these you have, the more goroutines and connections you will have in flight.

dAvagimyan commented 2 years ago

I resolved the issue. A query resource was not being closed. It was my mistake. Thank you for the help.
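For anyone hitting the same symptom: the fix here was closing the query's resources. The actual code was only shared as a screenshot, so this is a reconstructed sketch of the leak and the fix (stmt and process are stand-ins):

    // Leak: iterating and bailing out without Close() keeps the per-node
    // goroutines and connections behind the query alive.
    rs, err := client.Query(nil, stmt)
    if err != nil {
        return err
    }
    for res := range rs.Results() {
        if res.Err != nil {
            break // early exit without rs.Close(): connections leak
        }
        process(res.Record)
    }

    // Fix: always close the Recordset, on every path.
    rs, err = client.Query(nil, stmt)
    if err != nil {
        return err
    }
    defer rs.Close() // releases the query's connections and goroutines
    for res := range rs.Results() {
        if res.Err != nil {
            return res.Err
        }
        process(res.Record)
    }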

khaf commented 2 years ago

No problem, glad to hear the issue was resolved. Can you write up what you changed to resolve it (including the bit of code), to help me learn from your experience and improve the docs in the future?

dAvagimyan commented 2 years ago

Of course, though honestly it's embarrassing to show: it was my inattention. [screenshot]