aerospike / aerospike-client-java

Aerospike Java Client Library
Other
236 stars 212 forks source link

Async client fails on out of connections #136

Closed kptfh closed 3 years ago

kptfh commented 5 years ago

Here is the simple OutOfConnectionTest test that shows that async client fails when it runs out of connections even if retry count is relatively large number.

After it get fixed I'd prefer to go further and do not treat OutOfconnection case as error but rather as normal case and silently retry it (put to the end of queue) despite maxRetries parameters value.

BrianNichols commented 5 years ago

Java client 4.4.1 now retries when the maxConnsPerNode limit is reached and maxRetries has not been reached. This will not eliminate the NO_MORE_CONNECTIONS error because it's identified very quickly and only delayed slightly by placing the retry at the end of the executing queue.

The problem is made worse because the code is only issuing writes. Writes will retry to the same master node (which just exhausted all connections). Reads, however, will retry to the prole node (assuming default replica policy SEQUENCE and server replication factor >= 2) which may have available connections.

maxConnsPerNode and the resulting NO_MORE_CONNECTIONS error is designed to prevent excess resource usage. Retrying indefinitely can cause the async queue to become too large where the application eventually runs out of memory.

The are two main resource limits when in async mode, "ClientPolicy.maxConnsPerNode" and "EventPolicy.maxCommandsInProcess". maxConnsPerNode can be rendered irrelevant by setting its value to the maximum number of commands allowed (maxCommandsInProcess) in each event loop (one connection per command).

EventPolicy eventPolicy = new EventPolicy();
eventPolicy.maxCommandsInProcess = 100;  // max 100 commands per event loop.
eventPolicy.maxCommandsInQueue = 0;      // unbounded queue

int eventLoopSize = 5;
EventLoops eventLoops = new NioEventLoops(eventPolicy, eventLoopSize);

ClientPolicy clientPolicy = new ClientPolicy();
clientPolicy.eventLoops = eventLoops;
clientPolicy.maxConnsPerNode = eventPolicy.maxCommandsInProcess * eventLoopSize;
kptfh commented 5 years ago

Thanks for the answer with a lot of details. The main question is memory and resource consumption. The amount of "maxCommandsInProcess" may be relatively high (for example 1000) in case of average sized commands. Is it OK to have large number in "maxConnsPerNode" (the same 1000)? Actually a lot of commands in queue is not a normal case but rather exceptional spikes. So it doesn't make sense to keep always a large number of connection that will waste most of the time.

BrianNichols commented 5 years ago

maxConnsPerNode does not preallocate/reserve any connections, so it's not an indicator of how many connections are actually being used.

maxConnsPerNode is total connections per node. maxCommandsInProcess is total commands (or connections) per event loop. To use a single limit, set "maxConnsPerNode = maxCommandsInProcess * eventLoopSize". In this case, maxConnsPerNode will only be reached if all commands are directed at the same node. If there is relatively even node distribution, the actual number of connections used per node will be significantly less.

Yes, it's okay to have a large maxConnsPerNode as long as your server nodes can handle the max connection load. Remember that many clients may be connecting to the same server nodes.

Java client versions >= 4.3.1 are much better at trimming idle connections, so exceptional spikes no longer leave excess connections for extended periods of time (> 2 minutes).

kptfh commented 5 years ago

Tried the option with large maxConnsPerNode. Got timeouts on spikes. Seems that node can't handle this load. So will need to add retries on each aerospike client call to be able to handle spikes. This will spoil code with boilerplate retries. As for me the best solution: client handles this case with retries keeping the number of connections relatively low.

barkanido commented 5 years ago

@kptfh we set a very strict (and low) maxCommandsInProcess. We also set a limit on the async queue, since an unbound queue is never a good idea (using maxCommandsInQueue). Then, if we get an AerospikeException.AsyncQueueFull exception, we retry with a backoff. Ideally, the queue would be a blocking queue (which blocks the calling thread when it is full), or at least the client will allow you to create such a queue as an option, but currently, it does not.

This is only true if you can backpressure your database calls.

kptfh commented 5 years ago

@barkanido Thanks for your comment. Fully agree with your concerns but it doesn't explain why we should fail with exception when maxConnsPerNode reached. My proposal it to place command to queue tail with backoff or without (using maxCommandsInQueue I will be able to customize it's size). This approach will allow us to prevent application crashing on spikes.

BrianNichols commented 5 years ago

@kptfh maxConnsPerNode exists because some customers want a strict limit on the number of connections to each node. If you don't require a strict limit to each node, you can eliminate maxConnsPerNode errors using the example code I provided in my initial response in this thread.

If this technique is used, it's important to limit the total maximum number of commands to the lowest number that will still utilize your full network bandwidth. Otherwise, those excess commands (and thus connections) will degrade latency.

total max commands = maxCommandsInProcess * eventLoopSize

Increasing event loops on machines with a large number of cpu cores will allow maxCommandsInProcess to be reduced on each event loop with the total max commands remaining constant.