aerospike / aerospike-client-go

Aerospike Client Go
Apache License 2.0

BatchGet doesn't wait for BasePolicy.TotalTimeout #354


xqzhang2015 commented 3 years ago

Regarding waiting for a new connection:

https://github.com/aerospike/aerospike-client-go/blob/2a68420665951da79f541f4cdeba70ff38209ac3/command.go#L2079

Why doesn't BatchGet wait until BasePolicy.TotalTimeout?

khaf commented 3 years ago

retryBatch makes a new command with the same policy, and executes it, which will take care of the timeout logic.
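
As a minimal sketch of the idea (hypothetical names; not the actual retryBatch code): the retried batch is a brand-new command object that carries the old command's policy, so the same timeout settings apply to it.

    // Hypothetical sketch; names invented, not the library's code.
    package main

    import (
        "fmt"
        "time"
    )

    type policySketch struct {
        TotalTimeout time.Duration
        MaxRetries   int
    }

    type batchCmdSketch struct {
        policy *policySketch
    }

    // execute re-enters the normal execution path, where the policy's
    // TotalTimeout/MaxRetries are applied again.
    func (c *batchCmdSketch) execute() error {
        fmt.Printf("executing with TotalTimeout=%v MaxRetries=%d\n",
            c.policy.TotalTimeout, c.policy.MaxRetries)
        return nil
    }

    // retryBatchSketch mirrors the idea above: the retry is a fresh
    // command that shares the original command's policy.
    func retryBatchSketch(old *batchCmdSketch) error {
        fresh := &batchCmdSketch{policy: old.policy}
        return fresh.execute()
    }

    func main() {
        cmd := &batchCmdSketch{policy: &policySketch{
            TotalTimeout: 50 * time.Millisecond,
            MaxRetries:   2,
        }}
        _ = retryBatchSketch(cmd)
    }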

xqzhang2015 commented 3 years ago

@khaf thanks for your quick reply.

For batchCommand.Execute(), the iterations argument of executeAt() is -1. But for the first call of retryBatch(), iterations will be 0, and it will keep increasing on subsequent retryBatch() calls. It seems this snippet doesn't work for batch retries:

    cmd.conn, err = ifc.getConnection(policy)
    if err != nil {
        isClientTimeout = true

        if err == ErrConnectionPoolEmpty {
            // if the connection pool is empty, we still haven't tried
            // the transaction to increase the iteration count.
            iterations--
        }
        Logger.Debug("Node " + cmd.node.String() + ": " + err.Error())
        continue
    }
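
To make the concern concrete, here is a heavily simplified, hypothetical sketch (invented names; not the actual library code) of a retry loop bounded by both an iteration budget and a deadline. Once the budget is spent, the call returns well before TotalTimeout:

    // Hypothetical, heavily simplified sketch; not the library's code.
    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    var errPoolEmpty = errors.New("connection pool is empty")

    // tryOnce stands in for "get a connection and send the transaction";
    // here it always fails with an empty pool, like in the log below.
    func tryOnce() error { return errPoolEmpty }

    func executeSketch(totalTimeout time.Duration, maxRetries int, sleep time.Duration) error {
        deadline := time.Now().Add(totalTimeout)
        for iterations := 0; ; iterations++ {
            // Stop condition 1: the retry budget is spent. This can
            // fire long before TotalTimeout elapses.
            if iterations > maxRetries {
                return fmt.Errorf("retries exhausted after %d attempts", iterations)
            }
            // Stop condition 2: the TotalTimeout deadline has passed.
            if time.Now().After(deadline) {
                return errors.New("total timeout elapsed")
            }
            if err := tryOnce(); err != nil {
                time.Sleep(sleep) // SleepBetweenRetries
                continue
            }
            return nil
        }
    }

    func main() {
        // 2 retries with 1ms sleeps give up ~3ms in, far short of 50ms.
        fmt.Println(executeSketch(50*time.Millisecond, 2, time.Millisecond))
    }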
khaf commented 3 years ago

It does. In the new command, a new node will be selected (you never know why the connection pool is empty, but it is mostly due to a faulty node), so the command will be retried on a new node. This logic, of course, does not retry on the same node.

xqzhang2015 commented 3 years ago

@khaf The empty connection pool in our case comes from not maintaining a minimum idle connection pool, due to a historical limitation. So a new BatchGet after a long idle period always has to wait for a connection. Beyond that, burst traffic always has to wait for new connections.

The question is that the new BatchGet doesn't wait for BasePolicy.TotalTimeout; see the debug log below. Is there a way to make BatchGet wait until TotalTimeout?

I0607 07:36:39.613429   14051 command.go:1872] [aerospike] Node BB955E61AF14F0A 10.x.y.1:3000: Connection pool is empty. This happens when either all connection are in-use already, or no connections were available

I0607 07:36:39.614553   14051 command.go:1872] [aerospike] Node BB9D3C58669050A 10.x.y.2:3000: Connection pool is empty. This happens when either all connection are in-use already, or no connections were available

I0607 07:36:39.615713 14051 command.go:1872] [aerospike] Node BB955E61AF14F0A 10.x.y.1:3000: Connection pool is empty. This happens when either all connection are in-use already, or no connections were available

I0607 07:36:39.616863 14051 ... Get response: records=[nil]

As for a new node being selected in this case: the replica node will be selected for the retry, right?

Aerospike client version: 3.1.1; the latest code on the master branch appears to have the same logic.

xqzhang2015 commented 3 years ago

@khaf any comment?

khaf commented 3 years ago

Sorry, I missed your comment. While a new replica should be selected, the problem is that if that replica is also connection-starved, you'll end up with errors. I'll have to think about how to deal with this situation and will get back to you later this week.

P.S. Why are you so worried about using MinIdleConns?

xqzhang2015 commented 3 years ago

@khaf thanks for your comment.

MinIdleConns is a great feature, but for historical reasons there are multiple clients per app for the same cluster, and there are thousands of app instances; if each maintained a proper MinIdleConns, the cluster couldn't support that many connections. BTW, refactoring my app can't be done in the short term because of some limitations.

Most importantly, even with MinIdleConns enabled, the connections would NOT be enough for traffic peaks (10x normal), because there are so many app instances.

xqzhang2015 commented 3 years ago

@khaf any update from your side? I plan to increase BasePolicy.MaxRetries and SleepBetweenRetries to mitigate the waiting-time issue.

khaf commented 3 years ago

I'm still working on this, trying multiple different approaches. The current workaround would be increasing SleepBetweenRetries, but I hope I can get to a better solution.

khaf commented 3 years ago

After working on this issue for a few days, I have come to the conclusion that, for now, the most scalable way to handle this use case is to tweak the values of the following policy attributes:

    .MaxRetries
    .SleepBetweenRetries
    .TotalTimeout
    .SocketTimeout

Any other solution would add a bottleneck for a different use case. The retry mechanism should handle these cases well, given enough time and retries. In the future, I may introduce a policy selector that would allow waiting for inline (and synchronous) creation of a new connection to cover these use cases.
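
For illustration, such tuning might look like the sketch below. The four field names are the ones listed above; the values and the import path are illustrative only and may differ for your client major version (use the /vN module path matching your version):

    // Illustrative only: tune the four attributes above for BatchGet.
    // Adjust the import path (/v5, /v6, ...) to your client version.
    package main

    import (
        "log"
        "time"

        as "github.com/aerospike/aerospike-client-go"
    )

    func main() {
        client, err := as.NewClient("10.x.y.1", 3000)
        if err != nil {
            log.Fatal(err)
        }
        defer client.Close()

        policy := as.NewBatchPolicy()
        policy.MaxRetries = 10                             // more attempts before giving up
        policy.SleepBetweenRetries = 50 * time.Millisecond // give the pool time to free a connection
        policy.TotalTimeout = 2 * time.Second              // overall budget across all retries
        policy.SocketTimeout = 200 * time.Millisecond      // per-attempt network timeout

        var keys []*as.Key
        for _, id := range []string{"a", "b", "c"} {
            key, err := as.NewKey("test", "demo", id)
            if err != nil {
                log.Fatal(err)
            }
            keys = append(keys, key)
        }

        records, err := client.BatchGet(policy, keys)
        if err != nil {
            log.Fatal(err)
        }
        log.Printf("got %d records", len(records))
    }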

xqzhang2015 commented 3 years ago

@khaf Sounds good, thanks for the work.