Open xqzhang2015 opened 3 years ago
retryBatch
makes a new command with the same policy, and executes it, which will take care of the timeout logic.
@khaf thanks for your quick reply.
For batchCommand.Execute(), the iterations
arg of executeAt()
is -1.
But for the 1st call of retryBatch()
, the iterations
will be 0. And for the following retryBatch()
, the iterations wil increase too. It seems this snippet doesn't work for batch retry.
cmd.conn, err = ifc.getConnection(policy)
if err != nil {
isClientTimeout = true
if err == ErrConnectionPoolEmpty {
// if the connection pool is empty, we still haven't tried
// the transaction to increase the iteration count.
iterations--
}
Logger.Debug("Node " + cmd.node.String() + ": " + err.Error())
continue
}
It does. In the new commands, a new node will be selected (you never know why connection pool is empty, but it should mostly be due to a faulty node). The new command will be retried on a new node. This logic of course does not retry on the same node.
@khaf The case for empty connection pool here is that we don't maintain min idle connection pool, because of historical limitation. So the new BatchGet after long time always needs waiting for a connection. Besides that, burst traffic always needs waiting for new connections.
The question is new BatchGet doesn't wait for the BasePolicy.TotalTimeout. Here is the debug log below. So is there a way to wait until TotalTimeout for BatchGet?
I0607 07:36:39.613429 14051 command.go:1872] [aerospike] Node BB955E61AF14F0A 10.x.y.1:3000: Connection pool is empty. This happens when either all connection are in-use already, or no connections were available
I0607 07:36:39.614553 14051 command.go:1872] [aerospike] Node BB9D3C58669050A 10.x.y.2:3000: Connection pool is empty. This happens when either all connection are in-use already, or no connections were available
I0607 07:36:39.615713 14051 command.go:1872] [aerospike] Node BB955E61AF14F0A 10.x.y.1:3000: Connection pool is empty. This happens when either all connection are in-use already, or no connections were available
I0607 07:36:39.616863 14051 ... Get response: records=[nil]
For a new node will be selected
in this case, the replica node will be selected and to retry, right?
aerospike client version: 3.1.1, and it seems latest code in master branch has the same logic.
@khaf any comment?
Sorry I missed your comment. While a new replica should be selected, the problem is that if that replica is also connection starved, you'll end up with errors. I'll have to think about how to deal with this situation. Will get back to you later this week.
ps. Why are you so worried about using MinIdleConns?
@khaf thanks for your comment.
MinIdleConns is great feature. Because of historical reasons, there are multiple clients in one app for the same cluster and there are thousands of app instances. If maintaing a proper MinIdleConns, the cluster can't support so many connections. BTW, the refine for my app can't be done in short term because of some limitations.
Most import, even enabling MinIdleConns, the conns will NOT be enough for traffic peak(10X compared to normal), because of so many app instances.
@khaf any update from your side? I plan to increase the BasePolicy.MaxRetries and SleepBetweenRetries to mitigate the waiting time issue.
I'm still working on this, trying multiple different approaches. The current workaround would be increasing SleepBetweenRetries
, but I hope I can get to a better solution.
After working on this issue for a few days, I have come to the conclusion that the best scalable way to handle this use case for now would be tweaking the value of the following policy attributes:
.MaxRetries
.SleepBetweenRetries
.TotalTimeout
.SocketTimeout
Any other solution adds a bottleneck for a different use case. The retry mechanism should handle these cases well, given enough time and retries. I may introduce a policy selector in the future that would allow to wait for inline (and synchronous) creation of a new connection to handle these use cases.
@khaf sounds good. thanks for the work
For waiting for new connection,
https://github.com/aerospike/aerospike-client-go/blob/2a68420665951da79f541f4cdeba70ff38209ac3/command.go#L2079
For BatchGet, why doesn't it wait until BasePolicy.TotalTimeout?