Closed farmerau closed 4 years ago
Async batch reads use AsyncMultiExecutor to execute batch commands to each node. If one node command fails, that command has to exist in AsyncMultiExecutor. When that command splits into new retry commands, the assumption is definitely relied on.
I suggest implementing your version of the fix and see if it fixes your problem. If so, issue a PR. I would also like understand the scenario in which the original command doesn't exist in AsyncMultiExecutor.
@BrianNichols Thanks for getting back to me. I'll try to see what I can work on after work tonight to identify a scenario where the original command doesn't exist in the list.
Hi @BrianNichols ! I've had a hard time reproducing this, primarily because it's difficult for me to get Aerospike to timeout (it's too fast!). Even with a 1ms socket timeout, it's difficult to reproduce.
I'm on a ticket with enterprise support for a related but different issue, and on a change in configuration was able to get the error to happen again (finally) only once.
Do you know of a way to force the socket timeout to fire for testing? I'd love to get this in a consistently reproducible fashion so that we can push this forward.
EDIT: I'm able to reproduce the timeouts by using Clumsy for Windows. I'm able to consistently get this error in my application, but I can't reproduce it in this repo.
I'm trying to reproduce it in this repo, but so far am only able to do so within my application. Here's a sample stacktrace:
System.IndexOutOfRangeException
HResult=0x80131508
Message=Index was outside the bounds of the array.
Source=AerospikeClient
StackTrace:
at Aerospike.Client.AsyncMultiExecutor.ExecuteBatchRetry(AsyncMultiCommand[] cmds, AsyncMultiCommand orig)
at Aerospike.Client.AsyncBatchCommand.RetryBatch()
at Aerospike.Client.AsyncCommand.Retry(AerospikeException ae)
at Aerospike.Client.AsyncCommand.AlreadyCompleted(Int32 status)
at Aerospike.Client.AsyncCommand.ConnectionFailed(AerospikeException ae)
at Aerospike.Client.AsyncCommand.EventHandlers.HandleSocketEvent(Object sender, SocketAsyncEventArgs args)
at System.Net.Sockets.SocketAsyncEventArgs.OnCompleted(SocketAsyncEventArgs e) in /_/src/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.cs:line 204
at System.Net.Sockets.SocketAsyncEventArgs.ExecutionCallback(Object state) in /_/src/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.cs:line 424
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) in /_/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs:line 201
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() in /_/src/System.Private.CoreLib/shared/System/Runtime/ExceptionServices/ExceptionDispatchInfo.cs:line 63
at System.Net.Sockets.SocketAsyncEventArgs.HandleCompletionPortCallbackError(UInt32 errorCode, UInt32 numBytes, NativeOverlapped* nativeOverlapped) in /_/src/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.Windows.cs:line 1297
at System.Net.Sockets.SocketAsyncEventArgs.<>c.<.cctor>b__177_0(UInt32 errorCode, UInt32 numBytes, NativeOverlapped* nativeOverlapped) in /_/src/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.Windows.cs:line 1260
at System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32 errorCode, UInt32 numBytes, NativeOverlapped* pNativeOverlapped) in /_/src/System.Private.CoreLib/src/System/Threading/Overlapped.cs:line 80
This exception was originally thrown at this call stack:
Aerospike.Client.AsyncMultiExecutor.ExecuteBatchRetry(Aerospike.Client.AsyncMultiCommand[], Aerospike.Client.AsyncMultiCommand)
Aerospike.Client.AsyncBatchCommand.RetryBatch()
Aerospike.Client.AsyncCommand.Retry(Aerospike.Client.AerospikeException)
Aerospike.Client.AsyncCommand.AlreadyCompleted(int)
Aerospike.Client.AsyncCommand.ConnectionFailed(Aerospike.Client.AerospikeException)
Aerospike.Client.AsyncCommand.EventHandlers.HandleSocketEvent(object, System.Net.Sockets.SocketAsyncEventArgs)
System.Net.Sockets.SocketAsyncEventArgs.OnCompleted(System.Net.Sockets.SocketAsyncEventArgs) in SocketAsyncEventArgs.cs
System.Net.Sockets.SocketAsyncEventArgs.ExecutionCallback(object) in SocketAsyncEventArgs.cs
System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, object) in ExecutionContext.cs
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() in ExceptionDispatchInfo.cs
...
[Call Stack Truncated]
It turns out that the "array out of bounds" error is a symptom of corruption that occurs when multiple async batch node commands fail and are simultaneously retried. ExecuteBatchRetry() does not perform commands array adjustment under lock which is a bug. I will be working on a fix tomorrow.
C# client 3.9.8 is released:
https://www.aerospike.com/download/client/csharp/notes.html#3.9.8
Please try this release and let us know if it fixes the problem.
Thanks, Brian! I’ll test this tomorrow morning.
@BrianNichols I can confirm that this now works as expected. Thanks for your quick work on this-- it's greatly appreciated.
It appears there might be an off-by-one error in
ExecuteBatchRetry
. https://github.com/aerospike/aerospike-client-csharp/blob/575f466d3431c6bc4389b6a8d92f1dd635a5d3a5/Framework/AerospikeClient/Async/AsyncMultiExecutor.cs#L42The math is correct, as long as
cmd != orig
evaluates to false at least once. However, in a simplified code example, we can easily prove that we'll always go out of bounds if it doesn't:Screenshot from LINQPad execution:
Ran into this issue sporadically using the AsyncClient, and was able to prevent the issue from persisting by setting retries to 0.
I assume the fix would be to not perform the subtraction operation from the sum of the lengths-- unless there's pretty heavy reasoning that
cmd == true
should evaluate to true at some point. Happy to submit a PR.Edit: An alternative is to verify the equality checking between the command objects (== vs Equals)