aerospike / aerospike-client-csharp

Aerospike C# Client Library
70 stars 47 forks source link

System.IndexOutOfRangeException in AsyncMultiExecutor -> ExecuteBatchRetry method. #50

Closed farmerau closed 4 years ago

farmerau commented 4 years ago

It appears there might be an off-by-one error in ExecuteBatchRetry. https://github.com/aerospike/aerospike-client-csharp/blob/575f466d3431c6bc4389b6a8d92f1dd635a5d3a5/Framework/AerospikeClient/Async/AsyncMultiExecutor.cs#L42

The math is correct, as long as cmd != orig evaluates to false at least once. However, in a simplified code example, we can easily prove that we'll always go out of bounds if it doesn't:

int[] commands = new[] {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
int[] cmds = new[]{11, 12, 13, 14, 15,};

int[] target = new int[commands.Length + cmds.Length - 1];
int count = 0;
foreach(int cmd in commands){
    target[count++] = cmd;
}

foreach(int cmd in cmds){
    target[count++] = cmd;
}

Screenshot from LINQPad execution: image

Ran into this issue sporadically using the AsyncClient, and was able to prevent the issue from persisting by setting retries to 0.

I assume the fix would be to not perform the subtraction operation from the sum of the lengths-- unless there's pretty heavy reasoning that cmd == true should evaluate to true at some point. Happy to submit a PR.

Edit: An alternative is to verify the equality checking between the command objects (== vs Equals)

BrianNichols commented 4 years ago

Async batch reads use AsyncMultiExecutor to execute batch commands to each node. If one node command fails, that command has to exist in AsyncMultiExecutor. When that command splits into new retry commands, the assumption is definitely relied on.

I suggest implementing your version of the fix and see if it fixes your problem. If so, issue a PR. I would also like understand the scenario in which the original command doesn't exist in AsyncMultiExecutor.

farmerau commented 4 years ago

@BrianNichols Thanks for getting back to me. I'll try to see what I can work on after work tonight to identify a scenario where the original command doesn't exist in the list.

farmerau commented 4 years ago

Hi @BrianNichols ! I've had a hard time reproducing this, primarily because it's difficult for me to get Aerospike to timeout (it's too fast!). Even with a 1ms socket timeout, it's difficult to reproduce.

I'm on a ticket with enterprise support for a related but different issue, and on a change in configuration was able to get the error to happen again (finally) only once.

Do you know of a way to force the socket timeout to fire for testing? I'd love to get this in a consistently reproducible fashion so that we can push this forward.

EDIT: I'm able to reproduce the timeouts by using Clumsy for Windows. I'm able to consistently get this error in my application, but I can't reproduce it in this repo.

farmerau commented 4 years ago

I'm trying to reproduce it in this repo, but so far am only able to do so within my application. Here's a sample stacktrace:

System.IndexOutOfRangeException
  HResult=0x80131508
  Message=Index was outside the bounds of the array.
  Source=AerospikeClient
  StackTrace:
   at Aerospike.Client.AsyncMultiExecutor.ExecuteBatchRetry(AsyncMultiCommand[] cmds, AsyncMultiCommand orig)
   at Aerospike.Client.AsyncBatchCommand.RetryBatch()
   at Aerospike.Client.AsyncCommand.Retry(AerospikeException ae)
   at Aerospike.Client.AsyncCommand.AlreadyCompleted(Int32 status)
   at Aerospike.Client.AsyncCommand.ConnectionFailed(AerospikeException ae)
   at Aerospike.Client.AsyncCommand.EventHandlers.HandleSocketEvent(Object sender, SocketAsyncEventArgs args)
   at System.Net.Sockets.SocketAsyncEventArgs.OnCompleted(SocketAsyncEventArgs e) in /_/src/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.cs:line 204
   at System.Net.Sockets.SocketAsyncEventArgs.ExecutionCallback(Object state) in /_/src/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.cs:line 424
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) in /_/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs:line 201
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() in /_/src/System.Private.CoreLib/shared/System/Runtime/ExceptionServices/ExceptionDispatchInfo.cs:line 63
   at System.Net.Sockets.SocketAsyncEventArgs.HandleCompletionPortCallbackError(UInt32 errorCode, UInt32 numBytes, NativeOverlapped* nativeOverlapped) in /_/src/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.Windows.cs:line 1297
   at System.Net.Sockets.SocketAsyncEventArgs.<>c.<.cctor>b__177_0(UInt32 errorCode, UInt32 numBytes, NativeOverlapped* nativeOverlapped) in /_/src/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.Windows.cs:line 1260
   at System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32 errorCode, UInt32 numBytes, NativeOverlapped* pNativeOverlapped) in /_/src/System.Private.CoreLib/src/System/Threading/Overlapped.cs:line 80

  This exception was originally thrown at this call stack:
    Aerospike.Client.AsyncMultiExecutor.ExecuteBatchRetry(Aerospike.Client.AsyncMultiCommand[], Aerospike.Client.AsyncMultiCommand)
    Aerospike.Client.AsyncBatchCommand.RetryBatch()
    Aerospike.Client.AsyncCommand.Retry(Aerospike.Client.AerospikeException)
    Aerospike.Client.AsyncCommand.AlreadyCompleted(int)
    Aerospike.Client.AsyncCommand.ConnectionFailed(Aerospike.Client.AerospikeException)
    Aerospike.Client.AsyncCommand.EventHandlers.HandleSocketEvent(object, System.Net.Sockets.SocketAsyncEventArgs)
    System.Net.Sockets.SocketAsyncEventArgs.OnCompleted(System.Net.Sockets.SocketAsyncEventArgs) in SocketAsyncEventArgs.cs
    System.Net.Sockets.SocketAsyncEventArgs.ExecutionCallback(object) in SocketAsyncEventArgs.cs
    System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, object) in ExecutionContext.cs
    System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() in ExceptionDispatchInfo.cs
    ...
    [Call Stack Truncated]
BrianNichols commented 4 years ago

It turns out that the "array out of bounds" error is a symptom of corruption that occurs when multiple async batch node commands fail and are simultaneously retried. ExecuteBatchRetry() does not perform commands array adjustment under lock which is a bug. I will be working on a fix tomorrow.

BrianNichols commented 4 years ago

C# client 3.9.8 is released:

https://www.aerospike.com/download/client/csharp/notes.html#3.9.8

Please try this release and let us know if it fixes the problem.

farmerau commented 4 years ago

Thanks, Brian! I’ll test this tomorrow morning.

farmerau commented 4 years ago

@BrianNichols I can confirm that this now works as expected. Thanks for your quick work on this-- it's greatly appreciated.