aerospike / aerospike-client-csharp

Aerospike C# Client Library

Intermittent Timeout Exceptions after upgrading to Aerospike (5.6->6.0) and client (4.7.2->5.3.0) #74

Open kuskmen opened 1 year ago

kuskmen commented 1 year ago

Hello. After we upgraded the server and client to the versions above, we started seeing intermittent Aerospike.Timeout errors on random write operations, ranging from full record creates to a simple update of an int (boolean) bin. Most, if not all, settings are at their defaults on both the server and the client.

The client throws Timeout exceptions, but the server metrics in Grafana show no sign of them, and the server logs contain only info-level entries.

We've gone through all the suggestions in https://support.aerospike.com/s/article/Warning-write-fail-queue-too-deep looking for answers, but to no avail.

Most of the time when a timeout occurs, the server has this log entry: https://docs.aerospike.com/reference/server-log#1663663594

With all that said, could the client have timeout issues in 5.3.0 as well, given that we see no indication from the server of requests timing out?

The client configuration is also pretty straightforward:

return services.AddSingleton<IAsyncClient>(
    new AsyncClient(new AsyncClientPolicy
    {
        asyncMaxCommandAction = MaxCommandAction.DELAY,
    }, hosts));
}

Everything else is left at the default.

BrianNichols commented 1 year ago

The server log only shows timeouts that occurred on the server side (from receiving the command to sending the response). The client timeout monitors the full round trip, from sending the command to receiving the response. In the great majority of cases, the client initiates the timeout. The TimeoutException message starts with either "Client timeout" or "Server timeout".
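
To make the client/server distinction easier to see in application logs, here is a minimal sketch that classifies a timeout by the prefix of its message. `TimeoutOrigin` and `TimeoutClassifier` are hypothetical helper names, not part of the Aerospike client. The only thing it relies on is the statement above that the message starts with "Client timeout" or "Server timeout".

```csharp
using System;

// Hypothetical helper types for illustration; not part of the Aerospike client library.
public enum TimeoutOrigin { Client, Server, Unknown }

public static class TimeoutClassifier
{
    // Classifies a timeout by the message prefix described in this thread.
    // Any other message (or a null/empty one) is reported as Unknown.
    public static TimeoutOrigin Classify(string message)
    {
        if (string.IsNullOrEmpty(message))
        {
            return TimeoutOrigin.Unknown;
        }
        if (message.StartsWith("Client timeout", StringComparison.Ordinal))
        {
            return TimeoutOrigin.Client;
        }
        if (message.StartsWith("Server timeout", StringComparison.Ordinal))
        {
            return TimeoutOrigin.Server;
        }
        return TimeoutOrigin.Unknown;
    }
}
```

One possible use is to call `TimeoutClassifier.Classify(ex.Message)` in the catch block around a write and log the result, assuming the exception's `Message` carries the text described above. If nearly every failure comes back as `Client`, that would be consistent with the server logs showing nothing.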

I'm not aware of any premature timeout issues with the latest C# client. There is one outstanding latency issue, but it only applies when LDAP servers are included in the Aerospike server configuration.

I suggest opening an Enterprise support case for this issue.

BrianNichols commented 1 year ago

I have recently learned that in server 6.0 there can be performance degradation for queries that do not return much data. The reason is that server 6.0 switched to the new partition-based query protocol for clients that support it. The old query protocol may have lower latency, but it could return duplicate records or fail to return records when the cluster is in migration. The new partition-based query protocol eliminates the duplicate and missing records, but may result in higher latency. This applies to queries only, so it might not be applicable to your case.

kuskmen commented 1 year ago

Just to draw further attention to this issue: a different team at our company is experiencing the same problem with the Node.js client on a completely different server setup (a brand-new Aerospike deployment in K8s, still 6.0+). So it's beginning to feel more and more like a server issue.