We are not able to take one of our POCs to production.

Environment

AWS EC2 gRPC server with Aerospike as the backend. The Aerospike server uses SSD storage.

client co-ordinates

com.aerospike:aerospike-client:4.0.6

Problem

We are seeing occasional outliers that bump the latency of our requests up to 210-250 ms. I debugged further to rule out GC (stop-the-world) pauses. We use G1GC with SurvivorRatio 10 and a young generation sized between 1 GB and 4 GB, which leaves Eden enough space. No full GCs are happening. Many minor GCs were happening because of a high allocation rate of 1.79 GB/s, so we made some code changes and brought it down to 277 MB/s; minor GCs now appear to be reduced.

We are still seeing high latency on occasional requests. Further debugging suggests that most of the latency comes from the client.get API call, for both the sync and async clients.

I profiled the sync API calls with VisualVM, and most of the CPU time appears to be spent in the SocketInputStream read method: our thread pool thread spends 94% of its time in the read call. Could you shed some light on this? I also saw that the client sets TCP_NODELAY to true, which should eliminate TCP buffering delays, yet we still see latency spikes.

Our system has a strict SLA of 50-100 ms, which is the main reason we cannot go to production. Any help would be appreciated.

Let me know if you need any other information.

Thanks, Pradeep

ferrari6666 commented 5 years ago

@BrianNichols Any help would be appreciated.

BrianNichols commented 5 years ago

I did visualVM profile for sync api call's and looks like most of the CPU time is spend in SocketInputStream read method call.

This is normal because VisualVM counts time spent waiting for network events as CPU usage.

https://discuss.aerospike.com/t/cpu-analysis-of-java-client-seems-to-indicate-abormally-high-usage/2786

VisualVM has no way of knowing whether an OS kernel call is in a wait state or not, so it reports time spent in the kernel call as pure CPU usage.
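One way to check this yourself is to compare wall-clock time with the CPU time the JVM attributes to the thread. The following is only a minimal sketch: the class name `WaitVsCpu` and the `measure` helper are hypothetical, and `Thread.sleep` stands in for a blocking socket read. It assumes the JVM supports per-thread CPU time measurement through `ThreadMXBean`.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of the Aerospike client.
public class WaitVsCpu {

    /** Runs the task on the current thread and returns {wallNanos, cpuNanos}. */
    public static long[] measure(Runnable task) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // CPU time actually consumed by this thread (-1 if measurement is disabled).
        long cpuStart = threads.getCurrentThreadCpuTime();
        long wallStart = System.nanoTime();
        task.run();
        long wall = System.nanoTime() - wallStart;
        long cpu = threads.getCurrentThreadCpuTime() - cpuStart;
        return new long[] { wall, cpu };
    }

    public static void main(String[] args) {
        // Assumption: sleeping approximates a thread blocked in a socket read.
        long[] result = measure(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.printf("wall=%d ms, cpu=%d ms%n",
                result[0] / 1_000_000, result[1] / 1_000_000);
    }
}
```

If the wall-clock figure is far larger than the CPU figure around a blocking call, the thread was mostly waiting rather than computing, even though a sampling profiler may show the time under the read method.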

ferrari6666 commented 5 years ago

@BrianNichols Thanks for the reply. We are seeing high network latency from server to client, about 206 ms. Do you think this is purely a network or OS bottleneck? Or could it be Aerospike server-level latency? The server histograms never showed any request going beyond 1 ms for either single reads or batch reads, and even disk reads are fast.

BrianNichols commented 5 years ago

Aerospike Server versions < 4.4 measure latency from the time the socket request is fully received until the socket response is complete. Aerospike Server versions >= 4.4 measure latency from the beginning of the socket request receive until the socket response is complete. Therefore, Aerospike Server versions >= 4.4 will report higher latencies for requests (usually writes) with a large number of bytes.

Note that "socket response complete" just means the socket response bytes were transferred to the OS TCP buffers. It does not imply that the client has actually received those bytes yet.

Client latencies include network latency, server latency and client CPU usage. Server latencies only include the time the server spent processing the request.
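To compare the two views, it can help to record client-side latency around each call and look at the tail percentiles next to the server histograms. The sketch below is an illustration only: `LatencyRecorder` and its methods are hypothetical names, it uses a simple nearest-rank percentile over a fixed-size sample buffer, and it drops samples once the buffer is full.

```java
import java.util.Arrays;
import java.util.function.Supplier;

// Hypothetical helper, not part of the Aerospike client.
public class LatencyRecorder {
    private final long[] samples;
    private int count;

    public LatencyRecorder(int capacity) {
        this.samples = new long[capacity];
    }

    /** Stores one latency sample in nanoseconds; ignored once the buffer is full. */
    public synchronized void record(long nanos) {
        if (count < samples.length) {
            samples[count++] = nanos;
        }
    }

    /** Times a call end to end as the client sees it and returns its result. */
    public <T> T time(Supplier<T> call) {
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            record(System.nanoTime() - start);
        }
    }

    public synchronized int size() {
        return count;
    }

    /** Nearest-rank percentile, with p in (0, 100]. */
    public synchronized long percentile(double p) {
        if (count == 0) {
            throw new IllegalStateException("no samples recorded");
        }
        long[] sorted = Arrays.copyOf(samples, count);
        Arrays.sort(sorted);
        int index = (int) Math.ceil(p / 100.0 * count) - 1;
        return sorted[Math.max(0, Math.min(count - 1, index))];
    }
}
```

A call could then be wrapped as `recorder.time(() -> client.get(policy, key))`, assuming `client`, `policy` and `key` already exist in the application. A large gap between the client's tail percentiles and the server-reported latency points at the network or the client host rather than the server.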

In our experience, cloud machines are usually more network bound than CPU bound. Cloud applications are also much more prone to latency spikes than bare metal machines. If you are an enterprise customer, I suggest opening a case with Aerospike support.

aerospike / aerospike-client-java

Aerospike client async or sync api's latency jitters #145
