Measuring connection bytes read/written

gmaz42 commented 3 years ago

I am noticing a problem with reading objects that might be large (spanning 1-2 megabytes) when setting a deadline on the time allowed to retrieve them.

Unfortunately they are an exception rather than the rule, so I was wondering: would it be possible to get information about the read/written connection bytes from a Get* operation?

I think it is not currently possible but if you are willing to accept a PR for it I could add a new set of methods like GetObjectWithStats?

gmaz42 commented 3 years ago

@khaf please let me know what you think about this, I might start a PR.

khaf commented 3 years ago

Sorry I thought I had made a response, but apparently I hadn't posted it. Historically, we have been hesitant to add instrumentation to the client since it can affect performance. We are open to ideas, but at this moment we can not promise anything concrete until we make sure we can do it in a way that does not incur any performance penalties.

gmaz42 commented 2 years ago

I understand; I would also not be happy with writing an implementation that brings performance penalties.

I think this could be done in 2 ways:

a wrapper around the connection objects that simply records I/O bytes; this might already be possible but would not give context of which commands performed the I/O
a "stats" object used by every internal command that tracks the number of I/O bytes

I do not think that (2) would incur in (measurable) performance penalties because the measurement is already happening (each write/read on a connection returns the number of bytes), it is simply not accounted for.

The broader question would be how to "neatly" provide an API for accessing this information.

gmaz42 commented 2 years ago

Some ideas I have had so far:

attach the stats to the policy object which is currently passed as 1st argument
implement context (#255), then always check if there is such stats object in the context and use it if it is there (somewhat dirtier approach)

Another important bit of information when doing performance analysis is: which node did reply to my request? This can help spot degraded nodes.

By the way, I have not been able to analyze traffic in the cluster I am currently using to detect what could possibly cause slow replies and I am falling back to capturing traffic with tcpdump for later analysis; @khaf do you know about a better approach? I would be interested in knowing what happens during one of these GetObject / PutObject calls from server perspective.

khaf commented 2 years ago

@gmaz42 Sorry it seems this message flew under my radar. Both your approaches have major downsides: first one can lead to race conditions, and the second one needs a major API overhaul. The only way I see we can do something in that vein would be to include the stats with the record. Problem being that if the transaction hits an error mid transaction, you will lose the stats (since the returned record will be nil.)

I still do not know what could be best approach. At the moment we are considering exporting the stats outside the client, but even that comes with its own set of issues. We are still looking into it though.

gmaz42 commented 2 years ago

No problem, thanks for your reply @khaf!

first one can lead to race conditions

You could use the object extracted from context only if it satisfies a specific interface, and leave to caller to correctly implement that interface in a goroutine-safe fashion. This way you would support also the use-case that caller provides a single-use object before each call (that has no race condition concern). The client library would not even need to provide any solid implementation of this stats object, could use a private one for its tests alone.

I still do not know what could be best approach. At the moment we are considering exporting the stats outside the client, but even that comes with its own set of issues. We are still looking into it though.

I understand; thanks. Meanwhile my troubleshooting has gone a few levels lower in the stack and I am currently checking under which conditions timeouts happen (from Aerospike client PoV) because the process is not pulling out TCP packets fast enough from the TCP stack (this is a kubernetes scenario, so multiple processes and kernel scale internals are at play).

khaf commented 2 years ago

The callback approach for instrumentation sounds nice. Let me think about it for a bit and see where it takes me.

gmaz42 commented 2 years ago

Sure, take your time; I can also provide a mock example in a draft PR if you like. Thanks @khaf

P.S. I also left some question/doubt here (shameless plug)

aerospike / aerospike-client-go

Measuring connection bytes read/written #356