akka / akka-http

The Streaming-first HTTP server/module of Akka
https://doc.akka.io/docs/akka-http

Client pool monitoring #1025

Open jrudolph opened 7 years ago

jrudolph commented 7 years ago

We have several tickets about hung pools, and it's always hard to find out what's going on. It would be nice to have some (internal / unstable?) API that lets users check what the pool is doing right now (and why). Using the flight recorder would be nice, because it would allow recording recent events so we can understand how the pool ended up in its current state.
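As a rough illustration of the "record recent events" idea, a bounded buffer that keeps only the last N pool events could look like the sketch below. `PoolEventRecorder`, its `Event` record and the event kinds are hypothetical names, not part of akka-http, and this is a plain Java ring buffer (Java 16+ for records) rather than Akka's own flight recorder.

```java
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not an akka-http API: keeps only the most recent
// `capacity` pool events so the trail leading to a hung state can be dumped.
public class PoolEventRecorder {
    public record Event(Instant at, String kind, String detail) {}

    private final int capacity;
    private final ArrayDeque<Event> events = new ArrayDeque<>();

    public PoolEventRecorder(int capacity) {
        if (capacity <= 0) throw new IllegalArgumentException("capacity must be positive");
        this.capacity = capacity;
    }

    // Synchronized because different pool stages may report from different threads.
    public synchronized void record(String kind, String detail) {
        if (events.size() == capacity) {
            events.removeFirst(); // drop the oldest event to make room
        }
        events.addLast(new Event(Instant.now(), kind, detail));
    }

    // Oldest-first copy of the retained events.
    public synchronized List<Event> snapshot() {
        return new ArrayList<>(events);
    }

    public static void main(String[] args) {
        PoolEventRecorder recorder = new PoolEventRecorder(3);
        recorder.record("request-enqueued", "GET /a");
        recorder.record("slot-connecting", "slot 0");
        recorder.record("slot-idle", "slot 0");
        recorder.record("request-dispatched", "GET /a -> slot 0");
        // Only the last three events survive; "request-enqueued" was evicted.
        for (Event e : recorder.snapshot()) {
            System.out.println(e.kind() + " " + e.detail());
        }
    }
}
```

A fixed capacity keeps the memory cost constant, which matters if recording is left on in production.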

See also suggestions by @ShaneDelmore in https://github.com/akka/akka-http/issues/741#issuecomment-287599964:

Adding some visibility into the state of finite resources would help as well. There are a lot of issues related to hung pools, and the answer always seems to be "consume all of the bytes". If people are repeatedly failing to consume all of the bytes, then I think they would benefit from some way to monitor which endpoints are open with no data flowing through them. If the server is just going to go unresponsive after 4 concurrent connections, then a way to see how many connections are currently in use would be great.

Maybe even a debugging mode to help detect suspicious usage, something like: "This request has been open for 5 minutes with no bytes sent. Did you mean to do that?" If someone has 500 different endpoints and one of them is using the pool incorrectly, how would they even begin to troubleshoot that?
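The suggested debugging mode boils down to tracking, per in-flight request, when bytes last moved and flagging requests that have been silent past a threshold. A minimal sketch, assuming hypothetical hooks (`opened`, `bytesFlowed`, `completed`) that the pool would call; none of these exist in akka-http. Times are passed in as milliseconds so the logic can be exercised without waiting.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical "debugging mode" helper: remembers when each in-flight request
// last moved any bytes and reports the ones silent for longer than a threshold.
public class StallDetector {
    private final long thresholdMillis;
    private final Map<String, Long> lastActivity = new HashMap<>();

    public StallDetector(Duration threshold) {
        this.thresholdMillis = threshold.toMillis();
    }

    public synchronized void opened(String requestId, long nowMillis) {
        lastActivity.put(requestId, nowMillis);
    }

    // Assumed to be called whenever response entity bytes are consumed.
    public synchronized void bytesFlowed(String requestId, long nowMillis) {
        lastActivity.computeIfPresent(requestId, (id, old) -> nowMillis);
    }

    public synchronized void completed(String requestId) {
        lastActivity.remove(requestId);
    }

    public synchronized List<String> suspicious(long nowMillis) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastActivity.entrySet()) {
            long silentFor = nowMillis - e.getValue();
            if (silentFor >= thresholdMillis) {
                result.add(e.getKey() + " has been open for " + Duration.ofMillis(silentFor).toMinutes()
                        + " minutes with no bytes flowing. Did you mean to do that?");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        StallDetector detector = new StallDetector(Duration.ofMinutes(5));
        detector.opened("GET /orders", 0);
        detector.opened("GET /users", 0);
        detector.bytesFlowed("GET /users", Duration.ofMinutes(4).toMillis());
        // At t = 6 min only GET /orders has been silent for 5+ minutes.
        detector.suspicious(Duration.ofMinutes(6).toMillis()).forEach(System.out::println);
    }
}
```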

gosubpl commented 7 years ago

Any ideas on that API? I was planning to start working on something around the pool.

ShaneDelmore commented 7 years ago

@gosubpl Given that Akka is frequently used to handle bounded resources, is there a pattern already used to see the statistics on those resources that could be emulated?

jrudolph commented 7 years ago

No, I don't think there's anything to learn from right now. One problem is that the pool is built from multiple asynchronous parts, so any monitoring can only provide an approximate view of the current state, one that may not be fully consistent at all times because of in-flight elements.

What about a visitor interface with several methods that will be called from different places for different kinds of events happening in the infrastructure?
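One way to picture such a visitor: an interface with one no-op default method per kind of event, so implementations override only what they care about. The method names below are hypothetical and do not correspond to akka-http internals. The example implementation keeps approximate counters, which also shows the caveat above: callbacks arrive from asynchronous stages, so a reading can briefly lag behind in-flight elements.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class PoolVisitorSketch {
    // Hypothetical visitor; each method would be called from a different part of the pool.
    public interface PoolVisitor {
        default void requestEnqueued(String request) {}
        default void requestDispatched(String request, int slot) {}
        // Assumed to fire once the response entity has been fully consumed.
        default void responseCompleted(String request, int slot) {}
        default void connectionFailed(int slot, Throwable cause) {}
    }

    // Example visitor: approximate counts of queued requests and busy slots.
    public static final class CountingVisitor implements PoolVisitor {
        private final AtomicInteger queued = new AtomicInteger();
        private final AtomicInteger busySlots = new AtomicInteger();

        @Override public void requestEnqueued(String request) { queued.incrementAndGet(); }

        @Override public void requestDispatched(String request, int slot) {
            queued.decrementAndGet();
            busySlots.incrementAndGet();
        }

        @Override public void responseCompleted(String request, int slot) { busySlots.decrementAndGet(); }

        public int queued() { return queued.get(); }
        public int busySlots() { return busySlots.get(); }
    }

    public static void main(String[] args) {
        CountingVisitor v = new CountingVisitor();
        v.requestEnqueued("GET /a");
        v.requestEnqueued("GET /b");
        v.requestDispatched("GET /a", 0);
        System.out.println("queued=" + v.queued() + " busy=" + v.busySlots());
    }
}
```

Default methods keep the interface open for new event kinds without breaking existing visitors, which fits an internal / unstable API.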

jypma commented 7 years ago

Great to see progress on this! Some alternative ideas for the API:

  1. Log to a separate LoggingAdapter, and have monitoring implementations pick events off the standard logging EventBus. This may be simpler and less of an API change, since many of the involved classes already have a LoggingAdapter. But should that adapter be the one receiving the monitoring data? Definitely not at DEBUG level, since then it would end up mixed in with everything else.
  2. Publish to a new EventBus dedicated to statistics / monitoring. Cleaner, but it would need somewhat larger changes.
  3. Expose JMX statistics (had to mention it)
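To make option 2 concrete, here is a sketch of a dedicated monitoring bus where subscribers register by event class, kept entirely separate from logging. This is plain Java with hypothetical names (`MonitoringBus`, `SlotStateChanged`, `RequestQueued`); it does not use Akka's `EventBus` or `EventStream`. Each event carries the target `hostPort` so subscribers don't have to recover it from a log source.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

public class MonitoringBus {
    // Marker for monitoring events, separate from ordinary log messages.
    public interface PoolEvent {}
    public record SlotStateChanged(String hostPort, int slot, String from, String to) implements PoolEvent {}
    public record RequestQueued(String hostPort, int queueLength) implements PoolEvent {}

    private final Map<Class<?>, List<Consumer<PoolEvent>>> subscribers = new ConcurrentHashMap<>();

    // Subscribe to one concrete event type; the handler receives it already cast.
    public <E extends PoolEvent> void subscribe(Class<E> type, Consumer<? super E> handler) {
        subscribers.computeIfAbsent(type, t -> new CopyOnWriteArrayList<>())
                   .add(e -> handler.accept(type.cast(e)));
    }

    // Delivers to subscribers of the event's exact class; no subclass matching.
    public void publish(PoolEvent event) {
        subscribers.getOrDefault(event.getClass(), List.of()).forEach(h -> h.accept(event));
    }

    public static void main(String[] args) {
        MonitoringBus bus = new MonitoringBus();
        bus.subscribe(SlotStateChanged.class,
                e -> System.out.println(e.hostPort() + " slot " + e.slot() + ": " + e.from() + " -> " + e.to()));
        bus.publish(new SlotStateChanged("some-host:8293", 0, "Idle", "Busy"));
        bus.publish(new RequestQueued("some-host:8293", 3)); // no subscriber, so nothing happens
    }
}
```

Compared with option 1, consumers get typed fields instead of log strings, at the cost of a new public surface to maintain.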

While working on my "dirty" AspectJ approach to this (going into Kamon if it ends up working), I also noticed that e.g. PoolConductor.SlotSelector (which knows the state of each connection, which is very valuable) doesn't know which host+port it's managing. I'm currently hacking that out of its log field (bleh). So some extra information may need to be passed down into these classes.

rbudzko commented 7 years ago

What about a visitor interface with several methods that will be called from different places for different kinds of events happening in the infrastructure?

I might be biased, but from my small survey of my 'Akka circle', it seems everyone mostly wants to know why some part of the system is clogged. How many things are in flight, or other metrics such as the number of processed requests/events, seems very secondary. It seems more important to decide what we would like to see than how, technically, the pool will hand that information over to the user.

Basically, visiting events could build up a 'history' that lets the developer understand what has happened. Unfortunately, that might be difficult without meaningful correlation (within the scope of one 'thing') between the events shared with the world outside the pool.

I wonder (maybe it is already in place) whether we can correlate 'work' across the 'asynchronous parts' you mentioned. A submitted piece of work could then emit events which, thanks to the correlation, would be useful for drawing a history of what the pool has been doing. It obviously doesn't have to be real time, because I think we are after knowledge of what happened, not the current state.
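The correlation idea can be sketched by assigning an id when work is submitted and having every stage tag its events with that id, so the history of one request can be pulled out afterwards. `CorrelatedHistory` and its methods are hypothetical; nothing here claims akka-http already carries such an id between its stages.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: events from different asynchronous stages are grouped
// by a correlation id assigned at submission time.
public class CorrelatedHistory {
    private final AtomicLong nextId = new AtomicLong();
    private final Map<Long, List<String>> history = new LinkedHashMap<>();

    // Called once when work is submitted; the returned id would travel with the request.
    public long submit(String description) {
        long id = nextId.incrementAndGet();
        record(id, "submitted: " + description);
        return id;
    }

    // Called by any stage that touches the work item.
    public synchronized void record(long id, String event) {
        history.computeIfAbsent(id, k -> new ArrayList<>()).add(event);
    }

    public synchronized List<String> historyOf(long id) {
        return List.copyOf(history.getOrDefault(id, List.of()));
    }

    public static void main(String[] args) {
        CorrelatedHistory h = new CorrelatedHistory();
        long a = h.submit("GET /a");
        long b = h.submit("GET /b");
        h.record(a, "dispatched to slot 0");
        h.record(b, "waiting for a free slot");
        h.record(a, "response headers received");
        System.out.println(h.historyOf(b));
    }
}
```

Since the history is read after the fact, it doesn't need a consistent snapshot of the pool, which sidesteps the in-flight consistency problem mentioned earlier in the thread.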

Anyway, I believe a default visitor that reconstructs this history and dumps it into some sink, like a logger or an event bus (see @jypma's ideas), might be necessary to see any adoption among users who live in a very cold world.

jypma commented 7 years ago

Just another random observation: it would be really nice if things like

akka.actor.RepointableActorRef - Aborting tcp connection to some-host:8293 because of upstream failure: TCP idle-timeout encountered on connection to [some-host:8293], no bytes passed in the last 1 minute

would also be available through whichever stats mechanism is chosen.
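For example, the idle-timeout abort above could be emitted as a structured event, so a stats consumer reads host:port and the idle period as fields instead of parsing the log text. A minimal sketch that counts such aborts per host; `IdleTimeoutStats` and `ConnectionAborted` are hypothetical names, not akka-http types.

```java
import java.time.Duration;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical structured counterpart of the idle-timeout log line.
public class IdleTimeoutStats {
    public record ConnectionAborted(String hostPort, Duration idleFor) {}

    // TreeMap keeps hosts sorted so snapshots print in a stable order.
    private final Map<String, Integer> abortsPerHost = new TreeMap<>();

    public synchronized void onAborted(ConnectionAborted event) {
        abortsPerHost.merge(event.hostPort(), 1, Integer::sum);
    }

    public synchronized Map<String, Integer> snapshot() {
        return new TreeMap<>(abortsPerHost);
    }

    public static void main(String[] args) {
        IdleTimeoutStats stats = new IdleTimeoutStats();
        stats.onAborted(new ConnectionAborted("some-host:8293", Duration.ofMinutes(1)));
        stats.onAborted(new ConnectionAborted("some-host:8293", Duration.ofMinutes(1)));
        stats.onAborted(new ConnectionAborted("other-host:443", Duration.ofMinutes(1)));
        System.out.println(stats.snapshot());
    }
}
```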

jypma commented 7 years ago

I just finished some horrible AspectJ code that does in fact monitor the client pool's queue and connection states, here. But don't look at it, really, it will make you cry.

kjorg50 commented 3 years ago

@jypma just wondering, any update on this or https://github.com/akka/akka-http/pull/1443?