apple / swift-nio

Event-driven network application framework for high performance protocol servers & clients, non-blocking.
https://swiftpackageindex.com/apple/swift-nio/documentation
Apache License 2.0

[Feature Proposal] Sampling pipeline tracing functionality #1689

Open Lukasa opened 3 years ago

Lukasa commented 3 years ago

When running NIO applications in production it is possible to encounter rare, probabilistic issues. These issues happen with relatively low frequency and are, with the information available to the application developer, unpredictable. Generally they are the result of unforeseen interactions of components in a fairly complex system, often triggered by unexpected or unusual timing or asynchrony behaviours.

It is very rarely the case that there is sufficiently good logging in user code to diagnose this kind of interaction. In situations like this we tend to advise inserting things like the WritePCAPHandler from NIOExtras, or a "log everything" handler, into the channel pipeline to get further diagnostics. This idea is reasonable, but it requires a custom intervention that is not normally present in user code. Additionally, generating these logs can be expensive, so users are often forced to come up with systems where they only apply these handlers when appropriate, or risk slowing down their systems excessively in order to gather the logs.
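For concreteness, here is a rough sketch of the kind of "log everything" handler users end up writing by hand today. Everything here is illustrative rather than an existing NIO component: a real version would cover more events and use a proper logger instead of print.

```swift
import NIOCore

// Illustrative only: a pass-through handler that prints every event it sees.
final class LogEverythingHandler: ChannelDuplexHandler {
    typealias InboundIn = NIOAny
    typealias InboundOut = NIOAny
    typealias OutboundIn = NIOAny
    typealias OutboundOut = NIOAny

    func channelActive(context: ChannelHandlerContext) {
        print("[\(context.name)] channelActive")
        context.fireChannelActive()
    }

    func channelRead(context: ChannelHandlerContext, data: NIOAny) {
        print("[\(context.name)] channelRead: \(data)")
        context.fireChannelRead(data)
    }

    func write(context: ChannelHandlerContext, data: NIOAny, promise: EventLoopPromise<Void>?) {
        print("[\(context.name)] write: \(data)")
        context.write(data, promise: promise)
    }
}
```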

NIO can and should do more to help here. In particular, NIO is deeply aware of the ChannelPipeline and all messages that pass through it. ChannelPipeline is a first-party class that allows no subclassing, as is ChannelHandler, and all messages on a channel pipeline must pass through them. This gives us a powerful place to add introspection, if we wanted to do so.

Using those hooks, we could build a "pipeline tracer". This is, conceptually, something that logs out all events that occur to a ChannelPipeline. A hypothetical output might be:

[pipeline=0xdeadbeef index=1 handlertype=NIOSSLClientHandler] handler added
[pipeline=0xdeadbeef index=2 handlertype=HTTPRequestDecoder] handler added
[pipeline=0xdeadbeef index=3 handlertype=HTTPResponseEncoder] handler added
[pipeline=0xdeadbeef] channel called fireChannelActive
[pipeline=0xdeadbeef index=0 handlertype=HeadChannelHandler] received channelActive
[pipeline=0xdeadbeef index=0 handlertype=HeadChannelHandler] called fireChannelActive
[pipeline=0xdeadbeef index=1 handlertype=NIOSSLClientHandler] received channelActive
[pipeline=0xdeadbeef index=1 handlertype=NIOSSLClientHandler data=ByteBuffer(212 bytes)] called write
...snip...

That is, this would produce a sequential log of events happening on the pipeline from the perspective of the pipeline. Because of the central position of the ChannelPipeline, it is not possible for events to go missing, nor is it possible for us to get confused about the ordering: we will always know exactly what happened and in what order.

This would be a powerful debugging tool. Structured carefully, it could even allow programmatic analysis of the data, making it possible to visually draw out the state of the ChannelPipeline as well as the events that fire. Timestamps could potentially be included as well to give an idea of the delay between certain events. This can be thought of as "logging on steroids": it gives a complete picture of the status of the channel pipeline without requiring that any of the channel handlers participate in the logging process.

We can potentially make this richer by providing hooks that allow users to customise the traced representation of both data types and channel handlers. This would allow users to, for example, include the state of an object at the time the log was emitted. If we wanted to we could go even further and provide a hook on ChannelHandlerContext to inject entirely additional trace information, allowing tools like the HTTP/2 channel handler to log out state machine changes into the same trace.
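As a purely hypothetical sketch of what such a hook could look like (nothing like this protocol exists in NIO today; the name is made up), user code might conform its types to something along these lines:

```swift
import NIOCore

// Hypothetical customisation hook; neither the protocol nor a tracer that
// consumes it exists in NIO today.
protocol CustomTraceRepresentable {
    /// A concise, cheap-to-compute description used in pipeline trace lines.
    var traceDescription: String { get }
}

// For example, buffers could be described the way the example trace above
// does: "ByteBuffer(212 bytes)".
extension ByteBuffer: CustomTraceRepresentable {
    var traceDescription: String { "ByteBuffer(\(self.readableBytes) bytes)" }
}
```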

The reason this behaviour is not on by default is that it is incredibly expensive. Logging each event like this will drastically increase the cost of pipeline walks, and it would be unbearable to have it on by default. To this end, if we wanted to add this functionality we'd almost certainly want to pair it with a sampling capability.

What I mean by this is that users would (probably on a bootstrap?) configure the "rate" at which they want pipeline tracing to occur. The default value would be 0 (i.e. never), but users could, say, pick a ratio of channels that will have this tracing enabled. By and large we would recommend that users tune this value in production to the highest rate whose performance cost they can tolerate. When the ratio is sufficiently low (e.g. 1/5000) the branch predictor should make most channels pay nothing for this logging, as it will correctly conclude that the logging branches will not be taken. The affected channels will see a performance drop associated with this logging, which will show up as latency and throughput outliers, but as long as the value is tunable I think this is acceptable. Many users will choose to plumb this value through in configuration, meaning that if a deployment is exhibiting unhealthy behaviour they can reload some of their systems with higher sampling rates to try to isolate the issue.
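To make the per-channel roll concrete, here is a rough sketch under stated assumptions: the sampler type and the commented-out option are hypothetical, and only the bootstrap/initializer shape is existing NIO API.

```swift
import NIOCore
import NIOPosix

// Hypothetical sampler: the decision is made once per channel, at creation time.
struct PipelineTraceSampler {
    /// Fraction of channels that get full pipeline tracing; 0 means never.
    var sampleRate: Double

    func shouldTrace() -> Bool {
        sampleRate > 0 && Double.random(in: 0..<1) < sampleRate
    }
}

let sampler = PipelineTraceSampler(sampleRate: 1.0 / 5000.0)
let group = MultiThreadedEventLoopGroup(numberOfThreads: System.coreCount)

let bootstrap = ServerBootstrap(group: group)
    .childChannelInitializer { channel in
        if sampler.shouldTrace() {
            // Hypothetical: enable the built-in pipeline tracer for this channel
            // only, e.g. via a channel option or a bootstrap-level setting.
        }
        return channel.eventLoop.makeSucceededFuture(())
    }
```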

I'm curious as to what the rest of the team, as well as the wider community, think about this idea. Would it be helpful for investigating production incidents?

PeterAdams-A commented 3 years ago

Seems like an interesting idea. Need to make the performance great.

On the inbound side, presumably a user could get something similar by collecting a stack trace on each event.

We could also consider offering other, cheaper things, like simple event counts, which might be revealing?

Lukasa commented 3 years ago

Stack traces can lose delays and other things, and may not be cheaper to gather.

avolokhov commented 3 years ago

If the error happens every once in a million interactions and tracing is set to 1/5000, the probability of accidentally catching it is impossibly low. Is there any chance to make it event-driven? Something like an in-memory ringbuffer of preconfigured size collecting the trace and a method that can dump it if exception/error/suspicion is caught?

Lukasa commented 3 years ago

> If the error happens every once in a million interactions and tracing is set to 1/5000, the probability of accidentally catching it is impossibly low. Is there any chance to make it event-driven? Something like an in-memory ringbuffer of preconfigured size collecting the trace and a method that can dump it if exception/error/suspicion is caught?

Yes, in principle, but I don't want to make it the default. The reason I'm nervous is that the cost is not just the I/O of hitting the logging system, it's also the cost of formatting the logs. That can potentially be horrendously expensive, and logging this to a ring buffer is not going to make it vastly cheaper. A likely API model would be that you'd supply a function we handed the logs to, and that function could certainly arrange to have a ring buffer to which it writes. In those cases the necessary scaffolding is there. We could even productionise it if needed.
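As a rough illustration of that scaffolding, a user-supplied sink could look something like the following. Everything here is an assumption about a hypothetical "we hand your closure the trace lines" API, and it ignores thread-safety for brevity (in practice it would live on the channel's event loop).

```swift
// Illustrative ring-buffer sink: cheap appends, dump only when something goes wrong.
final class TraceRingBuffer {
    private var storage: [String]
    private var nextIndex = 0
    private var count = 0

    init(capacity: Int) {
        self.storage = Array(repeating: "", count: capacity)
    }

    /// The closure handed to the hypothetical tracer: append only, no I/O.
    func record(_ line: String) {
        storage[nextIndex] = line
        nextIndex = (nextIndex + 1) % storage.count
        count = min(count + 1, storage.count)
    }

    /// Dump the retained tail of the trace when an error or suspicion is hit.
    func dump() -> [String] {
        if count < storage.count {
            return Array(storage[0..<count])
        }
        return Array(storage[nextIndex...]) + Array(storage[..<nextIndex])
    }
}
```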

However, I want to push back a bit on the "one in a million vs one in five thousand" case. Computing the probability of tracing an error event is straightforward: the probability of hitting an error multiplied by the probability of tracing. That is, the probability of tracing a one-in-a-million error if you trace one in 5000 transactions is one in 5 billion. This sounds pretty bad! But remember that we get to roll the dice repeatedly. Another way of restating "one in 5 billion" is "if we run 5 billion transactions, we can expect to sample this error once". So the question of whether 1/5000 works becomes one of load and of observation time. What load is the system under? If the system is handling one million transactions per second, we can reach our goal in 5000 seconds, or about one hour and twenty minutes. If the system is handling ten thousand transactions per second, we can reach our goal in 500,000 seconds, or nearly six days. But this can be tweaked: we can always double our sample rate, which will halve the time needed to get to our likely sampling event.
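The same arithmetic as a tiny helper, purely for illustration:

```swift
// Back-of-the-envelope estimate of how long until a failing transaction is also sampled.
func expectedSecondsToFirstTrace(
    errorProbability: Double,      // e.g. 1.0 / 1_000_000
    sampleRate: Double,            // e.g. 1.0 / 5_000
    transactionsPerSecond: Double  // e.g. 1_000_000
) -> Double {
    // Expected transactions until error and sampling coincide, divided by load.
    1.0 / (errorProbability * sampleRate) / transactionsPerSecond
}

// 1-in-a-million error, 1/5000 sampling:
//   at 1_000_000 tps ≈ 5_000 s (~1h20m); at 10_000 tps ≈ 500_000 s (~6 days).
```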

The nice thing here is that we can control the sample rate along with our load. High-load systems can get away with lower sample rates, as even one-in-a-million events happen extremely frequently on those systems. On the other hand, lower-load systems can probably bear higher sample rates, as they necessarily have less stringent latency and throughput requirements (they're handling less load).

In general I think sampling is a really useful methodology for this kind of intrusive logging. At scale, all possible bugs are inevitable, and even probabilistically you can reasonably expect to reproduce them under tracing. At lower scale, you can trace more aggressively and increase your odds of catching the problem in a trace.

avolokhov commented 3 years ago

Fair point re: probabilities. On the logging side, I think the general direction for Logging is to get logging events off-thread as fast as possible, so formatting and I/O are done asynchronously. For infrequent errors, if we're logging at least one log line per transaction, the overhead of logging on errors will become negligible. But you're right about the scaffolding: once we have a way to get the data out of the pipeline, when/how it hits the disk is an implementation detail.

glbrntt commented 3 years ago

> ChannelPipeline is a first-party class that allows no subclassing, as is ChannelHandler

I assume you mean ChannelHandlerContext rather than ChannelHandler here.


> I'm curious as to what the rest of the team, as well as the wider community, think about this idea. Would it be helpful for investigating production incidents?

Yes, having an overall picture of events as they flow through the pipeline could be incredibly valuable. I have found when diagnosing issues in a NIO stack that a lot of time is often spent trying to triangulate where the issue is (which is difficult when the NIO-provided handlers don't emit logs). As an example, I believe having this would have made diagnosing the issue which #1624 fixed much, much easier.


More of an implementation question, but how would this apply to child channels? I suspect we'd want a way to configure whether or not a child channel inherits its traceability from the parent.

Lukasa commented 3 years ago

> More of an implementation question, but how would this apply to child channels? I suspect we'd want a way to configure whether or not a child channel inherits its traceability from the parent.

I am...nervous about that? That may want to be chosen by the individual child channels. It's a bit tricky as we don't have a way to "register" a child channel with its parent.

The reason I'm nervous about it is that it can lead to bad outcomes. For example, a gRPC client that opens a single HTTP/2 connection for the entirety of its lifetime rolls the dice exactly once. If it hits, it logs all traffic for the duration of its run. If it misses, it logs none. That seems bad to me.

glbrntt commented 3 years ago

Oh, that's sort of what I was digging at: in what cases would the child channel be configured? I.e. if the http/2 connection channel got lucky, would the http/2 stream channels on that connection all be traced, or would each roll the dice? What if the connection channel was unlucky, would we still roll the dice and log just at the stream-channel level?

Lukasa commented 3 years ago

The easiest thing (and probably the right thing to do, coincidentally) is to have all channels roll separately, even when they are child channels.

avolokhov commented 3 years ago

Imagine NIO handlers were emitting normal log events with some requestId-ish sampling key, with the logging backend filtering out unsampled events early in the pipeline. Would it solve the same problem? Sampling outside of the NIO framework would potentially allow us to tweak the sampling rate of a live instance (which can be super useful when an error is happening on only a subset of the fleet).

Lukasa commented 3 years ago

The sampling rate can be configurable at Channel creation time, which allows on-the-fly sample rate tweaking.

Doing this anywhere else risks swamping the logging backend, I think.
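To illustrate what "configurable at Channel creation time" could look like from the user's side, here is a hedged sketch using a shared, lock-protected rate. All of the names below are made up for illustration; only NIOLockedValueBox is existing API.

```swift
import NIOConcurrencyHelpers

// A process-wide, mutable sample rate that operators can raise on a live
// instance; each new channel reads it exactly once and rolls its own dice.
let traceSampleRate = NIOLockedValueBox<Double>(0.0)  // default: tracing off

// Admin / config-reload path: raise the rate while investigating an incident.
func setTraceSampleRate(_ rate: Double) {
    traceSampleRate.withLockedValue { $0 = rate }
}

// Channel-creation path: decide whether this particular channel gets traced.
func shouldTraceNewChannel() -> Bool {
    let rate = traceSampleRate.withLockedValue { $0 }
    return rate > 0 && Double.random(in: 0..<1) < rate
}
```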