envoyproxy / nighthawk

L7 (HTTP/HTTPS/HTTP2/HTTP3) performance characterization tool
Apache License 2.0
361 stars 81 forks source link

Nighthawk doesn't correctly drain the connection pool #873

Open mum4k opened 2 years ago

mum4k commented 2 years ago

The pool termination code is here.

While the code does register an idle callback, that callback only gets called if the pool is draining. Nighthawk doesn't seem to trigger pool drain anywhere. We should call drainConnections(DrainAndDelete) while waiting for the pool to drain.

The current state can cause Nighthawk to crash, because the shutdown thread owns the client dispatcher and may end up reading packets while shutting down resulting in a call to pure virtual method.

This can be reproduced by running high QPS test using HTTP3/QUIC, inducing a situation where Nighthawk has to deal with some late responses after it started shutting down.

oschaaf commented 2 years ago

We used to call drainConnections() .. but checking this call was removed in https://github.com/envoyproxy/nighthawk/commit/15b86be8ff62b6ab6de41511d37d445f240c56d4#diff-d638190c1ff04b6501032412ab8ae3d4322513fd75e43006ac399fc829be8328L112

That was a while ago, and I have no recollection of any reason why that call had to be removed. Therefore, I think that it may have been dropped by accident :-(

mum4k commented 2 years ago

Thank you for adding the details @oschaaf. This gives us reassurance that adding it back it should do no harm and should be an easy fix.

dubious90 commented 2 years ago

Did a quick check into this and this might require an upstream envoy change. https://github.com/envoyproxy/envoy/pull/16544/files obfuscated the connection pool behind a new object called PoolData. https://github.com/envoyproxy/envoy/pull/16865/files is an example change made previously by @mum4k that re-exposed some methods. A similar change is probably required to allow us to drain connections.