envoyproxy / nighthawk

L7 (HTTP/HTTPS/HTTP2/HTTP3) performance characterization tool
Apache License 2.0
361 stars 81 forks source link

Make `ProcessImpl::run()` more robust #317

Open oschaaf opened 4 years ago

oschaaf commented 4 years ago

Observed a CI run associated to https://github.com/envoyproxy/nighthawk/pull/316

[ RUN      ] ClientTest.AutoConcurrencyRun
[14:11:50.442223][31224][C] [source/client/process_impl.cc:124] assert failure: shutdown_. Details: shutdown not called before destruction.
[14:11:50.448097][31224][C] [bazel-out/k8-dbg/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:83] Caught Aborted, suspect faulting address 0x79f8
[14:11:50.448206][31224][C] [bazel-out/k8-dbg/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:70] Backtrace (use tools/stack_decode.py to get line numbers):
[14:11:50.449640][31224][C] [bazel-out/k8-dbg/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:71] Envoy version: 0/1.14.0-dev/redacted/DEBUG/BoringSSL
[14:11:50.670944][31224][C] [bazel-out/k8-dbg/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #0: Envoy::SignalAction::sigHandler() [0x683abe7]
[14:11:50.866130][31224][C] [bazel-out/k8-dbg/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #1: __tsan::CallUserSignalHandler() [0x2789350]
external/bazel_tools/tools/test/test-setup.sh: line 310: 31224 Aborted                 "${TEST_PATH}" "$@" 2>&1

Full log: https://circleci.com/api/v1.1/project/github/envoyproxy/nighthawk/11758/output/102/0?file=true&allocation-id=5e679b96b55f4d48eecdfc16-0-build%2F5ECCDF5B

The assert may or may not be related to the Envoy update in #316, it has been first observed there.

The assert that gets triggered resides here: https://github.com/envoyproxy/nighthawk/blob/b631a17bfbba1c1b46ac7b3b1bf5adb4b05719d7/source/client/process_impl.cc#L123

However, after scrutinizing the code over at https://github.com/envoyproxy/nighthawk/blob/b631a17bfbba1c1b46ac7b3b1bf5adb4b05719d7/source/client/process_impl.cc#L416 I think it the most likely thing that happened is that there was an exception in ProcessImpl::run() in somewhere after shutdown_ gets set to false. This code should be revisited to properly unwind in the (unlikely) case of partial initialization. At the very least it should log a fatal message when an exception gets thrown halfway, so that the message associated tot he exception doesn't get squelched by the assert above.

oschaaf commented 4 years ago

With #355 merged, I can no longer reproduce this, and when backing it out, with some effort, I can.

The asserts that triggers when the worker stacks unwind squelch the original exception(s) that started it all, thereby masking the actual problem.

We could remove our own assertion in that code paths, which ensure that workers are properly shutdown(), but then there will be other ones originating from Envoy's code base that will still fire, so that part is not trivial to resolve. It's also not easy to rewrite the startup code; there is a place where shutdown() will have to be called to cleanly stop, but we're not fully initialized yet.

On the flip side, #355 resolves the first real world instance of hitting this, so the pressure is off for now. Leaving this open for future discoverability and a reminder to pursue making the related code in process_impl.cc more robust.