Thank you for the bug report @lzielezinska. Nighthawk reuses significant portions of Envoy's code, including the code that establishes TLS sessions. It is possible that you have discovered a hot spot in the Envoy code.
To confirm this, would you be able to provide a few more details?
We can look to see if there is anything obvious in the outputs and counters. Once we have this, we can try to reproduce this problem and try to profile the execution.
Hello, let me share some info:
1. Full command line (both cases): https://github.com/envoyproxy/nighthawk/files/8802648/853_bug_nighthawk_client_run.txt
2. Nighthawk output (+counters) for the execution without the custom transport socket: 853_bug_nighthawk_run_without_ts.txt
3. We run it against the Nighthawk test server with the base test-server.yaml as a ConfigMap (Kubernetes), per https://github.com/envoyproxy/nighthawk/blob/main/source/server/README.md (see the sketch below).

Screenshot: CPU usage without transport socket
Screenshot: CPU usage with transport socket
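For context, a minimal sketch of how a test server with that config might be started locally, assuming a locally built nighthawk_test_server binary and the repository's test-server.yaml; the path and the Envoy-style --config-path flag are assumptions here, and the Kubernetes ConfigMap setup mentioned above would mount the same file instead.

# Assumed local start of the Nighthawk test server with the base config;
# adjust the path to wherever test-server.yaml lives in your setup.
bazel-bin/nighthawk_test_server --config-path source/server/test-server.yaml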
Thank you for providing the details @pawelrxyz. The lowered CPU usage suggests some thread/lock contention. I will try to reproduce this problem and profile it to identify the contended path.
I reran the experiment and collected CPU and contention profiles. I wasn't able to reproduce the problem and did not see any unexpected behavior when I executed the test.
nighthawk_output_standard.txt is from the experiment that ran without the custom TLS socket, i.e. session resumption was enabled. This is confirmed by the counters:
counters {
name: "ssl.handshake"
value: 886082
}
counters {
name: "ssl.session_reused"
value: 885932
}
nighthawk_output_custom_socket.txt is from the experiment that ran with the custom TLS socket, i.e. session resumption was disabled. The counter ssl.session_reused is missing, so sessions weren't reused.
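A quick way to compare these counters across the two attached outputs is a plain grep over the files named above; the exact line format depends on the --output-format used, so treat this as a sketch:

# Pull the TLS handshake and session-reuse counters out of both runs for a side-by-side look.
grep -E 'ssl\.(handshake|session_reused)' nighthawk_output_standard.txt nighthawk_output_custom_socket.txt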
Each test took about 5 min of wall time. I observed the following:
Since contentions went down, I no longer suspect lock contention. The increased CPU usage in the NH test server seems to be directly related to the need to negotiate TLS sessions, which is expected. See the diff below, which compares the two runs:
The increased CPU usage in the NH client seems to be related to more calls to sched_yield(), which I suspect is due to threads waiting for TLS session negotiation on the server side. When that is filtered out, the remaining increased CPU usage is also related to the negotiation of TLS sessions. See the diff below, which compares the two runs:
Since I wasn't able to reproduce this, I could use your help here. For one, you could review this experiment and point out whether I diverged from yours anywhere. Even better, if you can still reproduce this, could you provide both CPU and contention profiles from your run, so that we can see what the NH client and test server are doing while you are observing the decreased CPU usage?
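As a rough sketch (not necessarily the exact tooling used for the profiles above), a CPU profile of the client or test server can be captured on Linux with perf; the PID placeholder and sampling parameters are illustrative:

# Sample the running nighthawk_client (or nighthawk_test_server) for 60 seconds at 99 Hz
# with call graphs, then write a text report that can be attached to this issue.
perf record -F 99 -g -p <nighthawk_pid> -- sleep 60
perf report --stdio > cpu_profile.txt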
Title: Results for the Nighthawk client configured not to reuse the session are 10 times lower (and utilization of the CPU/RAM assigned to the Nighthawk server is very low, less than 10%)
Description: Command used to run nighthawk:
nighthawk_client -p http2 --max-requests-per-connection 1 --address-family v4 "https://<IP>:<PORT>" --concurrency auto --rps "1000" --connections "$10" --duration "60" --request-body-size "400" --transport-socket '{"name": "envoy.transport_sockets.tls", "typed_config": { "@type":"type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext","max_session_keys":"0"}}'
When the parameter --transport-socket is not passed, we have noticed 10 times better results and almost 100% utilization of the CPU assigned to nighthawk_server. We would like to run Nighthawk in a way that the client does not reuse the session; the parameter max_session_keys: "0" configures that.
Reproduction steps: Run nighthawk_client with the parameter:
nighthawk_client ... --transport-socket '{"name": "envoy.transport_sockets.tls", "typed_config": { "@type":"type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext","max_session_keys":"0"}}' ...
This forces the Nighthawk client not to reuse the session, but it makes the results very low and leaves the Nighthawk server almost idle.
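For readability, here is a sketch of the same invocation with the transport socket pretty-printed; the address, port, and connection count are placeholders. Per Envoy's UpstreamTlsContext documentation, max_session_keys defaults to 1, and setting it to 0 disables client-side session resumption, so every connection performs a full TLS handshake.

# Sketch only: placeholders for the target and connection count, JSON reformatted for clarity.
nighthawk_client -p http2 \
  --max-requests-per-connection 1 \
  --address-family v4 \
  --concurrency auto \
  --rps 1000 \
  --connections 10 \
  --duration 60 \
  --request-body-size 400 \
  --transport-socket '{
    "name": "envoy.transport_sockets.tls",
    "typed_config": {
      "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext",
      "max_session_keys": "0"
    }
  }' \
  "https://<IP>:<PORT>"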