Open wbpcode opened 2 years ago
cc @mattklein123 @rojkov
Have you ever profiled Envoy with tools like perf, so that we can find some hotspots from the result?
> Have you ever profiled Envoy with tools like perf, so that we can find some hotspots from the result?
I will do it this weekend. 😄
This feels like a duplicate of #13412.
> This feels like a duplicate of #13412.
Yep. Looks like the degrading never stops.
Here are some flame graphs. I only have two different binaries with symbols, but considering the huge performance gap between them, I think they are enough as a reference.
v1.17.4: https://drive.google.com/file/d/1aZuC54PmIXsQu7k88jBeU-HENOEiQXgT/view?usp=sharing
v1.12.2: https://drive.google.com/file/d/1o6kV8T2J5nCs3m7lHaXSdXaudjzNJ9Sv/view?usp=sharing
v1.20.x https://drive.google.com/file/d/1lU539aRFzOCrR16EFTjKwu0MtzV03_kr/view?usp=sharing
cc @mattklein123
There are no obvious hotspots in the flame graphs, just a very uniform slowdown. More encapsulation and abstraction gradually reduce the performance of Envoy.
I'm not sure if I buy that incremental changes in encapsulation and abstraction are likely to cause that much slowdown. I glanced at one of the flame-graphs and it's hard to know what to improve there.
It might help to capture some perf-graphs from an instrumented binary that can provide more detail on what's going on. The purely sampled view we get from these flame-graphs might (for example, as a guess) hide some effects of changes in the way we make the networking system calls. E.g. we spend a lot of time in the 1.20 flamegraph in the kernel. Are we making larger numbers of calls for smaller chunks of data, for any reason?
It would be nice to get a controlled repro of this, using Nighthawk preferably, and then run Envoy under callgrind or compiled with various perf tools, so we can see call counts for various functions. Using this we could compare how we are structuring our system calls to the earlier version of Envoy.
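On the system-call question specifically, one rough way to compare two versions under identical load is to let strace summarize per-syscall counts while the benchmark runs. This is only a sketch (strace adds its own overhead), and the PID lookup and duration are placeholders:

```sh
# Summarize syscall counts for a running Envoy worker during ~30s of load.
# Comparing readv/writev/epoll_wait counts between 1.12 and 1.20 for the same
# number of requests would show whether we now make more, smaller calls.
timeout -s INT 30 strace -c -f -p "$(pgrep -x envoy)"
```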
> I'm not sure if I buy that incremental changes in encapsulation and abstraction are likely to cause that much slowdown. I glanced at one of the flame-graphs and it's hard to know what to improve there.
>
> It might help to capture some perf-graphs from an instrumented binary that can provide more detail on what's going on. The purely sampled view we get from these flame-graphs might (for example, as a guess) hide some effects of changes in the way we make the networking system calls. E.g. we spend a lot of time in the 1.20 flamegraph in the kernel. Are we making larger numbers of calls for smaller chunks of data, for any reason?
>
> It would be nice to get a controlled repro of this, using Nighthawk preferably, and then run Envoy under callgrind or compiled with various perf tools, so we can see call counts for various functions. Using this we could compare how we are structuring our system calls to the earlier version of Envoy.
I will try to do more investigation.
Related to @jmarantz comment, over time we have generally moved to secure by default configuration. This almost universally is at odds with "horse race" benchmarks. So we will need to very carefully tease apart any differences. The work is useful but it's a lot more complicated than just looking at flame graphs. I also agree with @jmarantz that trying to run a reproducer under something like cachegrind will be a lot more informative.
Just to point this out: Matt suggests cachegrind and I suggested callgrind; they are related and both very useful: `--tool=callgrind` vs `--tool=cachegrind`.
For the build I have used `--compilation_mode=opt --cxxopt=-g --cxxopt=-ggdb3`, but those options might be dated as I haven't been able to do much direct technical work lately. kcachegrind is handy for browsing the callgrind output.
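Putting those suggestions together, a rough sketch of the build-and-profile loop might look like this (assuming the usual //source/exe:envoy-static target and a local envoy.yaml; adjust paths for your checkout):

```sh
# Optimized build that keeps debug info (flags from the comment above).
bazel build --compilation_mode=opt --cxxopt=-g --cxxopt=-ggdb3 //source/exe:envoy-static

# Run under callgrind while the load generator drives traffic, then inspect call counts.
valgrind --tool=callgrind --callgrind-out-file=callgrind.out.envoy \
  bazel-bin/source/exe/envoy-static -c envoy.yaml --concurrency 1
callgrind_annotate callgrind.out.envoy | head -50   # or open the file in kcachegrind
```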
@mattklein123 @jmarantz Thanks very much for all your suggestions. I will put more effort into tracking and investigating this issue. 🌷
At present, I still subjectively think that it may be caused by a large number of small adjustments. For example, #13423 introduced a small update to replace all whitespace characters in the response code details, which brings some minor new overhead. Assuming there are many similar PRs, the accumulation of these small overheads could still have a large enough impact. And because they are very scattered, they may also be difficult to locate.
But this is just my personal guess. We still need more research to identify the problem and try to solve it.
Of course, if it is just because of some more secure default configuration, then at most only the documentation needs to be updated.
2021/11/27: callgrind.out files of v1.12.2 vs v1.20.x (100,000 HTTP/1.1 requests with wrk).
https://drive.google.com/drive/folders/1EWTjixvN43O8u24a_rJF8S6A4ePvFwC1?usp=sharing
First point: `encodeFormattedHeader` introduced ~3% extra CPU overhead.
Root cause: too much fine-grained buffer API access.
Related PR: #9825
Great -- would you be able to supply the parameters passed to Nighthawk (preferably) or whatever tool you were using to generate the load for these traces?
Thanks!
This data is great! It really shows the profiles look very different. Just sorting by 'self' and comparing the two views provides a ton of possible issues to go explore.
Did you use an "OptDebug" build or was this a stock build or something?
I used simple `hey` to generate 100,000 HTTP/1 requests. Here is the command: `hey -n 100000 -c 20 http://localhost:9090`.
I had not heard of `hey`. Is that this? https://github.com/rakyll/hey
> Did you use an "OptDebug" build or was this a stock build or something?
I used the non-stripped binary built with `ci/do_ci.sh bazel.release`. It finally worked, but I'm not sure it is OptDebug mode.
> I had not heard of `hey`. Is that this? https://github.com/rakyll/hey
Yes, I am generally used to using wrk, but it seems that wrk cannot generate a fixed amount of load.
Nighthawk (https://github.com/envoyproxy/nighthawk) is really what we want to converge to, as you get explicit control of open/closed loop, ability to generate http2, async vs concurrent, an indication of how many requests succeeded/failed, etc.
By "fixed amount of load" does this mean a finite number of requests? Or a fixed rate of requests?
The docker images are definitely not in OptDebug mode :) They are probably simply optimized, which is OK, but we'll have a lot less detail on call stacks, where in functions time is being spent, etc.
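For reference, a minimal closed-loop Nighthawk run against a local listener might look like the sketch below; the flags come from nighthawk_client's CLI, and the port, rate, and duration are placeholders to adapt to the setup above:

```sh
# Fixed-rate HTTP/1 load from one Nighthawk worker: 20 connections, 10k rps, 60s.
nighthawk_client --concurrency 1 --connections 20 --rps 10000 --duration 60 \
  http://127.0.0.1:9090/
```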
> Nighthawk (https://github.com/envoyproxy/nighthawk) is really what we want to converge to, as you get explicit control of open/closed loop, ability to generate http2, async vs concurrent, an indication of how many requests succeeded/failed, etc.
Thanks, I will try it in the coming test.
> The docker images are definitely not in OptDebug mode :) They are probably simply optimized, which is OK, but we'll have a lot less detail on call stacks, where in functions time is being spent, etc.
I see. Thanks. 🌷 I will try to build a new binary with the compile args you suggested and do some more investigation. But recently I only have enough time on weekends.
No worries -- the data you supplied is great. We'll definitely want to repro with NH though so we understand what we are comparing :)
My suspicion is that you've found something real, and based on the traces I looked at, there were changes in either tcmalloc's implementation, the way we allocate buffers, or both. The `encodeFormattedHeader` hot spot probably deserves a quick look also, as that is a relatively recent feature. Did you wind up turning that on?
What did you have running on localhost:8080? An Apache server or something?
> What did you have running on localhost:8080? An Apache server or something?
A multi-process Nginx that returns a simple short string directly.
> Did you wind up turning that on?
In fact, no. Nevertheless, the new version of `encodeFormattedHeader`/`encodeHeader` still introduces more overhead. The reason is that `encodeHeaders` calls `encodeFormattedHeader` several times and writes fine-grained data (a character, a space, a header name, etc.) to the buffer (watermark buffer) directly.
The new version does not use a local cache to speed up the writing of this fine-grained data.
I've created a PR #19115 to try to solve this problem.
@wbpcode do you mean the issue only exists in HTTP/1, or did you only test it using HTTP/1? I ask because I found that in your PR #19115 you only fix the HTTP/1 code. Thanks for the clarification.
> @wbpcode do you mean the issue only exists in HTTP/1, or did you only test it using HTTP/1? I ask because I found that in your PR #19115 you only fix the HTTP/1 code. Thanks for the clarification.
I only tested it using HTTP/1. And the codec is not the only problem. We need more investigation and more PRs to continue to alleviate this problem.
What's the Envoy CPU usage between 1.12 and 1.20? I ask because I remember some event handling is pushed to the next poller cycle, meaning Envoy trades latency for fairness. Since the benchmark client maintains 20 active requests/responses, it's likely the CPU usage of Envoy is lower in 1.20.
@lambdai
When I tested the throughput, both 1.12 and 1.20 used a full core. When I collected the callgrind data later, because it was executed under valgrind, I didn't pay attention to the CPU usage. Moreover, when analyzing the callgrind data, my main concern is the change in the proportion of overhead of different functions.
Via https://news.ycombinator.com/item?id=29457579 there are some good examples of projects that track performance over time:
Not to hijack this thread, but this might possibly be related: on our end we recently started looking at `slow_alloc()` dominating CPU time as seen in our view of Envoy via `perf top`:
cc: @fishcakez
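In case it helps reproduce this, here is a hedged sketch for turning a `perf top` observation like that into full call stacks (to see what is reaching tcmalloc's slow path); the PID lookup and sampling window are placeholders:

```sh
# Sample the running Envoy with call graphs for ~30 seconds, then group by symbol.
# (--call-graph dwarf can help if the binary was built without frame pointers.)
perf record -g -F 99 -p "$(pgrep -x envoy)" -- sleep 30
perf report --no-children --sort symbol
```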
Thank you for this information. We use Envoy in high-load projects. It is strange to see such a drop in performance in the new versions.
I hope the tests will be repeated with new versions of Envoy and that performance will improve over time.
@VA-PLE We are working towards this.
Are 1.3 and 1.7 versions of something other than Envoy? Those seem like very low version numbers.
I think it probably makes sense to do more benchmarking on HEAD once #19115 lands. Then we should capture more traces to find the next bottleneck.
> Are 1.3 and 1.7 versions of something other than Envoy? Those seem like very low version numbers.
>
> I think it probably makes sense to do more benchmarking on HEAD once #19115 lands. Then we should capture more traces to find the next bottleneck.
Sorry, please just ignore my previous comment. When I used wrk to test, I always tested v1.3 first (the cache may have been cold), which may have given me a wrong answer. I just retested: the TPS of v1.3 and v1.7 is very close, and in terms of latency v1.7 is slightly better, but v1.7 uses 20% more CPU than v1.3. However, the Nighthawk result is always that v1.3 is 20% better than v1.7. I'll continue testing.
@jiangdongzi It would be better to compare some newer versions. v1.3 and v1.7 look too old. 🤔
Yeah, they're way too old to be useful Envoy versions. I suspect @jiangdongzi is referring to an Istio version number or some other packaging of Envoy.
In any case, there are a lot of complexities to benchmarking Envoy well, and it's worth reading up on:
I found the material @htuch wrote but it's in the official docs, not a white paper: https://www.envoyproxy.io/docs/envoy/latest/faq/performance/how_to_benchmark_envoy
Some interesting results from perf: the `readDisable` calls from the HTTP/1 codec result in ~7% CPU overhead.
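For anyone who wants to reproduce that number, one hedged way to pull a single symbol out of a sampled profile (the real symbol is a mangled C++ name, so a case-insensitive grep is used):

```sh
# After a `perf record -g -p <envoy pid>` run, print the text report and filter
# for readDisable frames to see their overhead percentages.
perf report --no-children --stdio | grep -i readdisable
```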
I am on a machine with an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz, with --concurrency=1 and everything on one host. The proxy QPS can only reach about 10k, while nginx (worker_processes 1) reaches 27k QPS, a huge difference. And as the number of worker threads increases, the gap gets bigger and bigger. We are doing API gateway technology selection. Have you tested the performance comparison between Envoy and nginx under the same environmental conditions?
@ztgoto I have tested Envoy (envoy.filters.http.router) and Nginx (proxy_pass) as an HTTP proxy, both single-threaded and with 64 connections of load from fortio. Envoy has more than a 10% performance gain compared to Nginx in RPS. Can you show example configurations in your case?
> @ztgoto I have tested Envoy (envoy.filters.http.router) and Nginx (proxy_pass) as an HTTP proxy, both single-threaded and with 64 connections of load from fortio. Envoy has more than a 10% performance gain compared to Nginx in RPS. Can you show example configurations in your case?
envoy
```yaml
admin:
  access_log_path: /dev/null
  address:
    socket_address: {address: 0.0.0.0, port_value: 7777}
static_resources:
  listeners:
  - name: listener
    address:
      socket_address: {address: 0.0.0.0, port_value: 9104}
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: http_test
          codec_type: AUTO
          generate_request_id: false
          route_config:
            name: route
            virtual_hosts:
            - name: test
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route:
                  cluster: nginx
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
              dynamic_stats: false
  clusters:
  - name: "nginx"
    connect_timeout: "5s"
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    circuit_breakers:
      thresholds:
      - priority: DEFAULT
        max_connections: 1000000000
        max_pending_requests: 1000000000
        max_requests: 1000000000
        max_retries: 0
      - priority: HIGH
        max_connections: 1000000000
        max_pending_requests: 1000000000
        max_requests: 1000000000
        max_retries: 0
    load_assignment:
      cluster_name: nginx
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 192.168.1.100
                port_value: 9101
stats_config:
  stats_matcher:
    reject_all: true
```
nginx
```nginx
worker_processes 1;

events {
    worker_connections 1024;
}

http {
    include mime.types;
    default_type application/octet-stream;
    access_log off;
    sendfile on;
    keepalive_timeout 65;

    upstream test {
        keepalive 50;
        server 192.168.1.100:9101;
    }

    server {
        listen 9103;
        server_name 192.168.4.119;
        location / {
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            proxy_pass http://test;
        }
    }
}
```
@krajshiva @yanavlasov here's some anecdotes about performance gaps between nginx and envoy.
@KBaichoo may also be interested
@ztgoto PTAL @ https://www.envoyproxy.io/docs/envoy/latest/faq/performance/how_to_benchmark_envoy
In particular, can you disable circuit-breaking and re-run your benchmark (per recommendations in the benchmarking doc above). It's not clear to me if that's a bottleneck in your case, but you have not configured circuit-breaking on the nginx side so it seems appropriate to keep the behavior consistent.
Also, did you build your own Envoy from source? Or use a pre-built package? If you built it yourself, did you use `-c opt`?
@ztgoto What HTTP protocol is being used? HTTP/1 or HTTP/2? In particular for HTTP/2, in https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/protocol.proto#config-core-v3-http2protocoloptions, `max_concurrent_streams` defaults to 2147483647, which I've seen skew some benchmarks before.
> @ztgoto PTAL @ https://www.envoyproxy.io/docs/envoy/latest/faq/performance/how_to_benchmark_envoy
>
> In particular, can you disable circuit-breaking and re-run your benchmark (per recommendations in the benchmarking doc above). It's not clear to me if that's a bottleneck in your case, but you have not configured circuit-breaking on the nginx side so it seems appropriate to keep the behavior consistent.
>
> Also, did you build your own Envoy from source? Or use a pre-built package? If you built it yourself, did you use `-c opt`?
@jmarantz I use envoyproxy/envoy:v1.22.2 (--network host) for testing. The config file is as shown above; I don't know if there is anything wrong with it.
envoy
```
wrk -t 8 -c 32 -d 60s --latency 'http://127.0.0.1:9104/hello'
Running 1m test @ http://127.0.0.1:9104/hello
  8 threads and 32 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.76ms  378.73us   6.53ms   66.32%
    Req/Sec     1.45k    65.16     2.12k    71.77%
  Latency Distribution
     50%    2.69ms
     75%    3.02ms
     90%    3.33ms
     99%    3.67ms
  693601 requests in 1.00m, 156.77MB read
Requests/sec:  11553.84
Transfer/sec:      2.61MB

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
160976 101       20   0 2247104  54512  19432 S 100.3  0.1   2:13.85 envoy
 32221 root      20   0  706612   8384   1160 S  32.8  0.0   0:04.63 wrk
```
nginx
```
wrk -t 8 -c 32 -d 60s --latency 'http://127.0.0.1:9103/hello'
Running 1m test @ http://127.0.0.1:9103/hello
  8 threads and 32 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.12ms  107.53us  13.65ms   78.66%
    Req/Sec     3.56k   202.05     4.44k    76.00%
  Latency Distribution
     50%    1.12ms
     75%    1.18ms
     90%    1.24ms
     99%    1.37ms
  1701221 requests in 1.00m, 379.64MB read
Requests/sec:  28351.42
Transfer/sec:      6.33MB

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 58685 nobody    20   0   46588   2752   1116 R 100.0  0.0   0:10.66 nginx
 60538 root      20   0  701960   3728   1136 S  89.4  0.0   0:09.88 wrk
```
Hi, we have upgraded our Envoy version, which was ~5 years old, to the latest v1.25.3. We have executed a few perf tests and noticed a ~60% increase in CPU and a 5x increase in memory compared to the earlier version of Envoy with the same test. CPU went up to 400m and memory was ~120M. Would someone please review the flame graph and share findings on whether this can be further optimized? Thanks in advance.
Flame graph: https://github.com/scheruku-in/Envoy_Perf_Test/blob/main/envoy_highcpu.svg
Envoy config:
```json
{
  "listeners": [
    {
      "address": "tcp://0.0.0.0:17600",
      "ssl_context": {
        "ca_cert_file": "envoycacert.pem",
        "cert_chain_file": "cacertchain.pem",
        "private_key_file": "key.pem",
        "alpn_protocols": "h2,http/1.1",
        "alt_alpn_protocols": "http/1.1"
      },
      "filters": [
        {
          "type": "read",
          "name": "http_connection_manager",
          "config": {
            "access_log": [
              {
                "format": "[%START_TIME%] \"%REQ(X-FORWARDED-FOR)%\" - \"%REQ(USER-AGENT)%\" \"%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%\" %RESPONSE_CODE% %BYTES_RECEIVED% %BYTES_SENT% x-request-id = \"%REQ(X-REQUEST-ID)%\" x-global-transaction-id = \"%REQ(X-GLOBAL-TRANSACTION-ID)%\" \"%REQ(:AUTHORITY)%\" \"%UPSTREAM_HOST%\" \"%UPSTREAM_CLUSTER%\" rt=\"%DURATION%\" uct=\"%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%\" \n",
                "path": "/dev/stdout"
              }
            ],
            "generate_request_id": true,
            "codec_type": "auto",
            "idle_timeout_s": 120,
            "stat_prefix": "ingress_http",
            "route_config": {
              "virtual_hosts": [
                {
                  "name": "service",
                  "domains": ["*"],
                  "require_ssl": "all",
                  "routes": [
                    {
                      "timeout_ms": 120000,
                      "retry_policy": {
                        "retry_on": "gateway-error,connect-failure",
                        "num_retries": 120
                      },
                      "prefix": "/",
                      "cluster_header": "<cluster-header>"
                    }
                  ]
                }
              ]
            },
            "filters": [
              {
                "type": "decoder",
                "name": "router",
                "config": {}
              }
            ]
          }
        }
      ]
    }
  ],
  "admin": {
    "access_log_path": "/dev/stdout",
    "address": "tcp://127.0.0.1:8001"
  },
  "cluster_manager": {
    "clusters": [],
    "cds": {
      "cluster": {
        "name": "cds",
        "connect_timeout_ms": 120000,
        "type": "strict_dns",
        "lb_type": "round_robin",
        "hosts": [
          { "url": "tcp://localhost:8081" }
        ]
      },
      "refresh_delay_ms": 100
    },
    "outlier_detection": {
      "event_log_path": "/dev/stdout"
    }
  }
}
```
Thanks in advance.
Update from 2024:
Version | QPS |
---|---|
1.13 | 65741 |
1.14 | 60302 |
1.15 | 52966 |
1.16 | 49947 |
1.17 | 45257 |
1.18 | 42121 |
1.19 | 41800 |
1.20 | 39946 |
1.21 | 39649 |
1.22 | 47829 |
1.23 | 45302 |
1.24 | 42135 |
1.25 | 41102 |
1.26 | 39932 |
1.27 | 41539 |
1.28 | 36995 |
1.29 | 37273 |
1.30 | 37962 |
2024 update: returning the response directly (direct_response). The stress-testing tool is wrk. Versions: envoy 1.30.4, nginx 1.26.1. Config:
envoy
```sh
./envoy --concurrency 1 -c ./envoy.yaml
```

```yaml
static_resources:
  listeners:
  - name: listener
    address:
      socket_address: {address: 0.0.0.0, port_value: 9101}
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: http_test
          codec_type: AUTO
          generate_request_id: false
          route_config:
            name: route
            virtual_hosts:
            - name: test
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                direct_response:
                  status: 200
                  body:
                    inline_string: "{\"message\":\"hello\"}"
                response_headers_to_add:
                - header:
                    key: "Content-Type"
                    value: "application/json"
                # route:
                #   cluster: auth
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
              dynamic_stats: false
stats_config:
  stats_matcher:
    reject_all: true
```
nginx
```nginx
worker_processes 1;

events {
    worker_connections 1024;
}

http {
    include mime.types;
    default_type application/octet-stream;
    access_log off;
    sendfile on;
    keepalive_timeout 65;

    server {
        listen 9101;
        server_name 127.0.0.1;
        location / {
            default_type application/json;
            return 200 '{"message":"hello"}';
        }
        location = /50x.html {
            root html;
        }
    }
}
```
result:
envoy:
```
./wrk -t 8 -c 1000 -d 60s 'http://127.0.0.1:9101'
Running 1m test @ http://127.0.0.1:9101
  8 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    69.34ms    7.74ms 123.77ms   64.42%
    Req/Sec     1.81k   521.16     4.62k    38.42%
  864304 requests in 1.00m, 117.05MB read
Requests/sec:  14382.08
Transfer/sec:      1.95MB
```
nginx:
```
./wrk -t 8 -c 1000 -d 60s 'http://127.0.0.1:9101'
Running 1m test @ http://127.0.0.1:9101
  8 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    39.28ms   99.20ms   2.00s    95.30%
    Req/Sec     6.09k   505.22     7.36k    72.21%
  2908817 requests in 1.00m, 479.91MB read
  Socket errors: connect 0, read 4672, write 0, timeout 13
Requests/sec:  48404.51
Transfer/sec:      7.99MB
```
Flame graph (envoy): [image attached]
Flame graph (nginx): [image attached]
> 2024 update: returning the response directly (direct_response). The stress-testing tool is wrk. Versions: envoy 1.30.4, nginx 1.26.1. Config:
I have benchmarked Envoy with `direct_response` once with Fortio with keep-alive disabled. I noticed that Envoy does not close the connection immediately after sending the response, which leads to performance degradation. Your scenario differs from mine, but `direct_response` isn't commonly used and may not be fully optimized. I suggest testing Envoy in its typical workload as a router.
@zhxie I've done tests in routing scenarios before and posted the results above; the performance gap is relatively large.
I did some simple tests today and found that Envoy's performance seems to be getting worse and worse. Although I know features always come at some cost, those costs seem to be too much.
And although I know that performance is not the first goal of Envoy, with its continuous enhancement the performance seems to degrade too quickly. Here are some simple results with a single Envoy worker:
I think we should at least prevent it from further deterioration, and at the same time find ways to optimize it.
Here is my config yaml:
Back-end: a multi-process Nginx that returns a simple short string.
Client command: wrk -c 200 -t 2 -d 180s http://localhost:9090/