Open wbpcode opened 2 years ago
cc @mattklein123 @rojkov
Have you ever profiled Envoy with tools like perf, so that we can find some hotspots from the result?
> Have you ever profiled Envoy with tools like perf, so that we can find some hotspots from the result?
I will do it this weekend. 😄
This feels like a duplicate of #13412.
> This feels like a duplicate of #13412.
Yep. Looks like the degrading never stops.
Here are some flame graphs. I only have two different binaries with symbols, but considering the huge performance gap between them, I think they are enough as a reference.
v1.17.4: https://drive.google.com/file/d/1aZuC54PmIXsQu7k88jBeU-HENOEiQXgT/view?usp=sharing
v1.12.2: https://drive.google.com/file/d/1o6kV8T2J5nCs3m7lHaXSdXaudjzNJ9Sv/view?usp=sharing
v1.20.x https://drive.google.com/file/d/1lU539aRFzOCrR16EFTjKwu0MtzV03_kr/view?usp=sharing
cc @mattklein123
There are no obvious hotspots in the flame graphs, just a very uniform slowdown. More encapsulation and abstraction gradually reduce the performance of Envoy.
I'm not sure if I buy that incremental changes in encapsulation and abstraction are likely to cause that much slowdown. I glanced at one of the flame-graphs and it's hard to know what to improve there.
It might help to capture some perf-graphs from an instrumented binary that can provide more detail on what's going on. The purely sampled view we get from these flame-graphs might (for example, as a guess) hide some effects of changes in the way we make the networking system calls. E.g. we spend a lot of time in the 1.20 flamegraph in the kernel. Are we making larger numbers of calls for smaller chunks of data, for any reason?
It would be nice to get a controlled repro of this, using Nighthawk preferably, and then run Envoy under callgrind or compiled with various perf tools, so we can see call counts for various functions. Using this we could compare how we are structuring our system calls to the earlier version of Envoy.
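On the system-call question specifically, one rough way to compare two versions under identical load is to let strace summarize per-syscall counts while the benchmark runs. This is only a sketch (strace adds its own overhead), and the PID lookup and duration are placeholders:

```sh
# Summarize syscall counts for a running Envoy worker during ~30s of load.
# Comparing readv/writev/epoll_wait counts between 1.12 and 1.20 for the same
# number of requests would show whether we now make more, smaller calls.
timeout -s INT 30 strace -c -f -p "$(pgrep -x envoy)"
```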
> I'm not sure if I buy that incremental changes in encapsulation and abstraction are likely to cause that much slowdown. I glanced at one of the flame-graphs and it's hard to know what to improve there.
>
> It might help to capture some perf-graphs from an instrumented binary that can provide more detail on what's going on. The purely sampled view we get from these flame-graphs might (for example, as a guess) hide some effects of changes in the way we make the networking system calls. E.g. we spend a lot of time in the 1.20 flamegraph in the kernel. Are we making larger numbers of calls for smaller chunks of data, for any reason?
>
> It would be nice to get a controlled repro of this, using Nighthawk preferably, and then run Envoy under callgrind or compiled with various perf tools, so we can see call counts for various functions. Using this we could compare how we are structuring our system calls to the earlier version of Envoy.
I will try to do more investigation.
Related to @jmarantz comment, over time we have generally moved to secure by default configuration. This almost universally is at odds with "horse race" benchmarks. So we will need to very carefully tease apart any differences. The work is useful but it's a lot more complicated than just looking at flame graphs. I also agree with @jmarantz that trying to run a reproducer under something like cachegrind will be a lot more informative.
Just to point this out: Matt suggests cachegrind and I suggested callgrind; they are related and both very useful: `--tool=callgrind` vs `--tool=cachegrind`.
For the build I have used `--compilation_mode=opt --cxxopt=-g --cxxopt=-ggdb3`, but those options might be dated as I haven't been able to do much direct technical work lately. kcachegrind is handy for browsing the callgrind output.
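Putting those suggestions together, a rough sketch of the build-and-profile loop might look like this (assuming the usual //source/exe:envoy-static target and a local envoy.yaml; adjust paths for your checkout):

```sh
# Optimized build that keeps debug info (flags from the comment above).
bazel build --compilation_mode=opt --cxxopt=-g --cxxopt=-ggdb3 //source/exe:envoy-static

# Run under callgrind while the load generator drives traffic, then inspect call counts.
valgrind --tool=callgrind --callgrind-out-file=callgrind.out.envoy \
  bazel-bin/source/exe/envoy-static -c envoy.yaml --concurrency 1
callgrind_annotate callgrind.out.envoy | head -50   # or open the file in kcachegrind
```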
@mattklein123 @jmarantz Thanks very much for all your suggestions. I will put more effort into tracking and investigating this issue. 🌷
At present, I still subjectively think that it may be caused by a large number of small adjustments. For example, #13423 introduced a small update to replace all whitespace characters in the response code details, which brings some minor new overhead. Assuming there are many similar PRs, the accumulation of these small overheads could still have a large enough impact. And because they are very scattered, they may also be difficult to locate.
But this is just my personal guess. We still need more research to identify the problem and try to solve it.
Of course, if it is just because of some more secure default configuration, then at most only the documentation needs to be updated.
2021/11/27: callgrind.out files of v1.12.2 vs v1.20.x (100,000 HTTP/1.1 requests with wrk).
https://drive.google.com/drive/folders/1EWTjixvN43O8u24a_rJF8S6A4ePvFwC1?usp=sharing
First point: `encodeFormattedHeader` introduced ~3% extra CPU overhead.
Root cause: too much fine-grained buffer API access.
Related PR: #9825
Great -- would you be able to supply the parameters passed to Nighthawk (preferably) or whatever tool you were using to generate the load for these traces?
Thanks!
This data is great! It really shows the profiles look very different. Just sorting by 'self' and comparing the two views provides a ton of possible issues to go explore.
Did you use an "OptDebug" build or was this a stock build or something?
I used simple `hey` to generate 100,000 HTTP/1 requests. Here is the command: `hey -n 100000 -c 20 http://localhost:9090`.
I had not heard of `hey`. Is that this? https://github.com/rakyll/hey
> Did you use an "OptDebug" build or was this a stock build or something?
I used the non-stripped binary built with `ci/do_ci.sh bazel.release`. It finally worked, but I'm not sure it is OptDebug mode.
> I had not heard of `hey`. Is that this? https://github.com/rakyll/hey
Yes, I am generally used to using wrk, but it seems that wrk cannot generate a fixed amount of load.
Nighthawk (https://github.com/envoyproxy/nighthawk) is really what we want to converge to, as you get explicit control of open/closed loop, ability to generate http2, async vs concurrent, an indication of how many requests succeeded/failed, etc.
By "fixed amount of load" does this mean a finite number of requests? Or a fixed rate of requests?
The docker images are definitely not in OptDebug mode :) They are probably simply optimized, which is OK, but we'll have a lot less detail on call stacks, where in functions time is being spent, etc.
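For reference, a minimal closed-loop Nighthawk run against a local listener might look like the sketch below; the flags come from nighthawk_client's CLI, and the port, rate, and duration are placeholders to adapt to the setup above:

```sh
# Fixed-rate HTTP/1 load from one Nighthawk worker: 20 connections, 10k rps, 60s.
nighthawk_client --concurrency 1 --connections 20 --rps 10000 --duration 60 \
  http://127.0.0.1:9090/
```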
> Nighthawk (https://github.com/envoyproxy/nighthawk) is really what we want to converge to, as you get explicit control of open/closed loop, ability to generate http2, async vs concurrent, an indication of how many requests succeeded/failed, etc.
Thanks, I will try it in the coming test.
> The docker images are definitely not in OptDebug mode :) They are probably simply optimized, which is OK, but we'll have a lot less detail on call stacks, where in functions time is being spent, etc.
I see. Thanks. 🌷 I will try to build a new binary with the compile args you suggested and do some more investigation. But recently I only have enough time on weekends.
No worries -- the data you supplied is great. We'll definitely want to repro with NH though so we understand what we are comparing :)
My suspicion is that you've found something real, and based on the traces I looked at, there were changes in either tcmalloc's implementation, the way we allocate buffers, or both. The `encodeFormattedHeader` hot spot probably deserves a quick look also, as that is a relatively recent feature. Did you wind up turning that on?
What did you have running on localhost:8080? An Apache server or something?
> What did you have running on localhost:8080? An Apache server or something?
A multi-process Nginx that returns a simple short string directly.
> Did you wind up turning that on?
In fact, no. Nevertheless, the new version of `encodeFormattedHeader`/`encodeHeader` still introduces more overhead. The reason is that `encodeHeaders` calls `encodeFormattedHeader` several times and writes fine-grained data (a character, a space, a header name, etc.) to the buffer (watermark buffer) directly.
The new version does not use a local cache to speed up the writing of this fine-grained data.
I've created a PR #19115 to try to solve this problem.
@wbpcode do you mean the issue only exists in HTTP/1, or did you only test it using HTTP/1? I ask because I found that in your PR #19115 you only fix the HTTP/1 code. Thanks for the clarification.
> @wbpcode do you mean the issue only exists in HTTP/1, or did you only test it using HTTP/1? I ask because I found that in your PR #19115 you only fix the HTTP/1 code. Thanks for the clarification.
I only tested it using HTTP/1. And the codec is not the only problem. We need more investigation and more PRs to continue to alleviate this problem.
What's the Envoy CPU usage between 1.12 and 1.20? I ask because I remember some event handling is pushed to the next poller cycle, meaning Envoy trades latency for fairness. Since the benchmark client maintains 20 active requests/responses, it's likely the CPU usage of Envoy is lower in 1.20.
@lambdai
When I tested the throughput, both 1.12 and 1.20 used a full core. When I collected the callgrind data later, because it was executed under valgrind, I didn't pay attention to the CPU usage. Moreover, when analyzing the callgrind data, my main concern is the change in the proportion of overhead of different functions.
Via https://news.ycombinator.com/item?id=29457579 there are some good examples of projects that track performance over time:
Not to hijack this thread, but this might possibly be related: on our end we recently started looking at `slow_alloc()` dominating CPU time as seen in our view of Envoy via `perf top`:
cc: @fishcakez
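In case it helps reproduce this, here is a hedged sketch for turning a `perf top` observation like that into full call stacks (to see what is reaching tcmalloc's slow path); the PID lookup and sampling window are placeholders:

```sh
# Sample the running Envoy with call graphs for ~30 seconds, then group by symbol.
# (--call-graph dwarf can help if the binary was built without frame pointers.)
perf record -g -F 99 -p "$(pgrep -x envoy)" -- sleep 30
perf report --no-children --sort symbol
```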
Thank you for this information. We use Envoy in high-load projects. It is strange to see such a drop in performance in the new versions.
I hope the tests will be repeated with new versions of Envoy and that performance will improve over time.
@VA-PLE We are working towards this.
Are 1.3 and 1.7 versions of something other than Envoy? Those seem like very low version numbers.
I think it probably makes sense to do more benchmarking on HEAD once #19115 lands. Then we should capture more traces to find the next bottleneck.
> Are 1.3 and 1.7 versions of something other than Envoy? Those seem like very low version numbers.
>
> I think it probably makes sense to do more benchmarking on HEAD once #19115 lands. Then we should capture more traces to find the next bottleneck.
Sorry, please just ignore my previous comment. When I used wrk to test, I always tested v1.3 first (the cache may have been cold), which may have given me a wrong answer. I just retested: the TPS of v1.3 and v1.7 is very close, and in terms of latency v1.7 is slightly better, but v1.7 uses 20% more CPU than v1.3. However, the Nighthawk result is always that v1.3 is 20% better than v1.7. I'll continue testing.
@jiangdongzi It would be better to compare some newer versions. v1.3 and v1.7 look too old. 🤔
Yeah, they're way too old to be useful Envoy versions. I suspect @jiangdongzi is referring to an Istio version number or some other packaging of Envoy.
In any case, there are a lot of complexities to benchmarking Envoy well, and it's worth reading up on:
I found the material @htuch wrote but it's in the official docs, not a white paper: https://www.envoyproxy.io/docs/envoy/latest/faq/performance/how_to_benchmark_envoy
Some interesting results from perf: the `readDisable` calls from the HTTP/1 codec result in ~7% CPU overhead.
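For anyone who wants to reproduce that number, one hedged way to pull a single symbol out of a sampled profile (the real symbol is a mangled C++ name, so a case-insensitive grep is used):

```sh
# After a `perf record -g -p <envoy pid>` run, print the text report and filter
# for readDisable frames to see their overhead percentages.
perf report --no-children --stdio | grep -i readdisable
```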
I am on a machine with an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz, with --concurrency=1 and everything on one host. The proxy QPS can only reach about 10k, while nginx (worker_processes 1) reaches 27k QPS, a huge difference. And as the number of worker threads increases, the gap gets bigger and bigger. We are doing API gateway technology selection. Have you tested the performance comparison between Envoy and nginx under the same environmental conditions?
@ztgoto I have tested Envoy (envoy.filters.http.router) and Nginx (proxy_pass) as an HTTP proxy, both single-threaded and with 64 connections of load from fortio. Envoy has more than a 10% performance gain compared to Nginx in RPS. Can you show example configurations in your case?
> @ztgoto I have tested Envoy (envoy.filters.http.router) and Nginx (proxy_pass) as an HTTP proxy, both single-threaded and with 64 connections of load from fortio. Envoy has more than a 10% performance gain compared to Nginx in RPS. Can you show example configurations in your case?
envoy
```yaml
admin:
  access_log_path: /dev/null
  address:
    socket_address: {address: 0.0.0.0, port_value: 7777}
static_resources:
  listeners:
  - name: listener
    address:
      socket_address: {address: 0.0.0.0, port_value: 9104}
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: http_test
          codec_type: AUTO
          generate_request_id: false
          route_config:
            name: route
            virtual_hosts:
            - name: test
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route:
                  cluster: nginx
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
              dynamic_stats: false
  clusters:
  - name: "nginx"
    connect_timeout: "5s"
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    circuit_breakers:
      thresholds:
      - priority: DEFAULT
        max_connections: 1000000000
        max_pending_requests: 1000000000
        max_requests: 1000000000
        max_retries: 0
      - priority: HIGH
        max_connections: 1000000000
        max_pending_requests: 1000000000
        max_requests: 1000000000
        max_retries: 0
    load_assignment:
      cluster_name: nginx
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 192.168.1.100
                port_value: 9101
stats_config:
  stats_matcher:
    reject_all: true
```
nginx
```nginx
worker_processes 1;

events {
    worker_connections 1024;
}

http {
    include mime.types;
    default_type application/octet-stream;
    access_log off;
    sendfile on;
    keepalive_timeout 65;

    upstream test {
        keepalive 50;
        server 192.168.1.100:9101;
    }

    server {
        listen 9103;
        server_name 192.168.4.119;
        location / {
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            proxy_pass http://test;
        }
    }
}
```
@krajshiva @yanavlasov here's some anecdotes about performance gaps between nginx and envoy.
@KBaichoo may also be interested
@ztgoto PTAL @ https://www.envoyproxy.io/docs/envoy/latest/faq/performance/how_to_benchmark_envoy
In particular, can you disable circuit-breaking and re-run your benchmark (per recommendations in the benchmarking doc above). It's not clear to me if that's a bottleneck in your case, but you have not configured circuit-breaking on the nginx side so it seems appropriate to keep the behavior consistent.
Also, did you build your own Envoy from source? Or use a pre-built package? If you built it yourself, did you use `-c opt`?
@ztgoto What HTTP protocol is being used? HTTP/1 or HTTP/2? In particular for HTTP/2, in https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/protocol.proto#config-core-v3-http2protocoloptions, `max_concurrent_streams` defaults to 2147483647, which I've seen skew some benchmarks before.
> @ztgoto PTAL @ https://www.envoyproxy.io/docs/envoy/latest/faq/performance/how_to_benchmark_envoy
>
> In particular, can you disable circuit-breaking and re-run your benchmark (per recommendations in the benchmarking doc above). It's not clear to me if that's a bottleneck in your case, but you have not configured circuit-breaking on the nginx side so it seems appropriate to keep the behavior consistent.
>
> Also, did you build your own Envoy from source? Or use a pre-built package? If you built it yourself, did you use `-c opt`?
@jmarantz I use envoyproxy/envoy:v1.22.2 (--network host) for testing. The config file is as shown above; I don't know if there is anything wrong with it.
envoy
```
wrk -t 8 -c 32 -d 60s --latency 'http://127.0.0.1:9104/hello'
Running 1m test @ http://127.0.0.1:9104/hello
  8 threads and 32 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.76ms  378.73us   6.53ms   66.32%
    Req/Sec     1.45k    65.16     2.12k    71.77%
  Latency Distribution
     50%    2.69ms
     75%    3.02ms
     90%    3.33ms
     99%    3.67ms
  693601 requests in 1.00m, 156.77MB read
Requests/sec:  11553.84
Transfer/sec:      2.61MB

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
160976 101       20   0 2247104  54512  19432 S 100.3  0.1   2:13.85 envoy
 32221 root      20   0  706612   8384   1160 S  32.8  0.0   0:04.63 wrk
```
nginx
```
wrk -t 8 -c 32 -d 60s --latency 'http://127.0.0.1:9103/hello'
Running 1m test @ http://127.0.0.1:9103/hello
  8 threads and 32 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.12ms  107.53us  13.65ms   78.66%
    Req/Sec     3.56k   202.05     4.44k    76.00%
  Latency Distribution
     50%    1.12ms
     75%    1.18ms
     90%    1.24ms
     99%    1.37ms
  1701221 requests in 1.00m, 379.64MB read
Requests/sec:  28351.42
Transfer/sec:      6.33MB

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 58685 nobody    20   0   46588   2752   1116 R 100.0  0.0   0:10.66 nginx
 60538 root      20   0  701960   3728   1136 S  89.4  0.0   0:09.88 wrk
```
Hi, we have upgraded our Envoy version, which was ~5 years old, to the latest v1.25.3. We have executed a few perf tests and noticed a ~60% increase in CPU and a 5x increase in memory compared to the earlier version of Envoy with the same test. CPU went up to 400m and memory was ~120M. Would someone please review the flame graph and share findings on whether this can be further optimized? Thanks in advance.
Flame graph: https://github.com/scheruku-in/Envoy_Perf_Test/blob/main/envoy_highcpu.svg
Envoy config:
```json
{
  "listeners": [
    {
      "address": "tcp://0.0.0.0:17600",
      "ssl_context": {
        "ca_cert_file": "envoycacert.pem",
        "cert_chain_file": "cacertchain.pem",
        "private_key_file": "key.pem",
        "alpn_protocols": "h2,http/1.1",
        "alt_alpn_protocols": "http/1.1"
      },
      "filters": [
        {
          "type": "read",
          "name": "http_connection_manager",
          "config": {
            "access_log": [
              {
                "format": "[%START_TIME%] \"%REQ(X-FORWARDED-FOR)%\" - \"%REQ(USER-AGENT)%\" \"%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%\" %RESPONSE_CODE% %BYTES_RECEIVED% %BYTES_SENT% x-request-id = \"%REQ(X-REQUEST-ID)%\" x-global-transaction-id = \"%REQ(X-GLOBAL-TRANSACTION-ID)%\" \"%REQ(:AUTHORITY)%\" \"%UPSTREAM_HOST%\" \"%UPSTREAM_CLUSTER%\" rt=\"%DURATION%\" uct=\"%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%\" \n",
                "path": "/dev/stdout"
              }
            ],
            "generate_request_id": true,
            "codec_type": "auto",
            "idle_timeout_s": 120,
            "stat_prefix": "ingress_http",
            "route_config": {
              "virtual_hosts": [
                {
                  "name": "service",
                  "domains": ["*"],
                  "require_ssl": "all",
                  "routes": [
                    {
                      "timeout_ms": 120000,
                      "retry_policy": {
                        "retry_on": "gateway-error,connect-failure",
                        "num_retries": 120
                      },
                      "prefix": "/",
                      "cluster_header": "<cluster-header>"
                    }
                  ]
                }
              ]
            },
            "filters": [
              {
                "type": "decoder",
                "name": "router",
                "config": {}
              }
            ]
          }
        }
      ]
    }
  ],
  "admin": {
    "access_log_path": "/dev/stdout",
    "address": "tcp://127.0.0.1:8001"
  },
  "cluster_manager": {
    "clusters": [],
    "cds": {
      "cluster": {
        "name": "cds",
        "connect_timeout_ms": 120000,
        "type": "strict_dns",
        "lb_type": "round_robin",
        "hosts": [
          { "url": "tcp://localhost:8081" }
        ]
      },
      "refresh_delay_ms": 100
    },
    "outlier_detection": {
      "event_log_path": "/dev/stdout"
    }
  }
}
```
Thanks in advance.
Update from 2024:
Version | QPS |
---|---|
1.13 | 65741 |
1.14 | 60302 |
1.15 | 52966 |
1.16 | 49947 |
1.17 | 45257 |
1.18 | 42121 |
1.19 | 41800 |
1.20 | 39946 |
1.21 | 39649 |
1.22 | 47829 |
1.23 | 45302 |
1.24 | 42135 |
1.25 | 41102 |
1.26 | 39932 |
1.27 | 41539 |
1.28 | 36995 |
1.29 | 37273 |
1.30 | 37962 |
2024 update: returning the response directly (direct_response). The stress-testing tool is wrk. Versions: envoy 1.30.4, nginx 1.26.1. Config:
envoy
```sh
./envoy --concurrency 1 -c ./envoy.yaml
```

```yaml
static_resources:
  listeners:
  - name: listener
    address:
      socket_address: {address: 0.0.0.0, port_value: 9101}
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: http_test
          codec_type: AUTO
          generate_request_id: false
          route_config:
            name: route
            virtual_hosts:
            - name: test
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                direct_response:
                  status: 200
                  body:
                    inline_string: "{\"message\":\"hello\"}"
                response_headers_to_add:
                - header:
                    key: "Content-Type"
                    value: "application/json"
                # route:
                #   cluster: auth
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
              dynamic_stats: false
stats_config:
  stats_matcher:
    reject_all: true
```
nginx
```nginx
worker_processes 1;

events {
    worker_connections 1024;
}

http {
    include mime.types;
    default_type application/octet-stream;
    access_log off;
    sendfile on;
    keepalive_timeout 65;

    server {
        listen 9101;
        server_name 127.0.0.1;
        location / {
            default_type application/json;
            return 200 '{"message":"hello"}';
        }
        location = /50x.html {
            root html;
        }
    }
}
```
result:
envoy:
```
./wrk -t 8 -c 1000 -d 60s 'http://127.0.0.1:9101'
Running 1m test @ http://127.0.0.1:9101
  8 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    69.34ms    7.74ms 123.77ms   64.42%
    Req/Sec     1.81k   521.16     4.62k    38.42%
  864304 requests in 1.00m, 117.05MB read
Requests/sec:  14382.08
Transfer/sec:      1.95MB
```
nginx:
```
./wrk -t 8 -c 1000 -d 60s 'http://127.0.0.1:9101'
Running 1m test @ http://127.0.0.1:9101
  8 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    39.28ms   99.20ms   2.00s    95.30%
    Req/Sec     6.09k   505.22     7.36k    72.21%
  2908817 requests in 1.00m, 479.91MB read
  Socket errors: connect 0, read 4672, write 0, timeout 13
Requests/sec:  48404.51
Transfer/sec:      7.99MB
```
Flame graph (envoy): [image attached]
Flame graph (nginx): [image attached]
> 2024 update: returning the response directly (direct_response). The stress-testing tool is wrk. Versions: envoy 1.30.4, nginx 1.26.1. Config:
I have benchmarked Envoy with `direct_response` once with Fortio with keep-alive disabled. I noticed that Envoy does not close the connection immediately after sending the response, which leads to performance degradation. Your scenario differs from mine, but `direct_response` isn't commonly used and may not be fully optimized. I suggest testing Envoy in its typical workload as a router.
@zhxie I've done tests in routing scenarios before and posted the results above; the performance gap is relatively large.
I did some simple tests today and found that Envoy's performance seems to be getting worse and worse. Although I know features always come at some cost, those costs seem to be too much.
And although I know that performance is not the first goal of Envoy, with its continuous enhancement the performance seems to degrade too quickly. Here are some simple results with a single Envoy worker:
I think we should at least prevent it from further deterioration, and at the same time find ways to optimize it.
Here is my config yaml:
Back-end: a multi-process Nginx that returns a simple short string.
Client command: wrk -c 200 -t 2 -d 180s http://localhost:9090/