envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Performance Issues with Envoy #5536

Closed voley55 closed 4 years ago

voley55 commented 5 years ago

Benchmarking Envoy and comparing performance against HAProxy

Setup:
  LB: Envoy/HAProxy
  Backend: Nginx
  Benchmarking tool: wrk (https://github.com/wg/wrk)

Envoy Config: Concurrency = 4

static_resources:
  listeners:
  - name: test
    address:
      socket_address:
        protocol: TCP
        address: 0.0.0.0
        port_value: 8090
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        config:
          stat_prefix: ingress_http
          generate_request_id: false
          route_config:
            name: test_routes
            virtual_hosts:
            - name: test_service
              domains:
              - "*"
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: test_backend
          http_filters:
          - name: envoy.router
            config:
              dynamic_stats: false
  clusters:
  - name: test_backend
    connect_timeout: 0.25s
    hosts:
    - socket_address:
        address: 172.16.x.x
        port_value: 8000
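
Note that the worker concurrency is not part of the config file itself; it is passed on the command line. A hedged sketch of how the proxy above would be started with four workers (the file name envoy.yaml is an assumption):

./envoy -c envoy.yaml --concurrency 4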

HAProxy Config:

global
  daemon
  maxconn 10000
  nbproc 4
defaults
  mode http
  timeout connect 5000ms
  timeout client 50000ms
  timeout server 50000ms
  http-reuse aggressive
frontend test
  bind *:8080
  acl test1 path_beg /
  use_backend test_backend if test1
backend test_backend
  server server1 172.16.x.x:8000

Nginx Config

worker_processes 4;

events {
    worker_connections 1000;
}

http {
    server {
        listen 8000 default_server;

        server_name test.com;
        access_log /var/log/nginx/test.access.log;
        error_log  /var/log/nginx/test.error.log;

        location / {
            return 200 'Woohoo!';
        }
    }
}

Benchmark Results

Envoy $ wrk -c100 -d60s -t10 "http://172.16.x.x:8090/" --latency

Running 1m test @ http://172.16.x.x:8090/
  10 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.73ms    5.94ms 413.43ms   99.73%
    Req/Sec     4.00k     1.43k    9.36k    63.22%
  Latency Distribution
     50%    2.16ms
     75%    3.56ms
     90%    4.50ms
     99%    7.10ms
  2388909 requests in 1.00m, 389.58MB read
Requests/sec:  39748.81
Transfer/sec:      6.48MB

HAProxy $ wrk -c100 -d60s -t10 "http://172.16.x.x:8080/" --latency

Running 1m test @ http://172.16.x.x:8080/
  10 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.14ms  526.36us  31.04ms   89.20%
    Req/Sec     8.89k     1.79k   14.23k    63.36%
  Latency Distribution
     50%    1.05ms
     75%    1.32ms
     90%    1.63ms
     99%    2.20ms
  5315577 requests in 1.00m, 729.98MB read
Requests/sec:  88446.34
Transfer/sec:     12.15MB

Note:

Could you please point out where I am going wrong? (According to various online blogs, Envoy seems to provide performance equivalent to HAProxy.)

brian-pane commented 5 years ago

I just ran a quick test and got similar results. I ran a CPU profiler on Envoy during the test, and it highlighted part of the reason for the performance difference: wrk uses HTTP/1.1, and Envoy uses a relatively CPU-intensive HTTP/1 parser. HTTP/1 parser performance is a longstanding issue; #2880 has some context.

bedis commented 5 years ago

Which HAProxy version? The HAProxy config is not even optimal :)

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

brian-pane commented 5 years ago

@mattklein123 what do you think about tagging this one as “help wanted”? There definitely is performance work that would be worthwhile to do. I probably won’t have much time to contribute PRs over the next couple of months, but I’m happy to share profiling data and optimization ideas with anybody who’s interested in working on it.

mattklein123 commented 5 years ago

@brian-pane sure will do.

jobs-git commented 5 years ago

Why not use picohttpparser? It is the fastest HTTP parser so far and can parse about 5 million requests per second.

brian-pane commented 5 years ago

For anybody who's interested in doing some optimization work, here's a CPU profile of the latest Envoy trunk proxying small HTTP/1.1 requests and responses on Linux 4.15:

# Overhead  Command       Shared Object       Symbol
# ........  ............  ..................  ...........................................................
#
     1.85%  envoy-static  envoy-static        [.] Envoy::Http::HeaderMapImpl::getExistingInline(char const*)
     1.79%  envoy-static  envoy-static        [.] http_parser_execute
     1.06%  envoy-static  [kernel.kallsyms]   [k] do_syscall_64
     0.56%  envoy-static  [kernel.kallsyms]   [k] __kmalloc_node_track_caller
     0.56%  envoy-static  [kernel.kallsyms]   [k] tcp_rcv_established
     0.53%  envoy-static  [kernel.kallsyms]   [k] __inet_lookup_established
     0.53%  envoy-static  [kernel.kallsyms]   [k] tcp_ack
     0.53%  envoy-static  [kernel.kallsyms]   [k] tcp_data_queue
     0.50%  envoy-static  [kernel.kallsyms]   [k] ip_finish_output2
     0.46%  envoy-static  [kernel.kallsyms]   [k] copy_user_generic_string
     0.43%  envoy-static  envoy-static        [.] Envoy::ThreadLocal::InstanceImpl::SlotImpl::get()
     0.40%  envoy-static  [kernel.kallsyms]   [k] __cgroup_bpf_run_filter_skb
     0.40%  envoy-static  [kernel.kallsyms]   [k] kmem_cache_alloc_node
     0.40%  envoy-static  [kernel.kallsyms]   [k] tcp_transmit_skb
     0.40%  envoy-static  [kernel.kallsyms]   [k] tcp_write_xmit
     0.40%  envoy-static  envoy-static        [.] Envoy::Router::Filter::onUpstreamHeaders(unsigned long, std::unique_ptr<Envoy::Http::HeaderMap, std::default_delete<Envoy::Http::HeaderMap> >&&, bool)
     0.36%  envoy-static  libpthread-2.27.so  [.] __pthread_mutex_lock
     0.33%  envoy-static  [kernel.kallsyms]   [k] __dev_queue_xmit
     0.33%  envoy-static  [kernel.kallsyms]   [k] __netif_receive_skb_core
     0.33%  envoy-static  [kernel.kallsyms]   [k] kmem_cache_free
     0.33%  envoy-static  [kernel.kallsyms]   [k] tcp_v4_rcv
     0.30%  envoy-static  [kernel.kallsyms]   [k] __softirqentry_text_start
     0.30%  envoy-static  [kernel.kallsyms]   [k] tcp_sendmsg_locked
     0.30%  envoy-static  envoy-static        [.] Envoy::Event::DispatcherImpl::timeSource()
     0.30%  envoy-static  envoy-static        [.] std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()
     0.30%  envoy-static  envoy-static        [.] event_process_active_single_queue.isra.31
     0.30%  envoy-static  envoy-static        [.] tc_newarray
     0.26%  envoy-static  [kernel.kallsyms]   [k] sock_poll
     0.26%  envoy-static  [kernel.kallsyms]   [k] tcp_clean_rtx_queue
     0.26%  envoy-static  envoy-static        [.] Envoy::Stats::CounterImpl<Envoy::Stats::RawStatData>::inc()
     0.26%  envoy-static  envoy-static        [.] event_add_nolock_
     0.26%  envoy-static  envoy-static        [.] tc_deletearray
     0.26%  envoy-static  libpthread-2.27.so  [.] __pthread_mutex_unlock
     0.23%  envoy-static  [kernel.kallsyms]   [k] __indirect_thunk_start
     0.23%  envoy-static  [kernel.kallsyms]   [k] eventfd_write
     0.23%  envoy-static  [kernel.kallsyms]   [k] ip_rcv
     0.23%  envoy-static  [kernel.kallsyms]   [k] ipv4_mtu
     0.23%  envoy-static  [kernel.kallsyms]   [k] process_backlog
     0.23%  envoy-static  envoy-static        [.] Envoy::Http::HeaderMapImpl::maybeCreateInline(Envoy::Http::HeaderMapImpl::HeaderEntryImpl**, Envoy::Http::LowerCaseString const&) [clone .part.348]
     0.23%  envoy-static  envoy-static        [.] Envoy::Router::Filter::UpstreamRequest::decodeHeaders(std::unique_ptr<Envoy::Http::HeaderMap, std::default_delete<Envoy::Http::HeaderMap> >&&, bool)
     0.23%  envoy-static  envoy-static        [.] event_del
     0.23%  envoy-static  libc-2.27.so        [.] __GI___readv
     0.23%  envoy-static  libc-2.27.so        [.] __GI___writev
     0.23%  envoy-static  libc-2.27.so        [.] __memmove_sse2_unaligned_erms
     0.23%  envoy-static  libc-2.27.so        [.] epoll_ctl
     0.20%  envoy-static  [kernel.kallsyms]   [k] do_iter_readv_writev
     0.20%  envoy-static  [kernel.kallsyms]   [k] do_iter_write
     0.20%  envoy-static  [kernel.kallsyms]   [k] enqueue_to_backlog
     0.20%  envoy-static  [kernel.kallsyms]   [k] eth_type_trans
     0.20%  envoy-static  [kernel.kallsyms]   [k] kfree
     0.20%  envoy-static  [kernel.kallsyms]   [k] tcp_send_ack
     0.20%  envoy-static  [vdso]              [.] __vdso_clock_gettime
     0.20%  envoy-static  envoy-static        [.] Envoy::Http::HeaderMapImpl::maybeCreateInline(Envoy::Http::HeaderMapImpl::HeaderEntryImpl**, Envoy::Http::LowerCaseString const&, Envoy::Http::HeaderString&&)
     0.20%  envoy-static  envoy-static        [.] Envoy::Network::IoSocketHandleImpl::writev(Envoy::Buffer::RawSlice const*, unsigned long)
     0.20%  envoy-static  envoy-static        [.] evbuffer_drain
     0.17%  envoy-static  [kernel.kallsyms]   [k] __fget_light
     0.17%  envoy-static  [kernel.kallsyms]   [k] __local_bh_enable_ip
     0.17%  envoy-static  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
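
For anyone who wants to reproduce a profile like this, here is a hedged sketch of a typical perf invocation; the exact flags, sampling duration, and the placeholder <envoy-pid> are assumptions rather than what was actually used:

# sample the running Envoy process for 60 seconds, with call graphs
perf record -g -p <envoy-pid> -- sleep 60
# print the flat profile, similar to the output above
perf report --stdio
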
huang195 commented 5 years ago

I ran into some similar performance issues with Envoy recently, using fortio to measure latency, but in my case I was comparing Envoy against itself under different settings, specifically the number of parallel clients.

First, I tested Envoy with only 1 client at 1000 requests per second:

# fortio load -qps 1000 -c 1 -t 360s 10.97.103.115:8080
min 0.00046536 max 0.015038684
0.685 ms avg
# target 50% 0.000737631
# target 75% 0.000873767
# target 90% 0.000955449
# target 99% 0.00168218
# target 99.9% 0.00498846

which is pretty comparable to native performance. I then increased the number of clients to 100 but kept the requests per second the same, i.e., each of the 100 clients would be doing 10 qps. Here is the result:

# fortio load -qps 1000 -c 100 -t 360s 10.97.103.115:8080
min 0.000572737 max 0.040983921
5.851 ms avg
# target 50% 0.00571237
# target 75% 0.00734058
# target 90% 0.00880032
# target 99% 0.0116944
# target 99.9% 0.017128

It's a pretty significant difference in both average latency and long-tail latency. I used perf and flamegraph to look at how things changed from 1 client to 100 clients. The only thing I could spot was noticeably more HTTP header processing time in the 100-client case. I saw a couple of other open GitHub issues about http_parser being inefficient, but in my case the same number of packets is going through Envoy. Would HTTP header processing overhead increase with the number of parallel clients? I would think it's more related to the number of requests. I was also using 4 Envoy worker threads. Is there some caching/TLB flush issue here that's causing this behavior?

@mattklein123 @brian-pane any inkling where I should be poking around here?

mattklein123 commented 5 years ago

@mattklein123 @brian-pane any inkling where I should be poking around here?

Sorry, nothing immediately comes to mind. One of the first things I would probably look at is whether Fortio is itself introducing this latency. If that's not the case, it would need some more detailed investigation.

huang195 commented 5 years ago

So without Envoy in the middle, fortio running as regular Linux processes between 2 of my VMs looks like this:

# fortio load -qps 1000 -c 100 -t 60s 9.59.150.106:8080
# target 50% 0.00184155
# target 75% 0.00264706
# target 90% 0.00347064
# target 99% 0.00526316
# target 99.9% 0.00919149
 2.025 ms avg

With a freshly compiled Envoy sitting in front of the fortio server, the same experiment looks like this:

fortio load -qps 1000 -c 100 -t 60s 9.59.150.106:8088
# target 50% 0.00685397
# target 75% 0.00931644
# target 90% 0.0110211
# target 99% 0.0159744
# target 99.9% 0.0296667
7.030 ms avg

There's a pretty significant increase in both average and long-tail latency after putting Envoy in there. I started Envoy with:

./envoy -c envoy.json --concurrency 4

where envoy.json looks like:

{"static_resources": {
  "listeners": [
    {
      "address": { "socket_address": { "address": "0.0.0.0", "port_value": 8088 } },
      "filter_chains": {
        "filters": [
          {
            "name": "envoy.http_connection_manager",
            "config": {
              "stat_prefix": "ingress_http",
              "http_filters": [{
                "name": "envoy.router",
                "config": {}
              }],
              "route_config": {
                "virtual_hosts": [
                  {
                    "name": "fortio",
                    "domains": "*",
                    "routes": [
                      {
                        "route": {
                          "cluster": "fortio"
                        },
                        "match": {
                          "prefix": "/"
    }}]}]}}}]}}],
  "clusters":[
    {
      "name": "fortio",
      "type": "STRICT_DNS",
      "connect_timeout": "1s",
      "hosts": [
        { "socket_address": { "address": "127.0.0.1", "port_value": 8080 } }
      ]
    }
  ]
}}

This is almost like a hello-world example of using Envoy, so I'm a bit surprised that the latency number went up by 3x. I just started looking at Envoy, so if there's anything obviously wrong with my setup, please let me know.

htuch commented 5 years ago

@oschaaf any thoughts on this thread? Could we use Nighthawk?

huang195 commented 5 years ago

FWIW, here is the result for the same setup, but switching out Envoy for HAProxy (using the HAProxy config from the original post):

fortio load -qps 1000 -c 100 -t 60s 9.59.150.106:8088
# target 50% 0.00384799
# target 75% 0.00521333
# target 90% 0.0063642
# target 99% 0.00954595
# target 99.9% 0.0129949
3.970 ms avg

oschaaf commented 5 years ago

@htuch interesting, thanks for looping me in. I'll try to set up a repro and analyse it. It will be a nice chance to give Nighthawk a spin.

oschaaf commented 5 years ago

I ran an experiment on a low-latency-tuned system to compare average latencies across wrk2, Fortio, and Nighthawk, when running them directly against nginx serving a static file vs. through Envoy and HAProxy [1]. There's an order-of-magnitude difference between the latencies reported by the different benchmark clients. I'm not done yet, but I wanted to post an update here. I do think it will be interesting to figure out why wrk2 via Envoy reports higher latencies than wrk2 via HAProxy, but it's worth noting that Nighthawk logs some interesting messages about premature responses when going through HAProxy.

[1] https://gist.github.com/oschaaf/a04a1d4a9fa3b571ad80ba2449761616#file-mean-summary-md
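
For reference, a hedged sketch of the kind of Nighthawk invocation used for a comparison like this; the flag values and target address are assumptions, not taken from the gist:

nighthawk_client --rps 1000 --connections 100 --duration 60 http://127.0.0.1:8088/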

mandarjog commented 5 years ago

Can you check how many tcp connections are being opened from the proxy to upstream in all 3 cases? I have measured this type of behavior in istio tests for a while. Scaling the upstream cluster brings the latency back down.

huang195 commented 5 years ago

@oschaaf Looks like your fortio result using envoy and haproxy is pretty consistent with the ones I've posted earlier. So it's great this problem can be reproduced. Looking forward to seeing something that can be fixed to bring envoy's latency number down.

oschaaf commented 5 years ago

So after thinking about it, my 2p is that sanity-checking the numbers is warranted. The test server I used (nginx) will respond in roughly 50-200 microseconds on the test machine. If we want to say something about latencies of that order of magnitude, then preferably we should be able to observe them at at least 10X higher resolution (5-20 microseconds).

In the tests I ran, the mean latencies are reported by the tools as follows:

            Direct         Via Envoy      Via HAProxy
Nighthawk   0.090076 ms    0.198846 ms    0.14126 ms
wrk2        0.78625 ms     3.33 ms        1.39 ms
Fortio      1.969 ms       5.971 ms       2.086 ms

Wildly different numbers are reported. Are any of the tools lying? Are we sure we are comparing apples to apples?

Fortio and wrk2 report numbers that seem pretty far away from the truth in terms of absolute latency [1]. Why could that be?

We test @ 1000 rps. Given ideal request-release timings, there ought to be no overlap between request/response pairs. That's nice, because queuing/batching effects could complicate our reasoning and skew our measurements. The problem is that both fortio and wrk2 rely on sleep for release timings, which isn't going to be super accurate. It is worth noting that wrk2 will sleep at a 1 ms resolution, exactly on par with our test frequency. Also, the CPU hosting the client probably is far from saturated. This allows power-state management to lower the frequency, and other processes/threads may get scheduled across it. All these things can add significant noise to the experiment (except for Nighthawk, which performs a spin/yield loop when idle to avoid CPU frequency scaling).

So given even and precise pacing of request-release timings, it looks like HAProxy is able to achieve a slightly lower latency. Perhaps because of the aggressive keep-alive setting used (wild guess).

But no, I don't feel we can confidently conclude Envoy is adding multiple ms to the mean in that scenario.

Furthermore, I do think it is worth adding an option to intentionally release batched sets of requests to Nighthawk, so we can make a differential diagnosis between:

[1] Lastly, I think it's worth sharing the earlier analysis of wrk and Fortio we did.

huang195 commented 5 years ago

This is super interesting. Thanks for the detailed analysis comparing different latency measurement tools. It seems wrk and fortio generate different absolute numbers, but their results are on the same order of magnitude, i.e. a few ms. Nighthawk on the other hand produces numbers that are an order of magnitude smaller.

In my environment, ping between the 2 VMs shows about 0.5ms round-trip latency, so when I got a fortio result of 2ms average latency, it made some sense to me. Then, putting HAProxy or Envoy between the 2 VMs, the round-trip number increased further to 4-7ms. It's not ideal, but it still made sense since we are adding software on the critical path.

What is the ping time in your testbed setup? Nighthawk showing 0.09ms average latency is bizarrely small. On my relatively idle VM, ping on the loopback device shows a round-trip time of 0.05ms. Are your VMs provisioned on the same host machine, so that the network packets don't leave the host NIC? I might have missed your testbed setup if it was mentioned somewhere.

huang195 commented 5 years ago

OK, never mind the question about the testbed setup. You're indeed using the loopback device in your experiment (https://gist.github.com/oschaaf/a04a1d4a9fa3b571ad80ba2449761616#file-mean-summary-md).

oschaaf commented 5 years ago

I ran the test on a z620 workstation, directly on the hardware, with the client and server running on two physically different CPUs, using taskset to pin each to a core. The box is also tuned for low latency. I think the difference between wrk2 and Fortio is also remarkable, about 2X...
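
For context, a hedged sketch of the kind of pinning and low-latency tuning being described; the governor command, core numbers, and binary names are placeholders/assumptions, not taken from this comment:

# keep the CPUs at a fixed high frequency to avoid frequency-scaling noise
sudo cpupower frequency-set --governor performance
# pin the benchmark client and the test server to different physical cores
taskset -c 2 ./benchmark_client ...
taskset -c 4 ./test_server ...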

Ping times:

➜ ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.044 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.015 ms
64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.012 ms
...

huang195 commented 5 years ago

@oschaaf In fortio, when -c is specified, you can create a given number of parallel clients to generate the load. In the experiments above, with -qps=1000 and -c=100, every 100ms, 100 requests will be sent to the Envoy worker thread at almost the same time. Since you specified the number of Envoy worker threads to be 1, only 1 request will be processed at a time, while the rest wait to be picked up by the event loop. This means the requests picked up at the beginning will experience much lower latency than those toward the end. This "thundering herd" problem might not be representative of a realistic workload, where 100 clients rarely collude to send their requests at exactly the same time. This also explains the behavior I posted above when keeping -qps=1000 but changing -c=1 to -c=100: long-tail latency became much worse. Does this make sense? Does Nighthawk allow some sort of "de-synchronization" of client requests so they don't send these request bursts?

oschaaf commented 5 years ago

@huang195 yeah, above I ended with "I do think it is worth adding an option to intentionally release batched sets of requests to..." for just that reason. So it is on the roadmap, but right now we only have LinearRateLimiter in Nighthawk. It shouldn't be that hard to add BatchedLinearRateLimiter :-)
Does fortio intentionally batch and synchronise requests so they are released at the same time? I don't remember seeing that in the sources, but then again I may have overlooked that part... In any case, in the evaluation doc for fortio vs wrk2 we had something like this as a recommendation:

 fortio: Take measures to better spread the generated load in time. Currently requests are sent in batches as timers expire for each connection/goroutine. Adding some random jitter to the pacing per goroutine might help. Coordinating timing explicitly may help too, but would probably be hard to add in an efficient way.

As fortio will have to rely on the Go scheduler for fairness, and because of garbage-collection stalls, I think it's going to be (very) hard to do reliable and precise request-release timing.

huang195 commented 5 years ago

@oschaaf Sorry, I must have missed that comment. I ended up doing more experiments with fortio, adding some jitter between requests so we don't get these mini-bursts. With everything else kept the same, the latency results look much better.

Related to this topic, in the context of Istio, it adds additional filters to Envoy, e.g., to report telemetry and tracing data. Most of this additional filter work is executed off the critical path, so in theory it shouldn't have a big impact on end-to-end latency. But it does. The reason is that every Envoy worker thread is single-threaded and runs an event loop. While the thread is in the middle of preparing a batch report, serializing and compressing it before putting it onto the wire, one or more requests could have arrived. The low-priority work of preparing the batch report does not get interrupted by the higher-priority work of handling requests, which can cause an increase in latency. First of all, is this a correct understanding of how things work in an Envoy worker thread? If so, is there any work in the pipeline that would address this issue?

oschaaf commented 5 years ago

@huang195 In the meantime, I've drafted an approach for Nighthawk that allows releasing requests in batches: https://github.com/envoyproxy/nighthawk/pull/14. I'll run some experiments with that tomorrow as well.

htuch commented 5 years ago

@huang195 yeah, basically, that's the right model. It depends on whether the work is being done on the main thread or a worker thread. But doing a dispatcher post on worker thread A back to A's dispatcher for later processing is just kicking the can down the road and will clearly add latency to any performance test.

I know @mandarjog is an expert on Istio performance issues, he can probably chime in as well.

iandyh commented 5 years ago

For those who are interested in the performance of Envoy and Istio's Envoy, I've made a flamechart: https://github.com/iandyh/istio-enovy-perf/blob/master/1.1.1.svg
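
For anyone wanting to produce a similar flamechart, a hedged sketch using perf plus Brendan Gregg's FlameGraph scripts; the sampling frequency, duration, and the placeholder <envoy-pid> are assumptions:

perf record -F 99 -g -p <envoy-pid> -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > envoy-flamegraph.svg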

htuch commented 5 years ago

@iandyh that's really useful. Looks like ~30% of CPU cycles are being burnt on tracing/Mixer filter. This seems to match up roughly to the difference between Envoy and HAProxy based on the initial Nighthawk latency measurement.

huang195 commented 5 years ago

@iandyh thanks for sharing that. I saw something very similar as well - a significant amount of CPU being spent on telemetry, tracing, and HTTP parsing.

There are already a few open issues about the performance problems of the http_parser library that's being used, so hopefully that can be addressed. I'm more interested in all the additional stuff that the Istio filters do off the critical path. Since each Envoy worker thread is single-threaded, the additional work it does running the Istio filters could actually block user requests that are waiting in the event queue. I'm wondering if anyone has looked into breaking worker threads into top and bottom halves, so that higher-priority work doesn't get blocked by lower-priority work and, if needed, can preempt it.

To address latency, the current workaround is to add more Envoy worker threads to reduce the effect of queuing on user-perceived latency, but more worker threads also mean more resource usage. I'm wondering if there's a middle ground that could reduce latency by processing queued requests as quickly as possible while pushing telemetry/tracing and other non-essential work to when all the critical-path work is done.

mattklein123 commented 5 years ago

There are already a few open issues about the performance problems of the http_parser library that's being used, so hopefully that can be addressed.

To set expectations, any perf gains there are going to be minimal. There is no low hanging fruit. Also, HTTP/1 is becoming increasingly irrelevant so I'm not sure how much it matters to most real world workloads.

I'm more interested in all the additional stuff that the Istio filters do off the critical path.

Istio has well known performance problems that they are working to address with Mixer v2. I would recommend following up with them as this is really outside the scope of the core Envoy project.

I'm wondering if there's a middle ground that could reduce latency by processing queued requests as quickly as possible while pushing telemetry/tracing and other non-essential work to when all the critical-path work is done.

This is possible, but as @htuch already mentioned, it's unclear to me that the extra complexity is going to yield substantial benefit since this is really just "kicking the can down the road." Fundamentally, the features in use just need to consume fewer CPU cycles. If you feel like prototyping some changes, please feel free, but I think the performance gains would have to be substantial to offset the increased complexity.

huang195 commented 5 years ago

@mattklein123 thanks for your response. Regarding filters: to someone who's new to Envoy, it might not be immediately obvious that filter code running off the critical path can, under certain conditions, have a significant impact on requests on the critical path. libevent supports multiple priorities, but those are not used in Envoy today; perhaps that's worth exploring to alleviate this problem. These are just my initial thoughts, and as you and @htuch already mentioned, the gain might not offset the additional complexity. Anyway, I just wanted to gauge whether there were earlier thoughts in this area that weren't explored further for whatever reason.

htuch commented 5 years ago

We could certainly consider threads or events with priorities, but none of that would help if you're measuring performance at close to serving capacity, since there just won't be enough CPU to go around and you either indefinitely delay the stats or drop them or pay the latency cost somewhere.

When operating with plenty of spare CPU capacity, I agree, this approach would yield some benefit for filters that do low priority deferred actions, but I think Mixer is one of the few we've seen in the wild that do this.

huang195 commented 5 years ago

@htuch yep, agree. What I've seen sometimes is benchmarks themselves being the cause of excessive queuing delays. This was described earlier with fortio's parallel clients: there is plenty of spare CPU, but with 4 worker threads handling 100 parallel connections, queued requests can experience long-tail latency because the worker threads, instead of marshaling packets back and forth, spend more time massaging/compressing/sending telemetry and tracing data, uninterrupted. Coupled with the fact that the OS sometimes doesn't hand out work equally among the 4 worker threads, some threads can get more work than others, further exacerbating the latency issue. Real-world workloads might not have fortio's problem of parallel clients colluding to send their requests in sync, but that's not controllable by us, and sometimes these bursts do happen.
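
(As a side note, a hedged sketch of one way to check whether work is being spread evenly across the worker threads; the command and the placeholder <envoy-pid> are assumptions, not something used in this thread:)

# show per-thread CPU usage for the running Envoy process
top -H -p <envoy-pid>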

If the Mixer filter could be optimized, that would alleviate the problem. But the fact that the core eventing system allows lower-priority work to continue while there is higher-priority work to do is an area that could be improved. Of course, this only makes sense when there is still spare CPU capacity.

codervinod commented 5 years ago

My investigation has shown that performance degrades further if access_log is turned on. I see about a 40% degradation with the wrk benchmark. Is anyone else noticing significant degradation with access_log turned on?

htuch commented 5 years ago

@codervinod is this the gRPC access log or the filesystem one? I would expect any remote access logging where you send an RPC 1:1 with requests to have this kind of impact for small requests.

mandarjog commented 5 years ago

@huang195 regarding kicking the can down the road: if there is always some work waiting to be picked up, moving path length to the log filter still adds to latency. At low RPS this is not noticeable, but it is pronounced at higher RPS.

Have you done any experiments with the gRPC access log service? The current implementation does not batch; it sends a message for every request. We are working on another access log extension for Stackdriver that does batch, but is more selective in what it sends.

Mixer client is an access log adapter that is batched but sends absolutely everything with every request.

I think ultimately some balance of compression, batching and selectiveness will win out.

huang195 commented 5 years ago

@mandarjog I haven't done any experiments with the gRPC access log service yet. From what I can gather, there are some implicit constraints on running an Envoy filter: i) the number of worker threads is fixed, ii) once a worker thread "accepts" a new connection, it cannot hand it off to another thread even if it is overloaded and the other is idling, and iii) each worker is single-threaded.

So to minimize performance variance, an Envoy filter should not have vastly different code execution paths. In Mixer client today, the default behavior is that 99 out of 100 requests will go through the same code execution path, but the 100th request will go down a much longer path, i.e., protobuf serialization, compression, and network I/O. This is one cause of the long-tail latency problem. But variance can be caused by other things as well, such as load, multitenancy, randomness in CPU scheduling, etc.

The mental model I have of an Envoy filter is that it has a critical path and a non-critical path. Policy/quota checking and request processing/forwarding are on the critical path; telemetry, tracing, and other work that happens afterwards are on the non-critical path. Due to Envoy's single-threaded workers, the non-critical path can sometimes impact the critical path. Code on the critical path needs to run regardless. Code on the non-critical path, on the other hand, should have control points that, in the most extreme case, could be short-circuited completely, freeing the worker thread to work on something more pressing. Envoy internally keeps a lot of runtime stats, which could be used as triggers for these control points.

cstrahan commented 5 years ago

In the scenario where latency is due to a 1:1 RPC hogging CPU time, and assuming that the RPC server is on the same machine, this might be worth considering: instead of using protobuf/gRPC, use a zero-copy, (de)serialization-free message format (like Cap'n Proto) and pass messages via file descriptor over a Unix domain socket (the memory for the message(s) would be allocated via shm or memfd_create, and the message format allows reading and writing directly to this mapped memory, making this possible). That at least skips a serialize and deserialize step, as well as the copy out of Envoy and into the RPC process. Of course, the RPC server still needs to be careful about not burning too many CPU cycles doing whatever it's going to do with, e.g., the logs, but at least it takes that concern off of Envoy's shoulders.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.

mandarjog commented 3 years ago

@huang195 you had an experiment that showed much better latency when you changed Linux scheduler settings; do you have a link to the documentation? cc @iandyh

Neil-Boyle commented 2 years ago

Same problem: up to 60% CPU in a stress test.

Is there any new solution?