I just ran a quick test and got similar results. I ran a CPU profiler on Envoy during the test, and it highlighted part of the reason for the performance difference: wrk uses HTTP/1.1, and Envoy uses a relatively CPU-intensive HTTP/1 parser. HTTP/1 parser performance is a longstanding issue; #2880 has some context.
What haproxy version? Haproxy config is not even optimal :)
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
@mattklein123 what do you think about tagging this one as “help wanted”? There definitely is performance work that would be worthwhile to do. I probably won’t have much time to contribute PRs over the next couple of months, but I’m happy to share profiling data and optimization ideas with anybody who’s interested in working on it.
@brian-pane sure will do.
Why not use picohttpparser? It is the fastest HTTP parser so far; it can parse about 5 million requests per second.
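For reference, here is a minimal sketch of what driving picohttpparser looks like, based on the C API published in the h2o/picohttpparser repository (treat the exact signature as an assumption; whether it could be fitted under Envoy's codec abstraction is a separate question):

```cpp
// Minimal sketch of parsing one HTTP/1.1 request with picohttpparser.
// Assumes the header from https://github.com/h2o/picohttpparser.
#include <cstdio>
#include <cstring>
#include "picohttpparser.h"

int main() {
  const char* buf =
      "GET /hello HTTP/1.1\r\nHost: example.com\r\nUser-Agent: test\r\n\r\n";
  const char *method, *path;
  size_t method_len, path_len, num_headers = 16;
  int minor_version;
  struct phr_header headers[16];

  // last_len is 0 because the whole request is parsed in one shot; for
  // incremental parsing it would be the previously buffered length.
  int ret = phr_parse_request(buf, std::strlen(buf), &method, &method_len,
                              &path, &path_len, &minor_version, headers,
                              &num_headers, /*last_len=*/0);
  if (ret > 0) {
    std::printf("parsed %d bytes, %.*s %.*s, %zu headers\n", ret,
                (int)method_len, method, (int)path_len, path, num_headers);
  }
  return 0;
}
```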
For anybody who's interested in doing some optimization work, here's a CPU profile of the latest Envoy trunk proxying small HTTP/1.1 requests and responses on Linux 4.15:
# Overhead  Command       Shared Object       Symbol
# ........  ............  ..................  ......................................................................
#
1.85% envoy-static envoy-static [.] Envoy::Http::HeaderMapImpl::getExistingInline(char const*)
1.79% envoy-static envoy-static [.] http_parser_execute
1.06% envoy-static [kernel.kallsyms] [k] do_syscall_64
0.56% envoy-static [kernel.kallsyms] [k] __kmalloc_node_track_caller
0.56% envoy-static [kernel.kallsyms] [k] tcp_rcv_established
0.53% envoy-static [kernel.kallsyms] [k] __inet_lookup_established
0.53% envoy-static [kernel.kallsyms] [k] tcp_ack
0.53% envoy-static [kernel.kallsyms] [k] tcp_data_queue
0.50% envoy-static [kernel.kallsyms] [k] ip_finish_output2
0.46% envoy-static [kernel.kallsyms] [k] copy_user_generic_string
0.43% envoy-static envoy-static [.] Envoy::ThreadLocal::InstanceImpl::SlotImpl::get()
0.40% envoy-static [kernel.kallsyms] [k] __cgroup_bpf_run_filter_skb
0.40% envoy-static [kernel.kallsyms] [k] kmem_cache_alloc_node
0.40% envoy-static [kernel.kallsyms] [k] tcp_transmit_skb
0.40% envoy-static [kernel.kallsyms] [k] tcp_write_xmit
0.40% envoy-static envoy-static [.] Envoy::Router::Filter::onUpstreamHeaders(unsigned long, std::unique_ptr<Envoy::Http::HeaderMap, std::default_delete<Envoy::Http::HeaderMap> >&&, bool)
0.36% envoy-static libpthread-2.27.so [.] __pthread_mutex_lock
0.33% envoy-static [kernel.kallsyms] [k] __dev_queue_xmit
0.33% envoy-static [kernel.kallsyms] [k] __netif_receive_skb_core
0.33% envoy-static [kernel.kallsyms] [k] kmem_cache_free
0.33% envoy-static [kernel.kallsyms] [k] tcp_v4_rcv
0.30% envoy-static [kernel.kallsyms] [k] __softirqentry_text_start
0.30% envoy-static [kernel.kallsyms] [k] tcp_sendmsg_locked
0.30% envoy-static envoy-static [.] Envoy::Event::DispatcherImpl::timeSource()
0.30% envoy-static envoy-static [.] std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()
0.30% envoy-static envoy-static [.] event_process_active_single_queue.isra.31
0.30% envoy-static envoy-static [.] tc_newarray
0.26% envoy-static [kernel.kallsyms] [k] sock_poll
0.26% envoy-static [kernel.kallsyms] [k] tcp_clean_rtx_queue
0.26% envoy-static envoy-static [.] Envoy::Stats::CounterImpl<Envoy::Stats::RawStatData>::inc()
0.26% envoy-static envoy-static [.] event_add_nolock_
0.26% envoy-static envoy-static [.] tc_deletearray
0.26% envoy-static libpthread-2.27.so [.] __pthread_mutex_unlock
0.23% envoy-static [kernel.kallsyms] [k] __indirect_thunk_start
0.23% envoy-static [kernel.kallsyms] [k] eventfd_write
0.23% envoy-static [kernel.kallsyms] [k] ip_rcv
0.23% envoy-static [kernel.kallsyms] [k] ipv4_mtu
0.23% envoy-static [kernel.kallsyms] [k] process_backlog
0.23% envoy-static envoy-static [.] Envoy::Http::HeaderMapImpl::maybeCreateInline(Envoy::Http::HeaderMapImpl::HeaderEntryImpl**, Envoy::Http::LowerCaseString const&) [clone .part.348]
0.23% envoy-static envoy-static [.] Envoy::Router::Filter::UpstreamRequest::decodeHeaders(std::unique_ptr<Envoy::Http::HeaderMap, std::default_delete<Envoy::Http::HeaderMap> >&&, bool)
0.23% envoy-static envoy-static [.] event_del
0.23% envoy-static libc-2.27.so [.] __GI___readv
0.23% envoy-static libc-2.27.so [.] __GI___writev
0.23% envoy-static libc-2.27.so [.] __memmove_sse2_unaligned_erms
0.23% envoy-static libc-2.27.so [.] epoll_ctl
0.20% envoy-static [kernel.kallsyms] [k] do_iter_readv_writev
0.20% envoy-static [kernel.kallsyms] [k] do_iter_write
0.20% envoy-static [kernel.kallsyms] [k] enqueue_to_backlog
0.20% envoy-static [kernel.kallsyms] [k] eth_type_trans
0.20% envoy-static [kernel.kallsyms] [k] kfree
0.20% envoy-static [kernel.kallsyms] [k] tcp_send_ack
0.20% envoy-static [vdso] [.] __vdso_clock_gettime
0.20% envoy-static envoy-static [.] Envoy::Http::HeaderMapImpl::maybeCreateInline(Envoy::Http::HeaderMapImpl::HeaderEntryImpl**, Envoy::Http::LowerCaseString const&, Envoy::Http::HeaderString&&)
0.20% envoy-static envoy-static [.] Envoy::Network::IoSocketHandleImpl::writev(Envoy::Buffer::RawSlice const*, unsigned long)
0.20% envoy-static envoy-static [.] evbuffer_drain
0.17% envoy-static [kernel.kallsyms] [k] __fget_light
0.17% envoy-static [kernel.kallsyms] [k] __local_bh_enable_ip
0.17% envoy-static [kernel.kallsyms] [k] _raw_spin_lock_irqsave
I ran into some similar performance issues with Envoy recently, using fortio to measure latency, but comparing Envoy against itself under different settings, specifically varying the number of parallel clients.
First, I tested Envoy using only 1 client at 1000 requests per second:
# fortio load -qps 1000 -c 1 -t 360s 10.97.103.115:8080
min 0.00046536 max 0.015038684
0.685 ms avg
# target 50% 0.000737631
# target 75% 0.000873767
# target 90% 0.000955449
# target 99% 0.00168218
# target 99.9% 0.00498846
which is pretty comparable to native performance. But then I increased the number of clients to 100 while keeping the requests per second the same, i.e., each of the 100 clients would be doing 10 qps. Here is the result:
# fortio load -qps 1000 -c 100 -t 360s 10.97.103.115:8080
min 0.000572737 max 0.040983921
5.851 ms avg
# target 50% 0.00571237
# target 75% 0.00734058
# target 90% 0.00880032
# target 99% 0.0116944
# target 99.9% 0.017128
It's a pretty significant difference in both average and long-tail latency. I used perf and flamegraph to look at how things changed from 1 client to 100 clients. The only thing I could spot was more significant HTTP header processing time in the 100-client case. I saw a couple of other open GitHub issues about http_parser being inefficient. But in my case the same number of packets goes through Envoy either way. Would HTTP header processing overhead increase with the number of parallel clients? I would think it's more related to the number of requests. I was also using 4 Envoy worker threads. Is there some caching / TLB flush issue here that's causing this behavior?
@mattklein123 @brian-pane any inkling where I should be poking around here?
Sorry, nothing immediately comes to mind. One of the first things I would probably look at is whether Fortio itself is introducing this latency. If that's not the case, it would need some more detailed investigation.
So without envoy in the middle, fortio running as regular linux processes between 2 of my VMs looks like:
# fortio load -qps 1000 -c 100 -t 60s 9.59.150.106:8080
# target 50% 0.00184155
# target 75% 0.00264706
# target 90% 0.00347064
# target 99% 0.00526316
# target 99.9% 0.00919149
2.025 ms avg
With a freshly compiled envoy sitting in front of the fortio server, the same experiment looks like this
fortio load -qps 1000 -c 100 -t 60s 9.59.150.106:8088
# target 50% 0.00685397
# target 75% 0.00931644
# target 90% 0.0110211
# target 99% 0.0159744
# target 99.9% 0.0296667
7.030 ms avg
There's a pretty significant increase in both average and long-tail latency after putting Envoy in there. I started Envoy by
./envoy -c envoy.json --concurrency 4
where envoy.json looks like:
{"static_resources": {
"listeners": [
{
"address": { "socket_address": { "address": "0.0.0.0", "port_value": 8088 } },
"filter_chains": {
"filters": [
{
"name": "envoy.http_connection_manager",
"config": {
"stat_prefix": "ingress_http",
"http_filters": [{
"name": "envoy.router",
"config": {}
}],
"route_config": {
"virtual_hosts": [
{
"name": "fortio",
"domains": "*",
"routes": [
{
"route": {
"cluster": "fortio"
},
"match": {
"prefix": "/"
}}]}]}}}]}}],
"clusters":[
{
"name": "fortio",
"type": "STRICT_DNS",
"connect_timeout": "1s",
"hosts": [
{ "socket_address": { "address": "127.0.0.1", "port_value": 8080 } }
]
}
]
}}
This is almost like a hello-world example of using envoy, so I'm a bit surprised that latency number went up by 3x. I just started looking at envoy, so if there's anything obviously wrong with my setup, please let me know.
@oschaaf any thoughts on this thread? Could we use Nighthawk?
FWIW, here is the result for the same setup but with haproxy switched in for envoy (using the haproxy config from the original post):
fortio load -qps 1000 -c 100 -t 60s 9.59.150.106:8088
# target 50% 0.00384799
# target 75% 0.00521333
# target 90% 0.0063642
# target 99% 0.00954595
# target 99.9% 0.0129949
3.970 ms avg
@htuch interesting, thanks for looping me in. I'll try to set up a repro and analyse. It will be a nice chance to give Nighthawk a spin.
I ran an experiment on a low-latency tuned system comparing the average latencies reported by wrk2, Fortio, and Nighthawk when running them directly against nginx serving a static file vs. going through Envoy and HAProxy [1]. There's an order-of-magnitude difference between the latencies reported by the different benchmark clients. I'm not done yet, but I wanted to post an update here. I do think it will be interesting to figure out why wrk2 via Envoy reports higher latencies compared to wrk2 via HAProxy, but it's worth noting that Nighthawk logs some interesting messages about premature responses when going through HAProxy.
[1] https://gist.github.com/oschaaf/a04a1d4a9fa3b571ad80ba2449761616#file-mean-summary-md
Can you check how many tcp connections are being opened from the proxy to upstream in all 3 cases? I have measured this type of behavior in istio tests for a while. Scaling the upstream cluster brings the latency back down.
@oschaaf Looks like your fortio result using envoy and haproxy is pretty consistent with the ones I've posted earlier. So it's great this problem can be reproduced. Looking forward to seeing something that can be fixed to bring envoy's latency number down.
So after thinking about it, my 2p is that sanity checking the numbers is warranted. The test server I used (nginx) will respond in roughly 50-200 microseconds on the test machine. If we want to say something about latencies of that order of magnitude, then preferably we should be able to observe them at a resolution at least 10X finer (5-20 microseconds).
In the tests I ran, the mean latencies are reported by the tools as follows:
| | Direct | Via Envoy | Via HAProxy |
|---|---|---|---|
| Nighthawk | 0.090076 ms | 0.198846 ms | 0.14126 ms |
| wrk2 | 0.78625 ms | 3.33 ms | 1.39 ms |
| Fortio | 1.969 ms | 5.971 ms | 2.086 ms |
Wildly different numbers are reported. Are any of the tools lying? Are we sure we are comparing apples to apples?
Fortio and wrk2 report numbers that seem pretty far away from the truth in terms of absolute latency [1]. Why could that be?
We test at 1000 rps. Given ideal request-release timings, there ought to be no overlap between request/response pairs. That's nice because queuing/batching effects could complicate our reasoning and skew our measurements. The problem is that both fortio and wrk2 rely on sleep for release timings, which isn't going to be super accurate. It is worth noting that wrk2 will sleep at a 1 ms resolution, exactly on par with our test frequency. Also, the CPU hosting the client is probably far from saturated. That allows power-state management to lower the clock frequency, and other processes/threads may get scheduled across it. All these things can add significant noise to the experiment (except for Nighthawk, which performs a spin/yield loop when idle to avoid CPU frequency scaling).
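To make the pacing point concrete, here is a minimal sketch (not the actual wrk2/Fortio/Nighthawk code) of the two release strategies: sleep-based pacing, whose accuracy is bounded by timer resolution and scheduler wake-up latency, and spin/yield pacing, which hits the release times precisely at the cost of keeping a core busy:

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

using Clock = std::chrono::steady_clock;

// Sleep-based pacing: accuracy is limited by timer resolution, scheduler
// wake-up latency, and CPU frequency scaling on an otherwise idle core.
void sleep_until_release(Clock::time_point release) {
  std::this_thread::sleep_until(release);
}

// Spin/yield pacing: keeps the core busy so the release happens within a few
// microseconds of the target, at the cost of burning CPU while "idle".
void spin_until_release(Clock::time_point release) {
  while (Clock::now() < release) {
    std::this_thread::yield();
  }
}

int main() {
  const auto interval = std::chrono::milliseconds(1);  // 1000 rps pacing
  auto next = Clock::now() + interval;
  for (int i = 0; i < 5; ++i, next += interval) {
    spin_until_release(next);  // swap in sleep_until_release() to compare
    auto error_us = std::chrono::duration_cast<std::chrono::microseconds>(
                        Clock::now() - next).count();
    std::printf("request %d released %lld us late\n", i, (long long)error_us);
  }
  return 0;
}
```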
So given even and precise pacing of request-release timings, it looks like HAProxy is able to achieve a slightly lower latency, perhaps because of the aggressive keepalive setting used (wild guess).
But no, I don't feel we can confidently conclude Envoy is adding multiple ms to the mean in that scenario.
Furthermore, I do think it is worth adding an option to Nighthawk to intentionally release batched sets of requests, so we can make a differential diagnosis between evenly paced and batched release.
[1] Lastly, I think it's worth sharing the earlier analysis of wrk and Fortio we did.
This is super interesting. Thanks for the detailed analysis comparing different latency measurement tools. It seems wrk and fortio generate different absolute numbers, but their results are on the same order of magnitude, i.e. a few ms. Nighthawk on the other hand produces numbers that are an order of magnitude smaller.
In my environment, ping between the 2 VMs shows about 0.5 ms round-trip latency, so when I got a fortio result of 2 ms average latency, it made some sense to me. And then, putting haproxy and envoy between the 2 VMs, the round-trip number further increased to 4 to 7 ms. It's not ideal, but it still made sense since we are adding software on the critical path.
What is the ping time in your testbed setup? Nighthawk showing 0.09 ms average latency is bizarrely small. On my relatively idle VM, ping on the loopback device shows a round-trip time of 0.05 ms. Are your VMs provisioned on the same host machine so the network packets don't leave the host NIC? I might have missed your testbed setup if this was mentioned somewhere.
OK, never mind the question about the testbed setup. You're indeed using loopback device in your experiment (https://gist.github.com/oschaaf/a04a1d4a9fa3b571ad80ba2449761616#file-mean-summary-md)
I ran the test on a z620 workstation, directly on the hardware, with the client and server running on two physically different CPUs, using taskset to pin them to a core. Also, the box is tuned for low latency. I think the difference between wrk2 and fortio is also remarkable, about 2X...
Ping times:
➜ ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.044 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.015 ms
64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.012 ms
...
@oschaaf In fortio, when `-c` is specified, you can create a given number of parallel clients to generate the load. In the experiments above, with `-qps=1000` and `-c=100`, every 100 ms 100 requests are sent to the Envoy worker thread at almost the same time. Since you specified the number of Envoy worker threads to be 1, only 1 request will be processed at a time, while the rest wait to be picked up by the event loop. This means the requests picked up at the beginning of the burst experience much smaller latency than those toward the end. This "thundering herd" problem might not be representative of a realistic workload where 100 clients all collude to send their requests at exactly the same time. It also explains the behavior I posted above: keeping `-qps=1000` but changing `-c=1` to `-c=100` made the long-tail latency much worse. Does this make sense? Does Nighthawk allow some sort of "de-synchronization" of client requests so they don't send these request bursts?
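As a back-of-the-envelope check on why the burst alone can explain a multi-millisecond mean: if all 100 requests of a burst land on a single worker at once and each takes roughly a fixed time to proxy, the i-th request in the queue waits about i times that service time. A tiny sketch with an assumed (not measured) 100 µs per-request service time:

```cpp
#include <cstdio>

int main() {
  const int burst = 100;            // requests arriving together (-c 100)
  const double service_us = 100.0;  // assumed per-request proxy time, in µs
  double total_wait_us = 0;
  for (int i = 0; i < burst; ++i) {
    total_wait_us += i * service_us;  // i-th request waits behind i others
  }
  // With a 100 µs service time the mean queueing delay alone is ~5 ms,
  // the same order of magnitude as the fortio-reported averages above.
  std::printf("mean queueing delay: %.2f ms\n",
              total_wait_us / burst / 1000.0);
  return 0;
}
```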
@huang195 yeah, above I ended with "I do think it is worth adding an option to intentionally release batched sets of requests to..." for just that. So it is on the roadmap, but right now we only have `LinearRateLimiter` in Nighthawk. It shouldn't be that hard to add a `BatchedLinearRateLimiter` :-)
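For illustration only, a batched limiter could be layered on top of a linear one roughly like this; the class and method names below are made up for the sketch and are not the actual Nighthawk interfaces:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical interface, loosely modeled on "may I release a request now?".
// Not the real Nighthawk RateLimiter API.
class RateLimiterSketch {
public:
  virtual ~RateLimiterSketch() = default;
  virtual bool tryAcquireOne() = 0;
};

// Stand-in for a linear limiter: grants one unit of budget per tick() call.
class FakeLinearLimiter : public RateLimiterSketch {
public:
  void tick() { ++budget_; }
  bool tryAcquireOne() override {
    if (budget_ == 0) return false;
    --budget_;
    return true;
  }
private:
  uint32_t budget_{0};
};

// Holds releases back until a full batch of budget has accumulated, then lets
// the whole batch go back-to-back instead of evenly spread in time.
class BatchedLimiterSketch : public RateLimiterSketch {
public:
  BatchedLimiterSketch(RateLimiterSketch& linear, uint32_t batch_size)
      : linear_(linear), batch_size_(batch_size) {}

  bool tryAcquireOne() override {
    // Drain budget from the linear limiter into the current batch.
    while (accumulated_ < batch_size_ && linear_.tryAcquireOne()) {
      ++accumulated_;
    }
    if (accumulated_ == batch_size_) {
      ready_ += batch_size_;  // a complete batch becomes releasable at once
      accumulated_ = 0;
    }
    if (ready_ == 0) {
      return false;  // hold everything until the batch is complete
    }
    --ready_;
    return true;
  }

private:
  RateLimiterSketch& linear_;
  const uint32_t batch_size_;
  uint32_t accumulated_{0};
  uint32_t ready_{0};
};

int main() {
  FakeLinearLimiter linear;
  BatchedLimiterSketch batched(linear, 10);
  // Simulate 30 evenly spaced budget grants: releases come out in bursts of 10.
  for (int t = 1; t <= 30; ++t) {
    linear.tick();
    int released = 0;
    while (batched.tryAcquireOne()) ++released;
    if (released > 0) std::printf("tick %d: released %d requests\n", t, released);
  }
  return 0;
}
```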
Does fortio intentionally batch and synchronise requests so they release at the same time? I don't remember seeing that in the sources, but then again I may have overlooked that part... In any case, in the evaluation doc for fortio vs wrk2 we had something like this as a recommendation:
fortio: Take measures to better spread the generated load in time. Currently requests are sent in batches as timers expire for each connection/goroutine. Adding some random jitter to the pacing per goroutine might help. Coordinating timing explicitly may help too, but would probably be hard to add in an efficient way.
As fortio will have to rely on the go scheduler for fairness, and because of garbage collection stalls, I think it's going to be (very) hard to do reliable and precise request-release-timing.
@oschaaf Sorry, I must have missed that comment. I ended up doing more experiments with fortio, adding some jitter between requests so we don't get these mini-bursts. The latency results look much better, with everything else kept the same.
Related to this topic: in the context of Istio, additional filters are added to Envoy, e.g., to report telemetry and tracing data. Most of these additional filters execute off the critical path, so in theory they shouldn't have a big impact on end-to-end latency. But they do. The reason is that every Envoy worker thread is single-threaded and runs an event loop. While the thread is in the middle of preparing a batch report, serializing and compressing it before putting it onto the wire, one or more requests could have arrived. The low-priority work of preparing the batch report does not get interrupted by the higher-priority work of handling requests, and thus can increase latency. First of all, is this a correct understanding of how an Envoy worker thread works? If so, is there any work in the pipeline that would address this issue?
@huang195 In the meantime, I've drafted an approach for Nighthawk to allow releasing requests in batches: https://github.com/envoyproxy/nighthawk/pull/14. I'll run some experiments with that tomorrow as well.
@huang195 yeah, basically, that's the right model. It depends on whether the work is being done on the main thread or a worker thread. But doing a dispatcher post on worker thread A back to A's dispatcher for later processing is just kicking the can down the road and will clearly add latency to any performance test.
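To illustrate the "kicking the can down the road" point, here is a minimal stand-in for the pattern (not Envoy's actual `Event::Dispatcher`, just the shape of posting deferred work back to the same worker's event loop):

```cpp
#include <cstdio>
#include <deque>
#include <functional>

// Minimal stand-in for an event-loop dispatcher: callbacks posted here run
// later, on the same thread, after the work currently in the queue.
class MiniDispatcher {
public:
  void post(std::function<void()> cb) { queue_.push_back(std::move(cb)); }

  void runOneIteration() {
    std::deque<std::function<void()>> batch;
    batch.swap(queue_);  // anything posted during this iteration runs next time
    for (auto& cb : batch) cb();
  }

private:
  std::deque<std::function<void()>> queue_;
};

int main() {
  MiniDispatcher dispatcher;

  // "Critical path" work for a request: proxy it, then defer the expensive
  // telemetry flush to a later loop iteration on the same thread.
  dispatcher.post([&] {
    std::puts("proxy request/response");
    dispatcher.post([] { std::puts("serialize + compress telemetry batch"); });
  });

  // The deferred work still runs on this worker thread, so a request arriving
  // in the next iteration queues behind it: the cost is moved, not removed.
  dispatcher.runOneIteration();  // proxies the request
  dispatcher.runOneIteration();  // flushes telemetry, delaying anything queued after it
  return 0;
}
```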
I know @mandarjog is an expert on Istio performance issues, he can probably chime in as well.
For those who are interested in performance of Envoy and Istio's Envoy, I've made a flamechart: https://github.com/iandyh/istio-enovy-perf/blob/master/1.1.1.svg
@iandyh that's really useful. Looks like ~30% of CPU cycles are being burnt on tracing/Mixer filter. This seems to match up roughly to the difference between Envoy and HAProxy based on the initial Nighthawk latency measurement.
@iandyh thanks for sharing that. I saw something very similar to that as well - a significant amount of CPU being spent in telemetry and tracing, and http parsing.
There are already a few issues open about the performance problems of the `http_parser` library that's being used, so hopefully that can be addressed. I'm more interested in all the additional work that Istio filters add off the critical path. Since each Envoy worker thread is single-threaded, the additional work it does running the Istio filters can actually block user requests that are waiting in the event queue. I'm wondering if someone has looked into breaking worker threads into top and bottom halves, so that higher-priority work doesn't get blocked by lower-priority work and, if needed, can preempt it.
To address latency, the current workaround is to add more Envoy worker threads to reduce the effect of queuing and user-perceived latency, but more worker threads also mean more resource usage. I'm wondering if there's a middle ground that could reduce latency by processing queued requests as quickly as possible while pushing telemetry/tracing and other non-essential work to when all the critical-path work is done.
There are already a few issues open about the performance problems of the http_parser library that's being used, so hopefully that can be addressed.
To set expectations, any perf gains there are going to be minimal. There is no low hanging fruit. Also, HTTP/1 is becoming increasingly irrelevant so I'm not sure how much it matters to most real world workloads.
I'm more interested in all the additional work that Istio filters add off the critical path.
Istio has well known performance problems that they are working to address with Mixer v2. I would recommend following up with them as this is really outside the scope of the core Envoy project.
I'm wondering if there's a middle ground that could reduce latency by processing queued requests as quickly as possible while pushing telemetry/tracing and other non-essential work to when all the critical-path work is done.
This is possible, but as @htuch already mentioned, it's unclear to me that the extra complexity is going to yield substantial benefit since this is really just "kicking the can down the road." Fundamentally, the used features just need to use less CPU cycles. If you feel like prototyping some changes please feel free, but I think the performance gains have to be substantial to offset the increased complexity.
@mattklein123 thanks for your response. Regarding filters: to someone who's new to Envoy, it might not be immediately obvious that filter code running off the critical path can, under certain conditions, have a significant impact on requests that are on the critical path. libevent supports multiple priorities, but Envoy doesn't currently use them. Perhaps that's worth exploring to alleviate this problem. These are just my initial thoughts, and as you and @htuch already mentioned, the gain might not offset the additional complexity. Anyway, I just wanted to gauge whether there were previous thoughts in this area that weren't explored further for whatever reason.
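As a sketch of what using libevent priorities could look like in principle (this is plain libevent, not Envoy's dispatcher; in a proxy the high-priority event would be the client socket's read event, but timers are used here so the example is self-contained):

```cpp
#include <event2/event.h>
#include <cstdio>

// Two event sources at different libevent priorities. When both are ready in
// the same loop iteration, the priority-0 callback runs first. Priorities
// only reorder ready events; they do not preempt a callback already running.
static void on_request(evutil_socket_t, short, void*) {
  std::puts("high priority: handle request");
}
static void on_telemetry(evutil_socket_t, short, void*) {
  std::puts("low priority: flush telemetry batch");
}

int main() {
  struct event_base* base = event_base_new();
  event_base_priority_init(base, 2);  // 0 = high, 1 = low

  struct event* request_ev = event_new(base, -1, EV_PERSIST, on_request, nullptr);
  event_priority_set(request_ev, 0);  // must be set before event_add()

  struct event* telemetry_ev = event_new(base, -1, EV_PERSIST, on_telemetry, nullptr);
  event_priority_set(telemetry_ev, 1);

  struct timeval tick = {0, 100000};  // both fire every 100 ms
  event_add(request_ev, &tick);
  event_add(telemetry_ev, &tick);

  event_base_dispatch(base);

  event_free(request_ev);
  event_free(telemetry_ev);
  event_base_free(base);
  return 0;
}
```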
We could certainly consider threads or events with priorities, but none of that would help if you're measuring performance at close to serving capacity, since there just won't be enough CPU to go around and you either indefinitely delay the stats or drop them or pay the latency cost somewhere.
When operating with plenty of spare CPU capacity, I agree, this approach would yield some benefit for filters that do low priority deferred actions, but I think Mixer is one of the few we've seen in the wild that do this.
@htuch yep, agreed. What I've seen sometimes is benchmarks themselves being the cause of excessive queuing delays. This was described earlier with fortio's parallel clients: there is plenty of spare CPU, but with 4 worker threads handling 100 parallel connections, queued requests can hit long-tail latency issues because instead of marshaling packets back and forth, the workers spend time massaging/compressing/sending telemetry and tracing data, uninterrupted. Coupled with the fact that the OS sometimes doesn't hand out work equally among the 4 worker threads, some threads can get more work than others, further exacerbating the latency issue. Real-world workloads might not have fortio's problem of parallel clients colluding to sync their requests, but that's not controllable by us, and sometimes these bursts do happen.
If the Mixer filter could be optimized, that would alleviate the problem. But the fact that the core eventing system allows lower-priority work to continue while there is higher-priority work to do is an area that could be improved. Of course this only makes sense when there is still spare CPU capacity.
My investigation has shown that performance degrades further if access_log is turned on. I see about a 40% degradation with the wrk benchmark. Is anyone else noticing significant degradation with access_log turned on?
@codervinod is this gRPC access log or filesystem? I would expect any remote access logging where you are sending an RPC 1:1 to have this kind of impact for small requests.
@huang195 regarding kicking the can down the road: if there is always some work waiting to be picked up, moving path length to the log filter still adds to latency. At low rps this is not noticeable, but it is pronounced at higher rps.
Have you done any experiments with the gRPC access log service? The current implementation does not batch; it sends a message for every request. We are working on another access log extension for Stackdriver that does batch but is more selective in what it sends.
Mixer client is an access log adapter that is batched but sends absolutely everything with every request.
I think ultimately some balance of compression, batching and selectiveness will win out.
@mandarjog I haven't done any experiments with the gRPC access log service yet. From what I can gather, there are some implicit constraints on running an Envoy filter: i) the number of worker threads is fixed, ii) once a worker thread "accepts" a new connection, it cannot hand it off to another thread even if it is overloaded and the other is idling, and iii) each worker is single-threaded.
So to minimize performance variance, an Envoy filter should not have vastly different code execution paths. In today's Mixer client, the default behavior is that 99 out of 100 requests go through the same code path, but the 100th request takes a much longer path, i.e., protobuf serialization, compression, network I/O. This is one cause of the long-tail latency problem. But variance can also be caused by other things, such as load, multitenancy, randomness in CPU scheduling, etc.
The mental model I have of an Envoy filter is that it has a critical path and a non-critical path. Policy/quota checking and request processing/forwarding are on the critical path. Telemetry/tracing and other things that happen afterwards are on the non-critical path. Due to Envoy's single-threadedness, the non-critical path can sometimes impact the critical path. Code on the critical path needs to run regardless. Code on the non-critical path, on the other hand, should have control points that, in the most extreme case, could short-circuit it completely, freeing the worker thread to work on something more pressing. Envoy internally keeps a lot of runtime stats, which could be used as triggers for these control points.
In the scenario where latency is due to a 1:1 RPC hogging CPU time, and assuming the RPC server is on the same machine, this might be worth considering: instead of using protobuf/gRPC, use a zero-copy, (de)serialization-free message format (like Cap'n Proto) and pass messages as file descriptors over a Unix domain socket (the memory for the messages would be allocated via shm or memfd_create, and the message format allows reading and writing directly to this mapped memory, making this possible). That at least skips a serialize and a deserialize step, as well as the copy out of Envoy and into the RPC process. Of course, the RPC server still needs to be careful about not burning too many CPU cycles doing whatever it's going to do with e.g. the logs, but at least it takes that concern off Envoy's shoulders.
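A rough sketch of the handoff being described, under the assumption that the proxy and the collector already share a connected Unix domain socket (Linux-only; requires memfd_create, glibc >= 2.27): the payload is written into a memfd-backed mapping and only the file descriptor crosses the process boundary via SCM_RIGHTS. Error handling is omitted; the collector would mmap the received fd and read the message in place.

```cpp
#include <cstring>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

// Send `fd` over the connected Unix domain socket `sock` using SCM_RIGHTS.
static void send_fd(int sock, int fd) {
  char dummy = 'x';
  iovec iov{&dummy, 1};  // at least one byte of regular data must be sent
  char ctrl[CMSG_SPACE(sizeof(int))] = {};
  msghdr msg{};
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = ctrl;
  msg.msg_controllen = sizeof(ctrl);
  cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;
  cmsg->cmsg_len = CMSG_LEN(sizeof(int));
  std::memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
  sendmsg(sock, &msg, 0);
}

int main() {
  // In the proxy process: allocate a shareable buffer and write the payload
  // (e.g. a Cap'n Proto message) directly into it -- no extra serialize/copy
  // step into a socket buffer.
  const size_t kSize = 4096;
  int memfd = memfd_create("telemetry-batch", 0);
  ftruncate(memfd, kSize);
  void* buf = mmap(nullptr, kSize, PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0);
  std::memcpy(buf, "batched log records...", 23);

  // `pair[1]` stands in for the collector's end of the socket; in practice it
  // would live in the other process.
  int pair[2];
  socketpair(AF_UNIX, SOCK_DGRAM, 0, pair);
  send_fd(pair[0], memfd);

  munmap(buf, kSize);
  close(memfd);
  close(pair[0]);
  close(pair[1]);
  return 0;
}
```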
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.
@huang195 you had an experiment that showed much better latency when you change Linux scheduler settings, do you have a link to the documentation? cc @iandyh
Same problem: up to 60% CPU in a stress test. Is there any new solution?
Benchmarking Envoy and comparing performance against HAProxy
Setup: LB: Envoy/HAProxy, Backend: Nginx, Benchmarking Tool: wrk (https://github.com/wg/wrk)
Envoy Config: Concurrency = 4
HAProxy Config:
Nginx Config
Benchmark Results
Envoy
$ wrk -c100 -d60s -t10 "http://172.16.x.x:8090/" --latency
HAProxy
$ wrk -c100 -d60s -t10 "http://172.16.x.x:8080/" --latency
Note:
Could you please point out where I am going wrong? (According to various online blogs, Envoy seems to provide performance equivalent to HAProxy.)