rojkov opened this issue 3 years ago
For historical context, in "the beginning" Envoy did use bufferevent. I don't recall the history, but libevent's bufferevent actually had quite a lot of different issues, both perf and functionality, which is why we ended up moving away from it.
Alternatively we could abstract Libevent out somehow and resort to home grown event loops using io_uring/ioring directly when it's available and falling back to Libevent's "readiness" API otherwise. Probably with this approach it would be easier to implement as an extension a hardware accelerated event loop bypassing the kernel completely for network transfers.
I think this would be a better approach. Similar to how we support pluggable IoHandle implementations (now we have user space IoHandles, etc.), I think it would be pretty nice to support a "completeness" IoHandle that performs better with IOCP on Windows and io_uring on Linux. I think this would be an awesome thing to work on!
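Roughly what I have in mind (purely illustrative C++, not Envoy's actual Network::IoHandle interface): a readiness-style handle is told "the fd is ready", while a completion-style handle hands its buffers to the backend (io_uring, IOCP) and is told when the I/O has finished.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

// Hypothetical illustration only; the real interface would live next to
// Envoy's existing IoHandle abstractions.
class CompletionIoHandle {
public:
  virtual ~CompletionIoHandle() = default;

  // Submit a read into `buf`; `on_done` is invoked with the number of bytes
  // read (or a negative errno) once the backend reports the completion.
  virtual void asyncRead(void* buf, size_t len,
                         std::function<void(int64_t result)> on_done) = 0;

  // Submit a write of `buf`; the completion reports bytes written or an error.
  virtual void asyncWrite(const void* buf, size_t len,
                          std::function<void(int64_t result)> on_done) = 0;
};
```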
I've played a bit with io_uring and I can't say it magically improves performance. It's about the same if not worse. I haven't spent too much time on optimizations though.
But there's one thing which gives hope: if io_uring is set up in IORING_SETUP_SQPOLL mode, then in my tests with a single worker thread Envoy starts to perform noticeably better. In this mode the kernel fires up an auxiliary thread (in kernel space) which polls the ring buffer and does the data moving instead of the worker thread. The latter doesn't have to switch context for accept(), connect(), readv() and writev() syscalls. Here are the non-scientific results for a single worker running on an i9-7920X:
rojkov@drozhkov:~/work/envoy (perf-research)$ ../nighthawk/bazel-bin/nighthawk_client --duration 15 --rps 100000 --open-loop http://127.0.0.1:10000/
[16:18:50.265828][3868438][I] Starting 1 threads / event loops. Time limit: 15 seconds.
[16:18:50.265864][3868438][I] Global targets: 100 connections and 100000 calls per second.
[16:19:05.816058][3868444][I] Stopping after 15000 ms. Initiated: 1499910 / Completed: 1499832. (Completion rate was 99987.87344570608 per second.)
Nighthawk - A layer 7 protocol benchmarking tool.
benchmark_http_client.latency_2xx (213343 samples)
min: 0s 002ms 880us | mean: 0s 006ms 775us | max: 0s 021ms 251us | pstdev: 0s 002ms 011us
Percentile Count Value
0.5 106677 0s 006ms 452us
0.75 160009 0s 007ms 011us
0.8 170690 0s 007ms 145us
0.9 192010 0s 007ms 855us
0.95 202676 0s 010ms 506us
0.990625 211343 0s 016ms 622us
0.99902344 213135 0s 018ms 872us
Queueing and connection setup latency (213421 samples)
min: 0s 000ms 001us | mean: 0s 000ms 072us | max: 0s 042ms 164us | pstdev: 0s 000ms 723us
Percentile Count Value
0.5 106824 0s 000ms 004us
0.75 160076 0s 000ms 005us
0.8 170738 0s 000ms 006us
0.9 192079 0s 000ms 033us
0.95 202750 0s 000ms 139us
0.990625 211421 0s 001ms 197us
0.99902344 213213 0s 013ms 528us
Request start to response end (213343 samples)
min: 0s 002ms 880us | mean: 0s 006ms 775us | max: 0s 021ms 250us | pstdev: 0s 002ms 011us
Percentile Count Value
0.5 106689 0s 006ms 452us
0.75 160019 0s 007ms 011us
0.8 170686 0s 007ms 145us
0.9 192009 0s 007ms 855us
0.95 202676 0s 010ms 506us
0.990625 211343 0s 016ms 622us
0.99902344 213135 0s 018ms 872us
Response body size in bytes (213343 samples)
min: 10 | mean: 10.0 | max: 10 | pstdev: 0.0
Response header size in bytes (213343 samples)
min: 159 | mean: 159.00411543851854 | max: 160 | pstdev: 0.06401954143062451
Initiation to completion (1499832 samples)
min: 0s 000ms 000us | mean: 0s 001ms 068us | max: 0s 054ms 581us | pstdev: 0s 002ms 510us
Percentile Count Value
0.5 749920 0s 000ms 029us
0.75 1124876 0s 000ms 296us
0.8 1199869 0s 000ms 430us
0.9 1349851 0s 006ms 010us
0.95 1424851 0s 006ms 848us
0.990625 1485773 0s 009ms 573us
0.99902344 1498369 0s 017ms 407us
Counter Value Per second
benchmark.http_2xx 213343 14222.73
benchmark.pool_overflow 1286489 85765.14
cluster_manager.cluster_added 1 0.07
default.total_match_count 1 0.07
membership_change 1 0.07
runtime.load_success 1 0.07
runtime.override_dir_not_exists 1 0.07
upstream_cx_http1_total 100 6.67
upstream_cx_overflow 194641 12975.95
upstream_cx_rx_bytes_total 45656280 3043723.79
upstream_cx_total 100 6.67
upstream_cx_tx_bytes_total 8750220 583342.59
upstream_rq_pending_overflow 1286489 85765.14
upstream_rq_pending_total 23932 1595.45
upstream_rq_total 213421 14227.93
[16:19:10.923507][3868444][I] Wait for the connection pool drain timed out, proceeding to hard shutdown.
[16:19:10.932525][3868438][I] Done.
rojkov@drozhkov:~/work/envoy (perf-research)$ ../nighthawk/bazel-bin/nighthawk_client --duration 30 --rps 100000 --open-loop http://127.0.0.1:10000/
[16:26:42.508152][344639][I] Starting 1 threads / event loops. Time limit: 30 seconds.
[16:26:42.508190][344639][I] Global targets: 100 connections and 100000 calls per second.
[16:27:13.059298][344645][I] Stopping after 30001 ms. Initiated: 3000008 / Completed: 2999921. (Completion rate was 99993.77022406428 per second.)
Nighthawk - A layer 7 protocol benchmarking tool.
benchmark_http_client.latency_2xx (636232 samples)
min: 0s 001ms 254us | mean: 0s 004ms 230us | max: 0s 088ms 608us | pstdev: 0s 001ms 280us
Percentile Count Value
0.5 318117 0s 003ms 918us
0.75 477175 0s 004ms 770us
0.8 508995 0s 005ms 021us
0.9 572619 0s 005ms 687us
0.95 604426 0s 006ms 261us
0.990625 630269 0s 007ms 803us
0.99902344 635611 0s 010ms 662us
Queueing and connection setup latency (636319 samples)
min: 0s 000ms 001us | mean: 0s 000ms 048us | max: 0s 107ms 204us | pstdev: 0s 000ms 986us
Percentile Count Value
0.5 318163 0s 000ms 003us
0.75 477254 0s 000ms 004us
0.8 509080 0s 000ms 005us
0.9 572689 0s 000ms 030us
0.95 604504 0s 000ms 068us
0.990625 630354 0s 000ms 287us
0.99902344 635698 0s 019ms 259us
Request start to response end (636232 samples)
min: 0s 001ms 254us | mean: 0s 004ms 230us | max: 0s 088ms 608us | pstdev: 0s 001ms 280us
Percentile Count Value
0.5 318165 0s 003ms 918us
0.75 477195 0s 004ms 770us
0.8 508991 0s 005ms 021us
0.9 572614 0s 005ms 686us
0.95 604423 0s 006ms 260us
0.990625 630270 0s 007ms 803us
0.99902344 635612 0s 010ms 662us
Response body size in bytes (636232 samples)
min: 60 | mean: 60.0 | max: 60 | pstdev: 0.0
Response header size in bytes (636232 samples)
min: 160 | mean: 160.00002514805882 | max: 161 | pstdev: 0.005014721006561768
Initiation to completion (2999921 samples)
min: 0s 000ms 000us | mean: 0s 001ms 155us | max: 0s 110ms 845us | pstdev: 0s 001ms 842us
Percentile Count Value
0.5 1499986 0s 000ms 311us
0.75 2249941 0s 001ms 028us
0.8 2399944 0s 002ms 946us
0.9 2699967 0s 004ms 018us
0.95 2849954 0s 004ms 953us
0.990625 2971798 0s 006ms 530us
0.99902344 2996992 0s 009ms 261us
Counter Value Per second
benchmark.http_2xx 636232 21206.97
benchmark.pool_overflow 2363689 78786.80
cluster_manager.cluster_added 1 0.03
default.total_match_count 1 0.03
membership_change 1 0.03
runtime.load_success 1 0.03
runtime.override_dir_not_exists 1 0.03
upstream_cx_http1_total 100 3.33
upstream_cx_overflow 600798 20025.88
upstream_cx_rx_bytes_total 168601496 5619847.73
upstream_cx_total 100 3.33
upstream_cx_tx_bytes_total 26089038 869603.32
upstream_rq_pending_overflow 2363689 78786.80
upstream_rq_pending_total 79653 2655.00
upstream_rq_total 636319 21209.87
[16:27:18.190261][344645][I] Wait for the connection pool drain timed out, proceeding to hard shutdown.
[16:27:18.199575][344639][I] Done.
Perhaps big.LITTLE architectures may benefit from SQPOLL. Or even hyperthreading could be useful if there's a way to pin the auxiliary thread to the second logical core (hopefully in this mode there won't be too much cache thrashing).
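For reference, pinning the SQPOLL kernel thread is already supported at ring-setup time; a minimal liburing sketch (assuming a recent kernel and liburing, not part of my hack) looks like:

```cpp
#include <liburing.h>
#include <cstdio>
#include <cstring>

int main() {
  struct io_uring ring;
  struct io_uring_params params;
  std::memset(&params, 0, sizeof(params));

  // Ask the kernel for a submission-polling thread and pin it to CPU 1,
  // e.g. the sibling hyperthread of a worker pinned to CPU 0.
  // Note: SQPOLL needs elevated privileges on older kernels.
  params.flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF;
  params.sq_thread_cpu = 1;
  params.sq_thread_idle = 2000; // ms of inactivity before the poller sleeps

  int ret = io_uring_queue_init_params(256, &ring, &params);
  if (ret < 0) {
    std::fprintf(stderr, "io_uring_queue_init_params: %s\n", std::strerror(-ret));
    return 1;
  }
  io_uring_queue_exit(&ring);
  return 0;
}
```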
PS: my hack doesn't have any integration with libevent; all deferred deletes are done explicitly in ioring's event loop. I don't know how to do such an integration properly.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
The work hasn't stalled and is progressing slowly. The current state is here: now io_uring doesn't run its own event loop and is integrated with libevent (through eventfd).
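The general shape of that integration (a simplified sketch, not the exact code in my branch) is to register an eventfd with the ring and let libevent watch it, draining the completion queue from the event callback:

```cpp
#include <liburing.h>
#include <sys/eventfd.h>
#include <unistd.h>
#include <event2/event.h>
#include <cstdint>

// libevent callback: fires whenever the kernel signals new completions.
static void onCompletions(evutil_socket_t fd, short /*events*/, void* arg) {
  auto* ring = static_cast<struct io_uring*>(arg);
  uint64_t count;
  (void)read(fd, &count, sizeof(count)); // reset the eventfd counter

  struct io_uring_cqe* cqe;
  while (io_uring_peek_cqe(ring, &cqe) == 0) {
    // ... dispatch cqe->res / cqe->user_data to the owning socket here ...
    io_uring_cqe_seen(ring, cqe);
  }
}

void attachRingToLoop(struct io_uring* ring, struct event_base* base) {
  int efd = eventfd(0, EFD_NONBLOCK);
  io_uring_register_eventfd(ring, efd); // kernel bumps efd once per completion
  struct event* ev = event_new(base, efd, EV_READ | EV_PERSIST, onCompletions, ring);
  event_add(ev, nullptr);
}
```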
Finally submitted a draft PR (#19082). I did one more set of test runs, but this time with CPU pinning, like:
$ taskset --cpu-list 0 bazel-bin/source/exe/envoy-static -c rojkov-data/envoy-config-perf-measurement-io_uring.yaml --concurrency 1 -l warn
$ ../nighthawk/bazel-bin/nighthawk_client --duration 30 --rps 100000 --open-loop http://127.0.0.1:10000/
[10:26:38.007986][415877][I] Starting 1 threads / event loops. Time limit: 30 seconds.
[10:26:38.008026][415877][I] Global targets: 100 connections and 100000 calls per second.
[10:27:08.558070][415883][I] Stopping after 30000 ms. Initiated: 2999999 / Completed: 2999898. (Completion rate was 99996.52666921377 per second.)
Nighthawk - A layer 7 protocol benchmarking tool.
benchmark_http_client.latency_2xx (513985 samples)
min: 0s 002ms 005us | mean: 0s 005ms 611us | max: 0s 019ms 096us | pstdev: 0s 000ms 827us
Percentile Count Value
0.5 257043 0s 005ms 607us
0.75 385529 0s 006ms 064us
0.8 411198 0s 006ms 175us
0.9 462596 0s 006ms 486us
0.95 488290 0s 006ms 809us
0.990625 509167 0s 008ms 401us
0.99902344 513484 0s 009ms 688us
Queueing and connection setup latency (514085 samples)
min: 0s 000ms 001us | mean: 0s 000ms 059us | max: 0s 032ms 322us | pstdev: 0s 000ms 556us
Percentile Count Value
0.5 257191 0s 000ms 002us
0.75 385612 0s 000ms 004us
0.8 411428 0s 000ms 004us
0.9 462688 0s 000ms 005us
0.95 488382 0s 000ms 011us
0.990625 509266 0s 001ms 639us
0.99902344 513583 0s 006ms 520us
Request start to response end (513985 samples)
min: 0s 002ms 005us | mean: 0s 005ms 611us | max: 0s 019ms 096us | pstdev: 0s 000ms 827us
Percentile Count Value
0.5 256993 0s 005ms 607us
0.75 385493 0s 006ms 064us
0.8 411220 0s 006ms 175us
0.9 462614 0s 006ms 486us
0.95 488295 0s 006ms 809us
0.990625 509169 0s 008ms 401us
0.99902344 513484 0s 009ms 688us
Response body size in bytes (513985 samples)
min: 10 | mean: 10.0 | max: 10 | pstdev: 0.0
Response header size in bytes (513985 samples)
min: 141 | mean: 141.0000136190747 | max: 142 | pstdev: 0.0036903778949415986
Initiation to completion (2999898 samples)
min: 0s 000ms 000us | mean: 0s 001ms 053us | max: 0s 036ms 605us | pstdev: 0s 002ms 162us
Percentile Count Value
0.5 1499956 0s 000ms 010us
0.75 2249937 0s 000ms 352us
0.8 2399921 0s 000ms 617us
0.9 2699915 0s 005ms 515us
0.95 2849946 0s 006ms 057us
0.990625 2971779 0s 006ms 959us
0.99902344 2996970 0s 009ms 218us
Counter Value Per second
benchmark.http_2xx 513985 17132.82
benchmark.pool_overflow 2485913 82863.70
cluster_manager.cluster_added 1 0.03
default.total_match_count 1 0.03
membership_change 1 0.03
runtime.load_success 1 0.03
runtime.override_dir_not_exists 1 0.03
upstream_cx_http1_total 100 3.33
upstream_cx_overflow 448170 14938.99
upstream_cx_rx_bytes_total 98685127 3289501.74
upstream_cx_total 100 3.33
upstream_cx_tx_bytes_total 21077485 702582.30
upstream_rq_pending_overflow 2485913 82863.70
upstream_rq_pending_total 22230 741.00
upstream_rq_total 514085 17136.15
[10:27:13.689368][415883][I] Wait for the connection pool drain timed out, proceeding to hard shutdown.
[10:27:13.697869][415877][I] Done.
Activated with this config:
bootstrap_extensions:
- name: envoy.extensions.io_socket.io_uring
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.network.socket_interface.v3.IoUringSocketInterface
    use_submission_queue_polling: false
    read_buffer_size: 8192
    io_uring_size: 300
default_socket_interface: "envoy.extensions.network.socket_interface.io_uring"
$ ../nighthawk/bazel-bin/nighthawk_client --duration 30 --rps 100000 --open-loop http://127.0.0.1:10000/
[10:21:15.578123][415501][I] Starting 1 threads / event loops. Time limit: 30 seconds.
[10:21:15.578159][415501][I] Global targets: 100 connections and 100000 calls per second.
[10:21:46.128184][415507][I] Stopping after 30000 ms. Initiated: 2999998 / Completed: 2999897. (Completion rate was 99996.56666666667 per second.)
Nighthawk - A layer 7 protocol benchmarking tool.
benchmark_http_client.latency_2xx (667211 samples)
min: 0s 001ms 736us | mean: 0s 004ms 377us | max: 0s 082ms 018us | pstdev: 0s 000ms 759us
Percentile Count Value
0.5 333665 0s 004ms 248us
0.75 500462 0s 004ms 483us
0.8 533858 0s 004ms 543us
0.9 600517 0s 004ms 734us
0.95 633857 0s 005ms 940us
0.990625 660959 0s 006ms 625us
0.99902344 666560 0s 007ms 933us
Queueing and connection setup latency (667311 samples)
min: 0s 000ms 001us | mean: 0s 000ms 045us | max: 0s 033ms 665us | pstdev: 0s 000ms 344us
Percentile Count Value
0.5 334022 0s 000ms 001us
0.75 500527 0s 000ms 002us
0.8 534039 0s 000ms 002us
0.9 600626 0s 000ms 002us
0.95 633948 0s 000ms 004us
0.990625 661055 0s 001ms 394us
0.99902344 666660 0s 004ms 177us
Request start to response end (667211 samples)
min: 0s 001ms 735us | mean: 0s 004ms 377us | max: 0s 082ms 018us | pstdev: 0s 000ms 759us
Percentile Count Value
0.5 333701 0s 004ms 248us
0.75 500511 0s 004ms 483us
0.8 533897 0s 004ms 543us
0.9 600501 0s 004ms 734us
0.95 633857 0s 005ms 940us
0.990625 660961 0s 006ms 625us
0.99902344 666560 0s 007ms 933us
Response body size in bytes (667211 samples)
min: 10 | mean: 10.0 | max: 10 | pstdev: 0.0
Response header size in bytes (667211 samples)
min: 141 | mean: 141.0 | max: 141 | pstdev: 0.0
Initiation to completion (2999897 samples)
min: 0s 000ms 000us | mean: 0s 001ms 011us | max: 0s 082ms 055us | pstdev: 0s 001ms 881us
Percentile Count Value
0.5 1500020 0s 000ms 003us
0.75 2249924 0s 000ms 125us
0.8 2399933 0s 003ms 918us
0.9 2699953 0s 004ms 314us
0.95 2849977 0s 004ms 535us
0.990625 2971783 0s 006ms 157us
0.99902344 2996968 0s 007ms 815us
Counter Value Per second
benchmark.http_2xx 667211 22240.37
benchmark.pool_overflow 2332686 77756.20
cluster_manager.cluster_added 1 0.03
default.total_match_count 1 0.03
membership_change 1 0.03
runtime.load_success 1 0.03
runtime.override_dir_not_exists 1 0.03
upstream_cx_http1_total 100 3.33
upstream_cx_overflow 606844 20228.13
upstream_cx_rx_bytes_total 128104512 4270150.35
upstream_cx_total 100 3.33
upstream_cx_tx_bytes_total 27359751 911991.69
upstream_rq_pending_overflow 2332686 77756.20
upstream_rq_pending_total 18133 604.43
upstream_rq_total 667311 22243.70
[10:21:51.249381][415507][I] Wait for the connection pool drain timed out, proceeding to hard shutdown.
[10:21:51.258415][415501][I] Done.
Nice, this looks really promising!
Title: Migrate event loops from file events to buffer events
Description:
This can help to spare at least 3 syscalls per request (two writev()s and one readv()).

Currently Envoy relies on Libevent as an implementation for event loops and uses it in the "readiness" paradigm. That is, Libevent (on behalf of the kernel) notifies the application (Envoy) that a certain file descriptor is ready to be written or read; then the application makes a syscall to read from or write to the file descriptor. Envoy spends roughly 20% of a request's time span waiting for these syscalls to return.
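To make the "readiness" flow concrete, the per-connection pattern is roughly this (a generic libevent sketch, not Envoy's actual dispatcher code): the loop only learns that the fd is readable, and the application still has to issue the readv() itself.

```cpp
#include <event2/event.h>
#include <sys/uio.h>

// Readiness paradigm: libevent says "fd is readable", the application still
// pays for a readv() syscall (and context switch) on every wakeup.
static void onReadable(evutil_socket_t fd, short /*events*/, void* /*ctx*/) {
  char chunk[8192];
  struct iovec iov = {chunk, sizeof(chunk)};
  ssize_t n = readv(fd, &iov, 1);
  (void)n; // ... hand the data to the codec ...
}

void watchFd(struct event_base* base, evutil_socket_t fd) {
  struct event* ev = event_new(base, fd, EV_READ | EV_PERSIST, onReadable, nullptr);
  event_add(ev, nullptr);
}
```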
Libevent also provides an API called "bufferevents" for event loops working in the "completeness" paradigm (or buffered/ringed IO). That is, an application registers read and write buffers, and Libevent then notifies the application that there is data available in the read buffer or that a chunk of data put in the write buffer has been consumed. In this case, if the underlying OS supports this paradigm (as Microsoft's IOCP does), fewer computing resources are spent on syscalls like writev() or readv(): there is no need for context switches. Otherwise it's still file events hidden under the hood.

Linux supports the "completeness" paradigm natively too with its io_uring syscall, but Libevent hasn't been updated to support it yet. There is an issue for that though.

Perhaps we could modify Envoy's event loop to work in the "completeness" paradigm by relying on "bufferevents" for streaming connections and hoping for io_uring/ioring support to be added soon. Currently I don't know whether "bufferevents" incur additional overhead compared with the traditional "readiness" approach. With my quick hack Envoy works even ~5% faster (14500 rps vs 15500 rps), but I presume that's because I have a few things broken, e.g. flow control.
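For reference, the "bufferevents" flavor of the same read path looks roughly like this (a generic sketch of libevent's bufferevent API, not my hack): the application only touches the evbuffers and is called back once data is already there, so no explicit readv() appears in application code; whether a syscall is actually saved depends on the backend.

```cpp
#include <event2/bufferevent.h>
#include <event2/buffer.h>
#include <event2/event.h>

// Completeness paradigm: libevent owns the buffers and tells us when data
// has arrived; the application never calls readv()/writev() directly.
static void onRead(struct bufferevent* bev, void* /*ctx*/) {
  struct evbuffer* input = bufferevent_get_input(bev);
  // ... drain `input` into the codec ...
  evbuffer_drain(input, evbuffer_get_length(input));
}

void attachSocket(struct event_base* base, evutil_socket_t fd) {
  struct bufferevent* bev = bufferevent_socket_new(base, fd, BEV_OPT_CLOSE_ON_FREE);
  bufferevent_setcb(bev, onRead, /*writecb=*/nullptr, /*eventcb=*/nullptr, nullptr);
  bufferevent_enable(bev, EV_READ | EV_WRITE);
}
```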
Alternatively we could abstract Libevent out somehow and resort to home grown event loops using io_uring/ioring directly when it's available and falling back to Libevent's "readiness" API otherwise. Probably with this approach it would be easier to implement as an extension a hardware accelerated event loop bypassing the kernel completely for network transfers.
/cc @antoniovicente