rojkov opened this issue 3 years ago
For historical context, in "the beginning" Envoy did use bufferevent. I don't recall the history, but libevent's bufferevent actually had quite a lot of different issues, both perf and functionality, which is why we ended up moving away from it.
Alternatively we could abstract Libevent out somehow and resort to home grown event loops using io_uring/ioring directly when it's available and falling back to Libevent's "readiness" API otherwise. Probably with this approach it would be easier to implement as an extension a hardware accelerated event loop bypassing the kernel completely for network transfers.
I think this would be a better approach. Similar to how we support pluggable IoHandle implementations (now we have user space IoHandles, etc.), I think it would be pretty nice to support a "completeness" IoHandle that performs better with IOCP on Windows and io_uring on Linux. I think this would be an awesome thing to work on!
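Roughly what I have in mind (purely illustrative C++, not Envoy's actual Network::IoHandle interface): a readiness-style handle is told "the fd is ready", while a completion-style handle hands its buffers to the backend (io_uring, IOCP) and is told when the I/O has finished.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

// Hypothetical illustration only; the real interface would live next to
// Envoy's existing IoHandle abstractions.
class CompletionIoHandle {
public:
  virtual ~CompletionIoHandle() = default;

  // Submit a read into `buf`; `on_done` is invoked with the number of bytes
  // read (or a negative errno) once the backend reports the completion.
  virtual void asyncRead(void* buf, size_t len,
                         std::function<void(int64_t result)> on_done) = 0;

  // Submit a write of `buf`; the completion reports bytes written or an error.
  virtual void asyncWrite(const void* buf, size_t len,
                          std::function<void(int64_t result)> on_done) = 0;
};
```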
I've played a bit with io_uring and I can't say it magically improves performance. It's about the same if not worse. I haven't spent too much time on optimizations though.
But there's one thing which gives hope: if io_uring is set up in IORING_SETUP_SQPOLL mode, then in my tests with a single worker thread Envoy starts to perform noticeably better. In this mode the kernel fires up an auxiliary thread (in kernel space) which polls the ring buffer and does the data moving instead of the worker thread. The latter doesn't have to switch context for accept(), connect(), readv() and writev() syscalls. Here are the non-scientific results for a single worker running on an i9-7920X:
rojkov@drozhkov:~/work/envoy (perf-research)$ ../nighthawk/bazel-bin/nighthawk_client --duration 15 --rps 100000 --open-loop http://127.0.0.1:10000/
[16:18:50.265828][3868438][I] Starting 1 threads / event loops. Time limit: 15 seconds.
[16:18:50.265864][3868438][I] Global targets: 100 connections and 100000 calls per second.
[16:19:05.816058][3868444][I] Stopping after 15000 ms. Initiated: 1499910 / Completed: 1499832. (Completion rate was 99987.87344570608 per second.)
Nighthawk - A layer 7 protocol benchmarking tool.
benchmark_http_client.latency_2xx (213343 samples)
min: 0s 002ms 880us | mean: 0s 006ms 775us | max: 0s 021ms 251us | pstdev: 0s 002ms 011us
Percentile Count Value
0.5 106677 0s 006ms 452us
0.75 160009 0s 007ms 011us
0.8 170690 0s 007ms 145us
0.9 192010 0s 007ms 855us
0.95 202676 0s 010ms 506us
0.990625 211343 0s 016ms 622us
0.99902344 213135 0s 018ms 872us
Queueing and connection setup latency (213421 samples)
min: 0s 000ms 001us | mean: 0s 000ms 072us | max: 0s 042ms 164us | pstdev: 0s 000ms 723us
Percentile Count Value
0.5 106824 0s 000ms 004us
0.75 160076 0s 000ms 005us
0.8 170738 0s 000ms 006us
0.9 192079 0s 000ms 033us
0.95 202750 0s 000ms 139us
0.990625 211421 0s 001ms 197us
0.99902344 213213 0s 013ms 528us
Request start to response end (213343 samples)
min: 0s 002ms 880us | mean: 0s 006ms 775us | max: 0s 021ms 250us | pstdev: 0s 002ms 011us
Percentile Count Value
0.5 106689 0s 006ms 452us
0.75 160019 0s 007ms 011us
0.8 170686 0s 007ms 145us
0.9 192009 0s 007ms 855us
0.95 202676 0s 010ms 506us
0.990625 211343 0s 016ms 622us
0.99902344 213135 0s 018ms 872us
Response body size in bytes (213343 samples)
min: 10 | mean: 10.0 | max: 10 | pstdev: 0.0
Response header size in bytes (213343 samples)
min: 159 | mean: 159.00411543851854 | max: 160 | pstdev: 0.06401954143062451
Initiation to completion (1499832 samples)
min: 0s 000ms 000us | mean: 0s 001ms 068us | max: 0s 054ms 581us | pstdev: 0s 002ms 510us
Percentile Count Value
0.5 749920 0s 000ms 029us
0.75 1124876 0s 000ms 296us
0.8 1199869 0s 000ms 430us
0.9 1349851 0s 006ms 010us
0.95 1424851 0s 006ms 848us
0.990625 1485773 0s 009ms 573us
0.99902344 1498369 0s 017ms 407us
Counter Value Per second
benchmark.http_2xx 213343 14222.73
benchmark.pool_overflow 1286489 85765.14
cluster_manager.cluster_added 1 0.07
default.total_match_count 1 0.07
membership_change 1 0.07
runtime.load_success 1 0.07
runtime.override_dir_not_exists 1 0.07
upstream_cx_http1_total 100 6.67
upstream_cx_overflow 194641 12975.95
upstream_cx_rx_bytes_total 45656280 3043723.79
upstream_cx_total 100 6.67
upstream_cx_tx_bytes_total 8750220 583342.59
upstream_rq_pending_overflow 1286489 85765.14
upstream_rq_pending_total 23932 1595.45
upstream_rq_total 213421 14227.93
[16:19:10.923507][3868444][I] Wait for the connection pool drain timed out, proceeding to hard shutdown.
[16:19:10.932525][3868438][I] Done.
rojkov@drozhkov:~/work/envoy (perf-research)$ ../nighthawk/bazel-bin/nighthawk_client --duration 30 --rps 100000 --open-loop http://127.0.0.1:10000/
[16:26:42.508152][344639][I] Starting 1 threads / event loops. Time limit: 30 seconds.
[16:26:42.508190][344639][I] Global targets: 100 connections and 100000 calls per second.
[16:27:13.059298][344645][I] Stopping after 30001 ms. Initiated: 3000008 / Completed: 2999921. (Completion rate was 99993.77022406428 per second.)
Nighthawk - A layer 7 protocol benchmarking tool.
benchmark_http_client.latency_2xx (636232 samples)
min: 0s 001ms 254us | mean: 0s 004ms 230us | max: 0s 088ms 608us | pstdev: 0s 001ms 280us
Percentile Count Value
0.5 318117 0s 003ms 918us
0.75 477175 0s 004ms 770us
0.8 508995 0s 005ms 021us
0.9 572619 0s 005ms 687us
0.95 604426 0s 006ms 261us
0.990625 630269 0s 007ms 803us
0.99902344 635611 0s 010ms 662us
Queueing and connection setup latency (636319 samples)
min: 0s 000ms 001us | mean: 0s 000ms 048us | max: 0s 107ms 204us | pstdev: 0s 000ms 986us
Percentile Count Value
0.5 318163 0s 000ms 003us
0.75 477254 0s 000ms 004us
0.8 509080 0s 000ms 005us
0.9 572689 0s 000ms 030us
0.95 604504 0s 000ms 068us
0.990625 630354 0s 000ms 287us
0.99902344 635698 0s 019ms 259us
Request start to response end (636232 samples)
min: 0s 001ms 254us | mean: 0s 004ms 230us | max: 0s 088ms 608us | pstdev: 0s 001ms 280us
Percentile Count Value
0.5 318165 0s 003ms 918us
0.75 477195 0s 004ms 770us
0.8 508991 0s 005ms 021us
0.9 572614 0s 005ms 686us
0.95 604423 0s 006ms 260us
0.990625 630270 0s 007ms 803us
0.99902344 635612 0s 010ms 662us
Response body size in bytes (636232 samples)
min: 60 | mean: 60.0 | max: 60 | pstdev: 0.0
Response header size in bytes (636232 samples)
min: 160 | mean: 160.00002514805882 | max: 161 | pstdev: 0.005014721006561768
Initiation to completion (2999921 samples)
min: 0s 000ms 000us | mean: 0s 001ms 155us | max: 0s 110ms 845us | pstdev: 0s 001ms 842us
Percentile Count Value
0.5 1499986 0s 000ms 311us
0.75 2249941 0s 001ms 028us
0.8 2399944 0s 002ms 946us
0.9 2699967 0s 004ms 018us
0.95 2849954 0s 004ms 953us
0.990625 2971798 0s 006ms 530us
0.99902344 2996992 0s 009ms 261us
Counter Value Per second
benchmark.http_2xx 636232 21206.97
benchmark.pool_overflow 2363689 78786.80
cluster_manager.cluster_added 1 0.03
default.total_match_count 1 0.03
membership_change 1 0.03
runtime.load_success 1 0.03
runtime.override_dir_not_exists 1 0.03
upstream_cx_http1_total 100 3.33
upstream_cx_overflow 600798 20025.88
upstream_cx_rx_bytes_total 168601496 5619847.73
upstream_cx_total 100 3.33
upstream_cx_tx_bytes_total 26089038 869603.32
upstream_rq_pending_overflow 2363689 78786.80
upstream_rq_pending_total 79653 2655.00
upstream_rq_total 636319 21209.87
[16:27:18.190261][344645][I] Wait for the connection pool drain timed out, proceeding to hard shutdown.
[16:27:18.199575][344639][I] Done.
Perhaps big.LITTLE architectures may benefit from SQPOLL. Or even hyperthreading could be useful if there's a way to pin the auxiliary thread to the second logical core (hopefully in this mode there won't be too much cache thrashing).
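For reference, pinning the SQPOLL kernel thread is already supported at ring-setup time; a minimal liburing sketch (assuming a recent kernel and liburing, not part of my hack) looks like:

```cpp
#include <liburing.h>
#include <cstdio>
#include <cstring>

int main() {
  struct io_uring ring;
  struct io_uring_params params;
  std::memset(&params, 0, sizeof(params));

  // Ask the kernel for a submission-polling thread and pin it to CPU 1,
  // e.g. the sibling hyperthread of a worker pinned to CPU 0.
  // Note: SQPOLL needs elevated privileges on older kernels.
  params.flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF;
  params.sq_thread_cpu = 1;
  params.sq_thread_idle = 2000; // ms of inactivity before the poller sleeps

  int ret = io_uring_queue_init_params(256, &ring, &params);
  if (ret < 0) {
    std::fprintf(stderr, "io_uring_queue_init_params: %s\n", std::strerror(-ret));
    return 1;
  }
  io_uring_queue_exit(&ring);
  return 0;
}
```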
PS: my hack doesn't have any integration with libevent; all deferred deletes are done explicitly in ioring's event loop. I don't know how to do such an integration properly.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
The work hasn't stalled and is progressing slowly. The current state is here: now io_uring doesn't run its own event loop and is integrated with libevent (through eventfd).
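The general shape of that integration (a simplified sketch, not the exact code in my branch) is to register an eventfd with the ring and let libevent watch it, draining the completion queue from the event callback:

```cpp
#include <liburing.h>
#include <sys/eventfd.h>
#include <unistd.h>
#include <event2/event.h>
#include <cstdint>

// libevent callback: fires whenever the kernel signals new completions.
static void onCompletions(evutil_socket_t fd, short /*events*/, void* arg) {
  auto* ring = static_cast<struct io_uring*>(arg);
  uint64_t count;
  (void)read(fd, &count, sizeof(count)); // reset the eventfd counter

  struct io_uring_cqe* cqe;
  while (io_uring_peek_cqe(ring, &cqe) == 0) {
    // ... dispatch cqe->res / cqe->user_data to the owning socket here ...
    io_uring_cqe_seen(ring, cqe);
  }
}

void attachRingToLoop(struct io_uring* ring, struct event_base* base) {
  int efd = eventfd(0, EFD_NONBLOCK);
  io_uring_register_eventfd(ring, efd); // kernel bumps efd once per completion
  struct event* ev = event_new(base, efd, EV_READ | EV_PERSIST, onCompletions, ring);
  event_add(ev, nullptr);
}
```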
Finally submitted a draft PR (#19082). I did one more set of test runs, but this time with CPU pinning, like:
$ taskset --cpu-list 0 bazel-bin/source/exe/envoy-static -c rojkov-data/envoy-config-perf-measurement-io_uring.yaml --concurrency 1 -l warn
$ ../nighthawk/bazel-bin/nighthawk_client --duration 30 --rps 100000 --open-loop http://127.0.0.1:10000/
[10:26:38.007986][415877][I] Starting 1 threads / event loops. Time limit: 30 seconds.
[10:26:38.008026][415877][I] Global targets: 100 connections and 100000 calls per second.
[10:27:08.558070][415883][I] Stopping after 30000 ms. Initiated: 2999999 / Completed: 2999898. (Completion rate was 99996.52666921377 per second.)
Nighthawk - A layer 7 protocol benchmarking tool.
benchmark_http_client.latency_2xx (513985 samples)
min: 0s 002ms 005us | mean: 0s 005ms 611us | max: 0s 019ms 096us | pstdev: 0s 000ms 827us
Percentile Count Value
0.5 257043 0s 005ms 607us
0.75 385529 0s 006ms 064us
0.8 411198 0s 006ms 175us
0.9 462596 0s 006ms 486us
0.95 488290 0s 006ms 809us
0.990625 509167 0s 008ms 401us
0.99902344 513484 0s 009ms 688us
Queueing and connection setup latency (514085 samples)
min: 0s 000ms 001us | mean: 0s 000ms 059us | max: 0s 032ms 322us | pstdev: 0s 000ms 556us
Percentile Count Value
0.5 257191 0s 000ms 002us
0.75 385612 0s 000ms 004us
0.8 411428 0s 000ms 004us
0.9 462688 0s 000ms 005us
0.95 488382 0s 000ms 011us
0.990625 509266 0s 001ms 639us
0.99902344 513583 0s 006ms 520us
Request start to response end (513985 samples)
min: 0s 002ms 005us | mean: 0s 005ms 611us | max: 0s 019ms 096us | pstdev: 0s 000ms 827us
Percentile Count Value
0.5 256993 0s 005ms 607us
0.75 385493 0s 006ms 064us
0.8 411220 0s 006ms 175us
0.9 462614 0s 006ms 486us
0.95 488295 0s 006ms 809us
0.990625 509169 0s 008ms 401us
0.99902344 513484 0s 009ms 688us
Response body size in bytes (513985 samples)
min: 10 | mean: 10.0 | max: 10 | pstdev: 0.0
Response header size in bytes (513985 samples)
min: 141 | mean: 141.0000136190747 | max: 142 | pstdev: 0.0036903778949415986
Initiation to completion (2999898 samples)
min: 0s 000ms 000us | mean: 0s 001ms 053us | max: 0s 036ms 605us | pstdev: 0s 002ms 162us
Percentile Count Value
0.5 1499956 0s 000ms 010us
0.75 2249937 0s 000ms 352us
0.8 2399921 0s 000ms 617us
0.9 2699915 0s 005ms 515us
0.95 2849946 0s 006ms 057us
0.990625 2971779 0s 006ms 959us
0.99902344 2996970 0s 009ms 218us
Counter Value Per second
benchmark.http_2xx 513985 17132.82
benchmark.pool_overflow 2485913 82863.70
cluster_manager.cluster_added 1 0.03
default.total_match_count 1 0.03
membership_change 1 0.03
runtime.load_success 1 0.03
runtime.override_dir_not_exists 1 0.03
upstream_cx_http1_total 100 3.33
upstream_cx_overflow 448170 14938.99
upstream_cx_rx_bytes_total 98685127 3289501.74
upstream_cx_total 100 3.33
upstream_cx_tx_bytes_total 21077485 702582.30
upstream_rq_pending_overflow 2485913 82863.70
upstream_rq_pending_total 22230 741.00
upstream_rq_total 514085 17136.15
[10:27:13.689368][415883][I] Wait for the connection pool drain timed out, proceeding to hard shutdown.
[10:27:13.697869][415877][I] Done.
Activated with this config:
bootstrap_extensions:
- name: envoy.extensions.io_socket.io_uring
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.network.socket_interface.v3.IoUringSocketInterface
    use_submission_queue_polling: false
    read_buffer_size: 8192
    io_uring_size: 300
default_socket_interface: "envoy.extensions.network.socket_interface.io_uring"
$ ../nighthawk/bazel-bin/nighthawk_client --duration 30 --rps 100000 --open-loop http://127.0.0.1:10000/
[10:21:15.578123][415501][I] Starting 1 threads / event loops. Time limit: 30 seconds.
[10:21:15.578159][415501][I] Global targets: 100 connections and 100000 calls per second.
[10:21:46.128184][415507][I] Stopping after 30000 ms. Initiated: 2999998 / Completed: 2999897. (Completion rate was 99996.56666666667 per second.)
Nighthawk - A layer 7 protocol benchmarking tool.
benchmark_http_client.latency_2xx (667211 samples)
min: 0s 001ms 736us | mean: 0s 004ms 377us | max: 0s 082ms 018us | pstdev: 0s 000ms 759us
Percentile Count Value
0.5 333665 0s 004ms 248us
0.75 500462 0s 004ms 483us
0.8 533858 0s 004ms 543us
0.9 600517 0s 004ms 734us
0.95 633857 0s 005ms 940us
0.990625 660959 0s 006ms 625us
0.99902344 666560 0s 007ms 933us
Queueing and connection setup latency (667311 samples)
min: 0s 000ms 001us | mean: 0s 000ms 045us | max: 0s 033ms 665us | pstdev: 0s 000ms 344us
Percentile Count Value
0.5 334022 0s 000ms 001us
0.75 500527 0s 000ms 002us
0.8 534039 0s 000ms 002us
0.9 600626 0s 000ms 002us
0.95 633948 0s 000ms 004us
0.990625 661055 0s 001ms 394us
0.99902344 666660 0s 004ms 177us
Request start to response end (667211 samples)
min: 0s 001ms 735us | mean: 0s 004ms 377us | max: 0s 082ms 018us | pstdev: 0s 000ms 759us
Percentile Count Value
0.5 333701 0s 004ms 248us
0.75 500511 0s 004ms 483us
0.8 533897 0s 004ms 543us
0.9 600501 0s 004ms 734us
0.95 633857 0s 005ms 940us
0.990625 660961 0s 006ms 625us
0.99902344 666560 0s 007ms 933us
Response body size in bytes (667211 samples)
min: 10 | mean: 10.0 | max: 10 | pstdev: 0.0
Response header size in bytes (667211 samples)
min: 141 | mean: 141.0 | max: 141 | pstdev: 0.0
Initiation to completion (2999897 samples)
min: 0s 000ms 000us | mean: 0s 001ms 011us | max: 0s 082ms 055us | pstdev: 0s 001ms 881us
Percentile Count Value
0.5 1500020 0s 000ms 003us
0.75 2249924 0s 000ms 125us
0.8 2399933 0s 003ms 918us
0.9 2699953 0s 004ms 314us
0.95 2849977 0s 004ms 535us
0.990625 2971783 0s 006ms 157us
0.99902344 2996968 0s 007ms 815us
Counter Value Per second
benchmark.http_2xx 667211 22240.37
benchmark.pool_overflow 2332686 77756.20
cluster_manager.cluster_added 1 0.03
default.total_match_count 1 0.03
membership_change 1 0.03
runtime.load_success 1 0.03
runtime.override_dir_not_exists 1 0.03
upstream_cx_http1_total 100 3.33
upstream_cx_overflow 606844 20228.13
upstream_cx_rx_bytes_total 128104512 4270150.35
upstream_cx_total 100 3.33
upstream_cx_tx_bytes_total 27359751 911991.69
upstream_rq_pending_overflow 2332686 77756.20
upstream_rq_pending_total 18133 604.43
upstream_rq_total 667311 22243.70
[10:21:51.249381][415507][I] Wait for the connection pool drain timed out, proceeding to hard shutdown.
[10:21:51.258415][415501][I] Done.
Nice, this looks really promising!
Title: Migrate event loops from file events to buffer events
Description:
This can help to spare at least 3 syscalls per request (two writev()s and one readv()).

Currently Envoy relies on Libevent as an implementation for event loops and uses it in the "readiness" paradigm. That is, Libevent (on behalf of the kernel) notifies the application (Envoy) that a certain file descriptor is ready to be written or read; then the application makes a syscall to read from or write to the file descriptor. Envoy spends roughly 20% of a request's time span waiting for these syscalls to return.
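To make the "readiness" flow concrete, the per-connection pattern is roughly this (a generic libevent sketch, not Envoy's actual dispatcher code): the loop only learns that the fd is readable, and the application still has to issue the readv() itself.

```cpp
#include <event2/event.h>
#include <sys/uio.h>

// Readiness paradigm: libevent says "fd is readable", the application still
// pays for a readv() syscall (and context switch) on every wakeup.
static void onReadable(evutil_socket_t fd, short /*events*/, void* /*ctx*/) {
  char chunk[8192];
  struct iovec iov = {chunk, sizeof(chunk)};
  ssize_t n = readv(fd, &iov, 1);
  (void)n; // ... hand the data to the codec ...
}

void watchFd(struct event_base* base, evutil_socket_t fd) {
  struct event* ev = event_new(base, fd, EV_READ | EV_PERSIST, onReadable, nullptr);
  event_add(ev, nullptr);
}
```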
Libevent also provides an API called "bufferevents" for event loops working in the "completeness" paradigm (or buffered/ringed IO). That is, an application registers read and write buffers, and Libevent then notifies the application that there is data available in the read buffer or that a chunk of data put in the write buffer has been consumed. In this case, if the underlying OS supports this paradigm (as Microsoft's IOCP does), fewer computing resources are spent on syscalls like writev() or readv(): there is no need for context switches. Otherwise it's still file events hidden under the hood.

Linux supports the "completeness" paradigm natively too with its io_uring syscall, but Libevent hasn't been updated to support it yet. There is an issue for that though.

Perhaps we could modify Envoy's event loop to work in the "completeness" paradigm by relying on "bufferevents" for streaming connections and hoping for io_uring/ioring support to be added soon. Currently I don't know whether "bufferevents" incur additional overhead compared with the traditional "readiness" approach. With my quick hack Envoy works even ~5% faster (14500 rps vs 15500 rps), but I presume that's because I have a few things broken, e.g. flow control.
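For reference, the "bufferevents" flavor of the same read path looks roughly like this (a generic sketch of libevent's bufferevent API, not my hack): the application only touches the evbuffers and is called back once data is already there, so no explicit readv() appears in application code; whether a syscall is actually saved depends on the backend.

```cpp
#include <event2/bufferevent.h>
#include <event2/buffer.h>
#include <event2/event.h>

// Completeness paradigm: libevent owns the buffers and tells us when data
// has arrived; the application never calls readv()/writev() directly.
static void onRead(struct bufferevent* bev, void* /*ctx*/) {
  struct evbuffer* input = bufferevent_get_input(bev);
  // ... drain `input` into the codec ...
  evbuffer_drain(input, evbuffer_get_length(input));
}

void attachSocket(struct event_base* base, evutil_socket_t fd) {
  struct bufferevent* bev = bufferevent_socket_new(base, fd, BEV_OPT_CLOSE_ON_FREE);
  bufferevent_setcb(bev, onRead, /*writecb=*/nullptr, /*eventcb=*/nullptr, nullptr);
  bufferevent_enable(bev, EV_READ | EV_WRITE);
}
```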
Alternatively we could abstract Libevent out somehow and resort to home grown event loops using io_uring/ioring directly when it's available and falling back to Libevent's "readiness" API otherwise. Probably with this approach it would be easier to implement as an extension a hardware accelerated event loop bypassing the kernel completely for network transfers.
/cc @antoniovicente