envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0
24.82k stars 4.77k forks source link

IoUring support #28395

Open soulxu opened 1 year ago

soulxu commented 1 year ago

Create this issue for tracking and discussing the IoUring effort.

We did a third iteration of the IoUring implementation https://github.com/envoyproxy/envoy/pull/24958. We get all the IoUringAcceptSocket, IoUringServerSocket, and IoUringClientSocket implemented. We pass most of the tests in the CI. For the x86 test, we get a green https://github.com/envoyproxy/envoy/pull/24958/checks?check_run_id=14653030831

For other tests, we think it failed on an existing race issue in the Envoy https://github.com/envoyproxy/envoy/issues/15072, and that issue can't be fixed quickly and easily, we leave that alone for now.

So summarize the CI status as below:

Also notes:

We also benchmarked the current implementation:

Fortio <-> Envoy <-> Fortio server Envoy runs in 1-thread, and Fortio generates requests for 300s with connections varying from 32 to 128. Hardware: i9-9900HK

The RPS data are collected with max load condition and the latency data are collected with a fixed 5000 QPS, the CPU usage is around 50%.

| Connection RPS/Latency    | 32                | 64                | 128               |
|------------------------   |-----------------  |-----------------  |-----------------  |
| Default                   | 13945.3 / 2.007   | 14354.4 / 3.967   | 13515.7 / 8.323   |
| io_uring                  | 15582.6 / 1.808   | 16337.7 / 3.826   | 15963.7 / 7.171   |
| Delta%                    | 12% / 10%         | 14% / 4%          | 18% / 14%         |

Summary of the TODOs we left: There are also missing a few features, and those are features that make IoUring finally production-ready:

Something may improve the performance

cc @zhxie

soulxu commented 1 year ago

I'm thinking of what we should do next. Is it ok that we begin to merge the IoUring code? Then our implementation can be under the community's review, coaching and help, then we can resolve those existing issues in the correct way and flaky issues in the end. I'm worried about if we continue to fix those issues for months without the community feedback when we begin to merge the code, we will review the code again and may find something fixed in the incorrect way, then we spend more time.

We still see some flaky in the ARM CI (less in x86 CI), also thinking about whether we can run separate test tasks for the iouring, and those tasks won't block any PR merge for now. Until we fully implement the iouring and resolve the issue like https://github.com/envoyproxy/envoy/issues/15072, and the CI becomes steady enough, then we change those iouring test tasks will block the PR merge if they failed.

cc @KBaichoo @mattklein123 @yanavlasov @phlax looking for some feedback, thanks in advance!

KBaichoo commented 1 year ago

This is great news! Thank you for working on this.

Is it ok that we begin to merge the IoUring code? Then our implementation can be under the community's review, coaching and help, then we can resolve those existing issues in the correct way and flaky issues in the end. I'm worried about if we continue to fix those issues for months without the community feedback when we begin to merge the code, we will review the code again and may find something fixed in the incorrect way, then we spend more time.

I agree we should break up the change into some smaller PRs and get those merged in. We can block certain failing tests from running with io_uring if not compatible yet while we go about this process.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

zhxie commented 1 year ago

Not stale