patch/optimize(bpf): improve wan tcp hijack datapath performance

jschwinger233 commented 3 months ago

Background

This PR introduces two new BPF programs to accelerate WAN TCP.

Overall, the data plane of the original WAN TCP hijack path looks like this:

 ┌─────────┐                   ┌─────────┐ 
 │ process │                   │ process │ 
 └────┬────┘                   └────▲────┘ 
      │                             │      
 ┌────▼────┐                   ┌────┴────┐ 
 │ socket  │                   │ socket  │ 
 └────┬────┘                   └────▲────┘ 
      │                             │      
 ┌────▼────┐                   ┌────┴────┐ 
 │ tcp/ip  │                   │ tcp/ip  │ 
 └────┬────┘                   └────▲────┘ 
      │                             │      
 ┌────▼────┐    ┌────┬────┐    ┌────┴────┐ 
 │ routing ├────►veth│veth├────► routing │ 
 └─────────┘    └────┴────┘    └─────────┘

This PR shortens the path above to:

 ┌─────────┐                   ┌─────────┐ 
 │ process │                   │ process │ 
 └────┬────┘                   └────▲────┘ 
      │                             │      
 ┌────▼────┐                   ┌────┴────┐ 
 │ socket  ├───────────────────► socket  │ 
 └─────────┘                   └─────────┘ 

 ┌─────────┐                   ┌─────────┐ 
 │ tcp/ip  │                   │ tcp/ip  │ 
 └─────────┘                   └─────────┘ 

 ┌─────────┐    ┌────┬────┐    ┌─────────┐ 
 │ routing │    │veth│veth│    │ routing │ 
 └─────────┘    └────┴────┘    └─────────┘

See Benchmark for the results.

Implementation Details

Two BPF programs are used together:

BPF_PROG_TYPE_SOCK_OPS: This program type attaches to a cgroup and can be triggered when a TCP socket completes its three-way handshake. We check routing_tuples_map to decide whether the socket belongs to a WAN-proxied connection; if it does, bpf_sock_hash_update adds the socket to the sockmap.
BPF_PROG_TYPE_SK_MSG: This program type attaches to a sockmap, here the one holding the hijacked WAN proxy sockets collected in the first step. It is triggered when a socket sends a message and calls bpf_msg_redirect_hash to deliver the data directly to the peer socket.

Note that TCP handshake and teardown still go through the kernel stack and are not accelerated. Acceleration applies only once the connection is established.
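The two-step flow above can be illustrated with a small user-space model. This is only a sketch of the lookup logic, not the PR's BPF code: `tuple_key`, `sock_table`, `on_established`, and `on_sendmsg` are hypothetical names, the table is a toy stand-in for the kernel sockhash, and it assumes each local socket is registered under its own 4-tuple so that the peer can be found by reversing the sender's tuple.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 4-tuple key; the PR's real key layout is not shown in this thread. */
struct tuple_key {
    uint32_t sip, dip;
    uint16_t sport, dport;
};

#define MAX_SOCKS 64

/* Toy stand-in for a sockhash: maps a tuple to a socket id. */
struct sock_table {
    struct tuple_key keys[MAX_SOCKS];
    int sock_ids[MAX_SOCKS];
    int used;
};

static int key_equal(const struct tuple_key *a, const struct tuple_key *b)
{
    return a->sip == b->sip && a->dip == b->dip &&
           a->sport == b->sport && a->dport == b->dport;
}

/* Models the sock_ops step: when a connection becomes established,
 * register the socket only if it belongs to a WAN-proxied connection.
 * Returns 0 when the socket was added, -1 when it was skipped. */
int on_established(struct sock_table *t, struct tuple_key key,
                   int sock_id, int is_wan_proxied)
{
    assert(t != NULL);
    if (!is_wan_proxied || t->used >= MAX_SOCKS)
        return -1;
    t->keys[t->used] = key;
    t->sock_ids[t->used] = sock_id;
    t->used++;
    return 0;
}

/* Models the sk_msg step: on send, look up the peer socket by the
 * reversed tuple. Returns the peer's socket id, or -1 to signal that
 * the message should take the normal kernel path instead. */
int on_sendmsg(const struct sock_table *t, struct tuple_key key)
{
    assert(t != NULL);
    struct tuple_key rev = {
        .sip = key.dip, .dip = key.sip,
        .sport = key.dport, .dport = key.sport,
    };
    for (int i = 0; i < t->used; i++)
        if (key_equal(&t->keys[i], &rev))
            return t->sock_ids[i];
    return -1;
}
```

In this model, a socket that was never registered (for example, one that is not WAN-proxied) simply gets -1 from `on_sendmsg`, which corresponds to traffic staying on the ordinary kernel stack.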

Benchmark

Latency was measured with sockperf.

dae-0.4.0 results:

# nsenter -t $(pidof dae-0.4.0) -n sockperf ping-pong -i 172.18.0.3 --tcp --time 10
sockperf: == version #3.7-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 172.18.0.3      PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=10.000 sec; Warm up time=400 msec; SentMessages=134874; ReceivedMessages=134873
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=9.550 sec; SentMessages=128877; ReceivedMessages=128877
sockperf: ====> avg-latency=37.006 (std-dev=5.955)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 37.006 usec
sockperf: Total 128877 observations; each percentile contains 1288.77 observations
sockperf: ---> <MAX> observation =  420.339
sockperf: ---> percentile 99.999 =  313.563
sockperf: ---> percentile 99.990 =  206.996
sockperf: ---> percentile 99.900 =   79.486
sockperf: ---> percentile 99.000 =   50.174
sockperf: ---> percentile 90.000 =   42.508
sockperf: ---> percentile 75.000 =   39.476
sockperf: ---> percentile 50.000 =   36.514
sockperf: ---> percentile 25.000 =   34.145
sockperf: ---> <MIN> observation =   21.565

Results with this PR:

# nsenter -t $(pidof dae) -n sockperf ping-pong -i 172.18.0.3 --tcp --time 10
sockperf: == version #3.7-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 172.18.0.3      PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=10.000 sec; Warm up time=400 msec; SentMessages=143488; ReceivedMessages=143487
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=9.550 sec; SentMessages=137069; ReceivedMessages=137069
sockperf: ====> avg-latency=34.788 (std-dev=6.701)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 34.788 usec
sockperf: Total 137069 observations; each percentile contains 1370.69 observations
sockperf: ---> <MAX> observation =  425.241
sockperf: ---> percentile 99.999 =  407.120
sockperf: ---> percentile 99.990 =  244.703
sockperf: ---> percentile 99.900 =   80.511
sockperf: ---> percentile 99.000 =   47.190
sockperf: ---> percentile 90.000 =   40.633
sockperf: ---> percentile 75.000 =   37.325
sockperf: ---> percentile 50.000 =   34.607
sockperf: ---> percentile 25.000 =   31.777
sockperf: ---> <MIN> observation =   20.779

TCP latency improves by about 6% (average latency drops from 37.006 usec to 34.788 usec).
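As a quick check of the 6% figure, the relative reduction can be computed from the two sockperf summaries above. The helper name `latency_improvement_pct` is made up for this illustration.

```c
#include <assert.h>

/* Relative latency reduction in percent:
 * (baseline - candidate) / baseline * 100. */
double latency_improvement_pct(double baseline_usec, double candidate_usec)
{
    assert(baseline_usec > 0.0);
    return (baseline_usec - candidate_usec) / baseline_usec * 100.0;
}
```

With the averages reported above, `latency_improvement_pct(37.006, 34.788)` comes out at roughly 5.99, and the 99th percentile pair (50.174 and 47.190) at roughly 5.95, so both are consistent with "about 6%".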

However, latency is only one aspect of performance. Running an iperf TCP RR (round-trip) test on my virtual machine exhausts memory outright:

[Mon Mar 25 18:17:02 2024] Out of memory: Killed process 1233 (dae) total-vm:1315492kB, anon-rss:86784kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:296kB oom_score_adj:0
[Mon Mar 25 18:17:02 2024] TCP: out of memory -- consider tuning tcp_mem

In real-world scenarios, such as redis-server with redis-benchmark, p99 latency often improves by 10% or more.

Checklist

[ ] The Pull Request has been fully tested
[ ] There's an entry in the CHANGELOGS
[ ] There is a user-facing docs PR against https://github.com/daeuniverse/dae

mzz2017 commented 3 months ago

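This optimization is very exciting. It may already be the best-performing approach available on Linux today (for the proxied WAN case). Socket redirection cuts the path down to the shortest possible one; this is optimization taken to the extreme!

Does this optimization require a newer kernel? If so, we may need to add some checks and messages (as the existing code does) and update some documentation.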

jschwinger233 commented 3 months ago

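> Does this optimization require a newer kernel? If so, we may need to add some checks and messages (as the existing code does) and update some documentation.

CI tested 5.10 and it seems fine. dae currently requires >= 5.8, so I will build a 5.8 kernel myself and try it.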

jschwinger233 commented 3 months ago

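> Does this optimization require a newer kernel? If so, we may need to add some checks and messages (as the existing code does) and update some documentation.

I built 5.8 (annoyingly, this version is EOL; I had to patch objtool/elf.c by hand to get it to compile, and it filled up my disk). It does not run and fails with "in-kernel BTF is malformed". I think that is simply because 5.8 is old and EOL, so binutils did not generate BTF correctly at build time; it does not mean this really cannot run on 5.8.

That said, since it will probably be hard for me to test 5.8 in the future, it would be better to raise the kernel requirement slightly to 5.10. 5.10 is an LTS release that is supported until 31 Dec 2026 ( https://endoflife.date/linux ), and the current CI Kernel-test already covers it.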

amtoaer commented 3 months ago

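I ran into a problem with this build of dae. Reduced to its essentials: two docker containers, A and B, run on the dae host and provide web services, both with network_mode: bridge. A's port mapping is a:a and B's is b:b. Docker's default bridge is: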

docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255

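In this setup, a request to http://172.17.0.1:b/ from inside container A should reach B's web service, but with this PR's build the request gets no response.

With the daily main build, such requests are unaffected whether or not dae is enabled. With this PR, the request gets no response once dae is enabled.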

jschwinger233 commented 3 months ago

@amtoaer Is dae configured with lan_interface: docker0?

amtoaer commented 3 months ago

@jschwinger233 Yes, my configuration is:

    lan_interface: docker0,br0
    wan_interface: br0

jschwinger233 commented 3 months ago

@amtoaer OK, I forgot about this scenario. It can be handled.

amtoaer commented 3 months ago

@jschwinger233 It works now, thanks!

mzz2017 commented 3 months ago

@jschwinger233 Sure, raising it to 5.10 is fine.

mzz2017 commented 3 months ago

@jschwinger233 Please raise the requirement to 5.10 in the relevant code and documentation, thanks.
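For illustration, a kernel version gate of the kind discussed in this thread could look roughly like the sketch below. It is not dae's actual check (which is not shown here); the function names are hypothetical, and it assumes the release string reported by uname(2) starts with "MAJOR.MINOR".

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Compares a release string such as "5.10.0-28-amd64" with a required
 * MAJOR.MINOR. Returns 1 if it is new enough, 0 if it is older,
 * and -1 if the string cannot be parsed. */
int release_at_least(const char *release, int want_major, int want_minor)
{
    int major = 0, minor = 0;

    assert(release != NULL);
    if (sscanf(release, "%d.%d", &major, &minor) != 2)
        return -1;
    if (major != want_major)
        return major > want_major;
    return minor >= want_minor;
}

/* Applies the same comparison to the running kernel via uname(2).
 * Returns -1 if uname fails or the release string is unparsable. */
int running_kernel_at_least(int want_major, int want_minor)
{
    struct utsname u;

    if (uname(&u) != 0)
        return -1;
    return release_at_least(u.release, want_major, want_minor);
}
```

A caller could treat a 0 from `running_kernel_at_least(5, 10)` as the cue to print a message and skip the sockmap fast path, and treat -1 as "unknown" rather than as a failure.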
