daeuniverse / dae

eBPF-based Linux high-performance transparent proxy solution.
GNU Affero General Public License v3.0

patch/optimize(bpf): improve lan hijack datapath performance #466

Closed jschwinger233 closed 3 months ago

jschwinger233 commented 4 months ago


Background

This PR introduces 3 performance optimizations. First, let's review the datapath:

                ┌──────────────────┐ 
  1             │ 2                │ 
┌────┐     ┌────┼────┐      ┌───┐  │ 
│    ├─────►    │    ├──────►   │  │ 
│lan0│     │dae0│peer│      │dae│  │ 
│    ◄─────┤    │    ◄──────┤   │  │ 
└────┘     └────┼────┘      └───┘  │ 
             3  │     dae netns    │ 
                └──────────────────┘ 

a. bpf_lan_ingress: Makes the split-routing decision: direct traffic is released into the network stack, while proxied traffic is redirected to dae0 via bpf_redirect.
b. bpf_peer_ingress: Only proxied traffic can reach this point; it calls bpf_skc_lookup and bpf_sk_assign to steer the traffic to the dae socket.
c. bpf_dae0_ingress: Only **replies** to proxied traffic can reach this point; it calls bpf_redirect to send them back to wan0.

Optimization 1: The BPF programs at points a and b both parse the layer 2/3/4 packet headers. Parsing twice is unnecessary: after point a has parsed the headers, the information point b needs can be carried over in skb->cb.

Optimization 2: The peer_ingress BPF program at point b doesn't need to call bpf_skc_lookup to look up the socket for established TCP connections, because the kernel can perform the socket lookup itself. With tcp_early_demux enabled, it can even skip the routing decision and deliver locally right away.

Optimization 3: The lan_ingress program at point a redirects the skb from lan0 to dae0, which then crosses the netns boundary to reach the peer. This step can be shortened using bpf_redirect_peer: redirect the skb directly from lan0 to the peer inside the netns, avoiding the performance impact of enqueue_to_backlog.

Recommendation: Review by commit.

Checklist

Full Changelogs

Issue Reference

Closes #[issue number]

Test Result

sdgrfe commented 4 months ago

Test passed.

amtoaer commented 4 months ago

Works fine.

Mitsuhaxy commented 4 months ago

It's working fine

dae-prow[bot] commented 4 months ago

❌ Your branch is currently out-of-sync to main. No worry, I will fix it for you.


douglarek commented 3 months ago

Tested in the following environment, works very well.

A router: Linux ImmortalWrt 6.1.78 #0 SMP PREEMPT Mon Feb 19 15:48:41 2024 aarch64 GNU/Linux
A workstation: Linux Manjaro 6.7.7-1-MANJARO #1 SMP PREEMPT_DYNAMIC Fri Mar  1 18:26:06 UTC 2024 x86_64 GNU/Linux
jschwinger233 commented 3 months ago

Thank all folks who keep testing this PR, https://github.com/daeuniverse/dae/pull/466/commits/5badabfc8a21d5f2accc49329e8e8da58d415049 is the last low-hanging fruit whose temptation I can't resist. Hope this small patch doesn't break anything :crossed_fingers:

LPC 2020 had a talk introducing bpf_redirect_peer, which allows ingress-to-ingress redirection without going through the CPU's backlog queue. Cilium saw a +1.3 Gbit/s perf boost by using it.

douglarek commented 3 months ago

After binding docker0 to the LAN and testing https://github.com/daeuniverse/dae/commit/5badabfc8a21d5f2accc49329e8e8da58d415049, everything works perfectly. There are no issues with direct connection diversion. Well done.

A workstation: Linux Manjaro 6.7.7-1-MANJARO #1 SMP PREEMPT_DYNAMIC Fri Mar  1 18:26:06 UTC 2024 x86_64 GNU/Linux
amtoaer commented 3 months ago

Tested successfully with the latest CI build in the following environments:

Linux GracPC 6.7.5-zen1-1-zen #1 ZEN SMP PREEMPT_DYNAMIC Sat, 17 Feb 2024 14:02:21 +0000 x86_64 GNU/Linux
Linux NAS 6.7.4-arch1-1 #1 SMP PREEMPT_DYNAMIC Mon, 05 Feb 2024 22:07:49 +0000 x86_64 GNU/Linux
jschwinger233 commented 3 months ago

Benchmark (lan only)

1. Env: Linux 6.6.17 KVM, 4 cores, 12G memory.

2. Setup

Run two docker containers, one with dae inside, the other with v2ray. It's almost the same as dae's GitHub Actions test: just treat the two containers as two nodes.

I am using sockperf:

  1. Run sockperf server on the v2ray side: (for UDP test, delete --tcp)

    nsenter -t $(pidof v2ray) -n sockperf server -i 172.18.0.3 --tcp --daemonize
  2. Run sockperf client inside the "pod" to emulate lan proxy: (for UDP test, delete --tcp)

    nsenter -t $(pidof pod) -n sockperf ping-pong -i 172.18.0.3 --tcp --time 10

3. TCP

dae-0.4.0: avg-latency=37.310 (std-dev=7.352)
this PR:   avg-latency=36.792 (std-dev=7.437)

avg-latency improves by 1.3%.

This may not seem like much, because the testing environment is clean and free of netfilter hooks.

After adding a simple iptables rule on the dae node:

iptables -t raw -A PREROUTING -p tcp -m tcp --dport 11111 -j ACCEPT

dae-0.4.0 performs worse, with avg-latency sometimes as high as 38+, while dae-next (this PR) isn't affected at all thanks to the stack-bypass implementation. In that case, it's a 3.1% improvement.

4. UDP

The normal UDP test result is:

dae-0.4.0: avg-latency=58.275 (std-dev=50.721)
dae-next:  avg-latency=55.927 (std-dev=48.332)

4% boost.

However, it is also known that dae-0.4.0 falls back to encapsulation to avoid a port conflict if a process is already listening on port 53, which damages performance badly. When that fallback takes place, dae-0.4.0's avg-latency rises to 60.412 (std-dev=47.764), and dae-next shows a 7%+ better result.