datahangar / sfunnel

K8s service funneling using eBPF
BSD 2-Clause "Simplified" License

Severe performance degradation when TCP is funneled over UDP (GSO/TSO) #11

Open msune opened 2 days ago

msune commented 2 days ago

Summary

There is severe performance degradation when TCP is funneled over UDP for flows within the same host.

I was able to repro here: msune/ebpf_gso:main, using this synthetic scenario.

~/personal/ebpf_gso/test$ make check_perf_calibration 
------------------------------------------------------------
Server listening on TCP port 80
TCP window size:  128 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 10.0.1.2, TCP port 80
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  1] local 10.0.0.1 port 47032 connected with 10.0.1.2 port 80 (icwnd/mss/irtt=13/1388/53)
[  1] local 10.0.1.2 port 80 connected with 10.0.0.1 port 47032 (icwnd/mss/irtt=13/1388/33)
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-10.0003 sec  5.73 GBytes  4.92 Gbits/sec
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-10.0109 sec  5.73 GBytes  4.91 Gbits/sec
make[1]: Entering directory '/home/marc/personal/ebpf_gso/test'
make[1]: Leaving directory '/home/marc/personal/ebpf_gso/test'
~/personal/ebpf_gso/test$ make check_perf
------------------------------------------------------------
Server listening on TCP port 8080
TCP window size:  128 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 10.0.1.2, TCP port 8080
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  1] local 10.0.0.1 port 35932 connected with 10.0.1.2 port 8080 (icwnd/mss/irtt=13/1388/36)
[  1] local 10.0.1.2 port 8080 connected with 10.0.0.1 port 35932 (icwnd/mss/irtt=13/1388/15)
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-20.4668 sec  69.2 KBytes  27.7 Kbits/sec
make[1]: Entering directory '/home/marc/personal/ebpf_gso/test'
Waiting for server threads to complete. Interrupt again to force quit.
make[1]: Leaving directory '/home/marc/personal/ebpf_gso/test'
~/personal/ebpf_gso/test$ [ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-24.5630 sec  60.0 Bytes  19.5 bits/sec

Env:

Root cause analysis

The repro pushes a UDP header in ns1 and pops it in ns2.
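For context, the push side conceptually looks like this. This is a minimal sketch, not sfunnel's actual code: ports are illustrative, IP options are ignored, and the IP protocol/length/checksum fix-ups are omitted.

```c
// Minimal sketch of the "push" side (illustrative; NOT sfunnel's actual
// code). A TC egress program grows the packet between the IP and TCP
// headers and writes a UDP header into the gap; the peer does the reverse.
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("tc")
int funnel_push_udp(struct __sk_buff *skb)
{
	/* L4 payload length = total - Ethernet - IPv4 (no options assumed) */
	__u16 l4_len = skb->len - sizeof(struct ethhdr) - sizeof(struct iphdr);
	struct udphdr uh = {
		.source = bpf_htons(4242),	/* illustrative port */
		.dest   = bpf_htons(80),
		.len    = bpf_htons(l4_len + sizeof(struct udphdr)),
		.check  = 0,			/* valid "no csum" for UDP over IPv4 */
	};

	/* Grow the skb between L3 and L4; the helper internally does an
	 * skb_push (visible as bpf_skb_generic_push/skb_push in the pwru
	 * trace below). */
	if (bpf_skb_adjust_room(skb, sizeof(uh), BPF_ADJ_ROOM_NET, 0))
		return TC_ACT_SHOT;

	/* Write the UDP header right after the IP header. NOTE: ip->protocol,
	 * ip->tot_len and the IP checksum must be fixed up too; omitted. */
	if (bpf_skb_store_bytes(skb, sizeof(struct ethhdr) + sizeof(struct iphdr),
				&uh, sizeof(uh), 0))
		return TC_ACT_SHOT;

	return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL";
```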

pwru clearly shows that the skb is marked as SKB_GSO_TCPV4 (0x1) (*). When the UDP header is pushed:

0xffff9027fa937400 2   ~in/iperf:135125 4026532397 0              307        0x0800 1440  2816  10.0.0.1:58330->10.0.1.2:80(udp)   udp4_ufo_fragment
(struct skb_shared_info){
 .nr_frags = (__u8)1,
 .gso_size = (short unsigned int)1380,
 .gso_type = (unsigned int)3, <---------------------------- SKB_GSO_TCPV4 | SKB_GSO_DODGY
 .dataref = (atomic_t){
  .counter = (int)65538,
 },
 .frags = (skb_frag_t[])[
  {
   0x00000000faa803d8,
   2776,
   60,
  },
 ],
}

When the kernel attempts to UFO the packet:

0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  7000  10.0.0.1:42592->10.0.1.2:80(udp)   inet_gso_segment
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  6980  10.0.0.1:42592->10.0.1.2:80(udp)   udp4_ufo_fragment
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  7014  10.0.0.1:42592->10.0.1.2:80(udp)   kfree_skb_reason(SKB_DROP_REASON_NOT_SPECIFIED)
Full pwru trace:

```
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              0          0x0000 1440  6992  10.0.0.1:42592->10.0.1.2:8080(tcp) ip_local_out
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              0          0x0000 1440  6992  10.0.0.1:42592->10.0.1.2:8080(tcp) __ip_local_out
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              0          0x0800 1440  6992  10.0.0.1:42592->10.0.1.2:8080(tcp) ip_output
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  6992  10.0.0.1:42592->10.0.1.2:8080(tcp) nf_hook_slow
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  6992  10.0.0.1:42592->10.0.1.2:8080(tcp) apparmor_ip_postroute
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  6992  10.0.0.1:42592->10.0.1.2:8080(tcp) ip_finish_output
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  6992  10.0.0.1:42592->10.0.1.2:8080(tcp) __ip_finish_output
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  6992  10.0.0.1:42592->10.0.1.2:8080(tcp) ip_finish_output2
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  6992  10.0.0.1:42592->10.0.1.2:8080(tcp) neigh_resolve_output
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  6992  10.0.0.1:42592->10.0.1.2:8080(tcp) eth_header
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  6992  10.0.0.1:42592->10.0.1.2:8080(tcp) skb_push
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  7006  10.0.0.1:42592->10.0.1.2:8080(tcp) __dev_queue_xmit
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  7006  10.0.0.1:42592->10.0.1.2:8080(tcp) tcf_classify
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  7006  10.0.0.1:42592->10.0.1.2:8080(tcp) skb_ensure_writable
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  7006  10.0.0.1:42592->10.0.1.2:8080(udp) skb_ensure_writable
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  7006  10.0.0.1:42592->10.0.1.2:8080(udp) bpf_skb_generic_push
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  7006  10.0.0.1:42592->10.0.1.2:8080(udp) skb_push
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  7014  10.0.0.1:42592->10.0.1.2:80(udp)   netdev_core_pick_tx
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  7014  10.0.0.1:42592->10.0.1.2:80(udp)   validate_xmit_skb
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  7014  10.0.0.1:42592->10.0.1.2:80(udp)   netif_skb_features
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  7014  10.0.0.1:42592->10.0.1.2:80(udp)   passthru_features_check
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  7014  10.0.0.1:42592->10.0.1.2:80(udp)   skb_network_protocol
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  7014  10.0.0.1:42592->10.0.1.2:80(udp)   __skb_gso_segment
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  7014  10.0.0.1:42592->10.0.1.2:80(udp)   skb_mac_gso_segment
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  7014  10.0.0.1:42592->10.0.1.2:80(udp)   skb_network_protocol
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  7000  10.0.0.1:42592->10.0.1.2:80(udp)   inet_gso_segment
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  6980  10.0.0.1:42592->10.0.1.2:80(udp)   udp4_ufo_fragment
0xffff9027df8a4800 6   ~bin/iperf:69662 4026532606 0              107        0x0800 1440  7014  10.0.0.1:42592->10.0.1.2:80(udp)   kfree_skb_reason(SKB_DROP_REASON_NOT_SPECIFIED)
```

The kernel drops it, as it can't find SKB_GSO_UDP/SKB_GSO_UDP_L4 in the gso_type; see the check here: https://github.com/torvalds/linux/blob/de2f378f2b771b39594c04695feee86476743a69/net/ipv4/udp_offload.c#L429.
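Roughly, the guard looks like this (paraphrased from the linked udp_offload.c, not the exact source):

```c
/* Paraphrased from udp4_ufo_fragment() in net/ipv4/udp_offload.c: an skb
 * reaching the UDP offload path without a UDP GSO type is rejected, and
 * the caller frees it. */
struct sk_buff *segs = ERR_PTR(-EINVAL);

if (!(skb_shinfo(skb)->gso_type & (SKB_GSO_UDP | SKB_GSO_UDP_L4)))
	goto out;	/* segs stays ERR_PTR(-EINVAL) -> kfree_skb_reason() */
```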

(*) There seems to be a bug in the kernel; even with all GSO/TSO offloads disabled, egress SKBs keep being marked as SKB_GSO_TCPV4.

msune commented 2 days ago

(*) There seems to be a bug in the kernel; even with all GSO/TSO offloads disabled, egress SKBs keep being marked as SKB_GSO_TCPV4.

This should/will be investigated separately, as it's not strictly related to sfunnel/eBPF.

msune commented 2 days ago

Trying to find a workaround

The main issue is that there is no direct access to skb->gso_type from a TC BPF program.
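To illustrate the constraint, a minimal sketch (assuming the current UAPI; names illustrative): struct __sk_buff exposes the GSO segment count and size, but no GSO type field.

```c
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

/* Sketch: what a TC program can and cannot see on struct __sk_buff. */
SEC("tc")
int inspect_gso(struct __sk_buff *skb)
{
	__u32 segs = skb->gso_segs;	/* readable: number of GSO segments */
	__u32 size = skb->gso_size;	/* readable: size of each segment */

	/* ...but there is no skb->gso_type member, so a TC program can
	 * neither check nor rewrite *which* kind of GSO the skb carries. */
	bpf_printk("gso_segs=%u gso_size=%u", segs, size);
	return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL";
```

Some strategies I tried so far: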

Not working: bpf_skb_adjust_room() with encap flags

None of the flags listed in the doc works for this purpose.
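For reference, an attempt along these lines (a sketch; snippet from inside the TC program, flag combination illustrative):

```c
/* Sketch of the attempt (illustrative). The ENCAP flags appear to target
 * full tunnel encapsulation, i.e. a new outer L3+L4 header in front of the
 * whole packet, which is not what the funnel does: it splices a bare UDP
 * header between the existing IP and TCP headers. */
int ret = bpf_skb_adjust_room(skb,
			      sizeof(struct iphdr) + sizeof(struct udphdr),
			      BPF_ADJ_ROOM_MAC,
			      BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 |
			      BPF_F_ADJ_ROOM_ENCAP_L4_UDP);
```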

Not working (bug?): abusing bpf_skb_change_tail()

bpf_skb_change_tail() doc mentions:

This helper is a slow path utility intended for replies with control messages. And because it is targeted for slow path, the helper itself can afford to be slow: it implicitly linearizes, unclones and drops offloads from the skb.
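The attempt (a sketch; snippet from inside the TC program) is a no-op resize that exploits exactly that side effect:

```c
/* No-op resize: keep the current length, but rely on the documented side
 * effect that the helper linearizes the skb and drops its offload state
 * (gso_type goes back to 0). */
if (bpf_skb_change_tail(skb, skb->len, 0))
	return TC_ACT_SHOT;
```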

The skb's gso_type is correctly reset to 0x0, but the skb is then >> MTU. The (big) packet is later dropped by the MTU check (naturally, since the packet is no longer GSOed) in __dev_forward_skb2(), which ends up calling __is_skb_forwardable(). Here is the check:
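Roughly (paraphrased from __is_skb_forwardable() in net/core/dev.c; simplified, not the exact source):

```c
/* Paraphrased from __is_skb_forwardable() in net/core/dev.c: a packet
 * larger than MTU + L2 headroom only survives this check if it is
 * still GSO. */
len = dev->mtu + dev->hard_header_len + VLAN_HLEN;
if (skb->len <= len)
	return true;

/* if TSO is enabled, we don't care about the length as the packet
 * could be forwarded without being segmented before */
if (skb_is_gso(skb))
	return true;

return false;
```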

(See pkt size 2842 >> iface mtu 1440)

0xffff902880e2a000 1   ksoftirqd/1:21   4026532600 0              487        0x0800 1440  2842  10.0.0.1:60148->10.0.1.2:8080(tcp) __dev_forward_skb
0xffff902880e2a000 1   ksoftirqd/1:21   4026532600 0              487        0x0800 1440  2842  10.0.0.1:60148->10.0.1.2:8080(tcp) __dev_forward_skb2
0xffff902880e2a000 1   ksoftirqd/1:21   4026532600 0              487        0x0800 1440  2842  10.0.0.1:60148->10.0.1.2:8080(tcp) kfree_skb_reason(SKB_DROP_REASON_NOT_SPECIFIED)

A simple repro of this issue, without any encap/decap, is here: msune/ebpf_gso:change_tail_gso.

I think this is a bug: bpf_skb_change_tail() should break the oversized packet into segments before it is sent (and presumably this should happen after all TC BPF hooks have run). This is probably worth a discussion in the Cilium #ebpf Slack channel.