Martian packets in DSR mode

thardie commented 5 years ago

I have a test cluster setup. 1 master and 3 workers. Kube-router is running on all 4 nodes. I'm running 1 external IP for nginx (3 instances) with BGP amongst all kube-routers and BGP up to an upstream router. So, packet flow inbound is:

router->1 of the 4 nodes IPVS->IPIP tunnel to 1 of the 3 nginx instances->nginx

Inbound always works fine.

Outbound: nginx instance->host->router

Sometimes, and I don't know what triggers it, the host starts dropping the replies. I enabled martian logging, and the replies are being logged as martians. I tried disabling rp_filter on every interface on the host (including all and default), and there are still martians.

IPVS table:

IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.116.3.1:443 rr
  -> 10.116.20.150:6443           Masq    1      1          0
TCP  10.116.3.10:53 rr
  -> 10.116.128.6:53              Masq    1      0          0
  -> 10.116.128.7:53              Masq    1      0          0
TCP  10.116.3.178:443 rr
  -> 10.116.130.38:8443           Masq    1      0          0
TCP  10.116.3.216:80 rr
  -> 10.116.129.97:80             Masq    1      0          0
  -> 10.116.130.41:80             Masq    1      0          0
  -> 10.116.131.28:80             Masq    1      0          0
TCP  10.116.4.2:443 rr
  -> 10.116.130.38:8443           Masq    1      1          0
TCP  10.116.20.152:30799 rr
  -> 10.116.129.97:80             Masq    1      0          0
  -> 10.116.130.41:80             Masq    1      0          0
  -> 10.116.131.28:80             Masq    1      0          0
TCP  10.116.20.152:31278 rr
  -> 10.116.130.38:8443           Masq    1      0          0
UDP  10.116.3.10:53 rr
  -> 10.116.128.6:53              Masq    1      0          0
  -> 10.116.128.7:53              Masq    1      0          0
FWM  8742 rr
  -> 10.116.129.97:80             Tunnel  1      0          0
  -> 10.116.130.41:80             Tunnel  1      0          0
  -> 10.116.131.28:80             Tunnel  1      0          0

mangle table:

Chain PREROUTING (policy ACCEPT 2628 packets, 1176K bytes)
 pkts bytes target     prot opt in     out     source               destination
  159  7124 MARK       tcp  --  *      *       0.0.0.0/0            10.116.4.1           tcp dpt:80 MARK set 0x2226
    3   156 MARK       tcp  --  *      *       0.0.0.0/0            10.116.4.2           tcp dpt:8443 MARK set 0x280

Chain INPUT (policy ACCEPT 2480 packets, 1159K bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain FORWARD (policy ACCEPT 148 packets, 17027 bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 2547 packets, 338K bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain POSTROUTING (policy ACCEPT 2695 packets, 355K bytes)
 pkts bytes target     prot opt in     out     source               destination
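A note for anyone cross-checking the two tables above: iptables prints the mark in hex (MARK set 0x2226) while the IPVS table lists the fwmark service in decimal (FWM 8742). A minimal sketch for converting between the two follows; the helper name `fwmark_to_dec` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: convert a hex fwmark as printed by iptables
# (e.g. 0x2226) to the decimal form shown for FWM services in the IPVS table.
fwmark_to_dec() {
    printf '%d\n' "$1"
}

fwmark_to_dec 0x2226   # mark set for 10.116.4.1 tcp dpt:80
fwmark_to_dec 0x280    # mark set for 10.116.4.2 tcp dpt:8443
```

0x2226 is 8742, which matches the FWM 8742 service. 0x280 is 640, and no FWM 640 entry appears in the IPVS table posted above.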

tcpdump showing issue:

# tcpdump -eni any host 10.116.4.1 or ip proto 4
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
00:49:35.829486  In 00:1e:be:a5:d0:00 ethertype IPv4 (0x0800), length 68: 10.104.5.122.51226 > 10.116.4.1.80: Flags [S], seq 832170241, win 64240, options [mss 1357,nop,wscale 8,nop,nop,sackOK], length 0
00:49:35.829566 Out 0a:58:0a:74:82:01 ethertype IPv4 (0x0800), length 88: 10.116.130.1 > 10.116.130.41: 10.104.5.122.51226 > 10.116.4.1.80: Flags [S], seq 832170241, win 64240, options [mss 1357,nop,wscale 8,nop,nop,sackOK], length 0 (ipip-proto-4)
00:49:35.829572 Out 0a:58:0a:74:82:01 ethertype IPv4 (0x0800), length 88: 10.116.130.1 > 10.116.130.41: 10.104.5.122.51226 > 10.116.4.1.80: Flags [S], seq 832170241, win 64240, options [mss 1357,nop,wscale 8,nop,nop,sackOK], length 0 (ipip-proto-4)
00:49:35.829646   P 0a:58:0a:74:82:29 ethertype IPv4 (0x0800), length 68: 10.116.4.1.80 > 10.104.5.122.51226: Flags [S.], seq 1511523134, ack 832170242, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
00:49:35.829653  In 0a:58:0a:74:82:29 ethertype IPv4 (0x0800), length 68: 10.116.4.1.80 > 10.104.5.122.51226: Flags [S.], seq 1511523134, ack 832170242, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
00:49:36.826269   P 0a:58:0a:74:82:29 ethertype IPv4 (0x0800), length 68: 10.116.4.1.80 > 10.104.5.122.51226: Flags [S.], seq 1511523134, ack 832170242, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
00:49:36.826285  In 0a:58:0a:74:82:29 ethertype IPv4 (0x0800), length 68: 10.116.4.1.80 > 10.104.5.122.51226: Flags [S.], seq 1511523134, ack 832170242, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
00:49:38.826284   P 0a:58:0a:74:82:29 ethertype IPv4 (0x0800), length 68: 10.116.4.1.80 > 10.104.5.122.51226: Flags [S.], seq 1511523134, ack 832170242, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
00:49:38.826303  In 0a:58:0a:74:82:29 ethertype IPv4 (0x0800), length 68: 10.116.4.1.80 > 10.104.5.122.51226: Flags [S.], seq 1511523134, ack 832170242, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
00:49:38.830451  In 00:1e:be:a5:d0:00 ethertype IPv4 (0x0800), length 68: 10.104.5.122.51226 > 10.116.4.1.80: Flags [S], seq 832170241, win 64240, options [mss 1357,nop,wscale 8,nop,nop,sackOK], length 0
00:49:38.830479 Out 0a:58:0a:74:82:01 ethertype IPv4 (0x0800), length 88: 10.116.130.1 > 10.116.130.41: 10.104.5.122.51226 > 10.116.4.1.80: Flags [S], seq 832170241, win 64240, options [mss 1357,nop,wscale 8,nop,nop,sackOK], length 0 (ipip-proto-4)
00:49:38.830482 Out 0a:58:0a:74:82:01 ethertype IPv4 (0x0800), length 88: 10.116.130.1 > 10.116.130.41: 10.104.5.122.51226 > 10.116.4.1.80: Flags [S], seq 832170241, win 64240, options [mss 1357,nop,wscale 8,nop,nop,sackOK], length 0 (ipip-proto-4)
00:49:38.830507   P 0a:58:0a:74:82:29 ethertype IPv4 (0x0800), length 68: 10.116.4.1.80 > 10.104.5.122.51226: Flags [S.], seq 1511523134, ack 832170242, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
00:49:38.830511  In 0a:58:0a:74:82:29 ethertype IPv4 (0x0800), length 68: 10.116.4.1.80 > 10.104.5.122.51226: Flags [S.], seq 1511523134, ack 832170242, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0

dmesg showing martians:

[81632.897744] ll header: 00000000: 0a 58 0a 74 82 01 0a 58 0a 74 82 29 08 00        .X.t...X.t.)..
[81635.233492] IPv4: martian source 10.104.5.122 from 10.116.4.1, on dev kube-bridge
[81635.233514] ll header: 00000000: 0a 58 0a 74 82 01 0a 58 0a 74 82 29 08 00        .X.t...X.t.)..
[81636.897461] IPv4: martian source 10.104.5.122 from 10.116.4.1, on dev kube-bridge
[81636.897468] ll header: 00000000: 0a 58 0a 74 82 01 0a 58 0a 74 82 29 08 00        .X.t...X.t.)..
[81638.897894] IPv4: martian source 10.104.5.122 from 10.116.4.1, on dev kube-bridge
[81638.897899] ll header: 00000000: 0a 58 0a 74 82 01 0a 58 0a 74 82 29 08 00        .X.t...X.t.)..
[81646.897322] IPv4: martian source 10.104.5.122 from 10.116.4.1, on dev kube-bridge
[81646.897349] ll header: 00000000: 0a 58 0a 74 82 01 0a 58 0a 74 82 29 08 00        .X.t...X.t.)..
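To tie the dmesg lines back to the tcpdump capture, the ll header can be split into its Ethernet fields (6 bytes destination MAC, 6 bytes source MAC, 2 bytes ethertype). The sketch below does that for the header logged above. Both function names are hypothetical, and reading the last four MAC bytes as an IPv4 address is only an observation about this capture, not a general rule.

```shell
#!/bin/sh
# Hypothetical helper: split the 14 hex bytes of an "ll header" dmesg line
# into destination MAC, source MAC and ethertype.
decode_ll_header() {
    set -- $1    # word-split the hex bytes into positional parameters
    [ "$#" -ge 14 ] || { echo "need 14 bytes" >&2; return 1; }
    dst="$1:$2:$3:$4:$5:$6"
    src="$7:$8:$9:${10}:${11}:${12}"
    printf 'dst=%s src=%s ethertype=0x%s%s\n' "$dst" "$src" "${13}" "${14}"
}

# Hypothetical helper: print the last four bytes of a MAC as a dotted quad.
mac_tail_to_ipv4() {
    set -- $(echo "$1" | tr ':' ' ')
    printf '%d.%d.%d.%d\n' "0x$3" "0x$4" "0x$5" "0x$6"
}

decode_ll_header "0a 58 0a 74 82 01 0a 58 0a 74 82 29 08 00"
mac_tail_to_ipv4 0a:58:0a:74:82:29
mac_tail_to_ipv4 0a:58:0a:74:82:01
```

The source MAC 0a:58:0a:74:82:29 is the one on the SYN-ACK lines in the capture, and the destination MAC 0a:58:0a:74:82:01 is the one on the outgoing IPIP lines. In this capture the MAC tails decode to 10.116.130.41 and 10.116.130.1, the same addresses as the IPIP outer header, so the packet being flagged appears to be the pod's SYN-ACK (source 10.116.4.1, destination 10.104.5.122) arriving on kube-bridge.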

murali-reddy commented 5 years ago

@thardie thanks for reporting the issue.

Dealing with martian packets has been the single biggest challenge in the DSR functionality of kube-router. kube-router adds policy-based routing rules to avoid martian packets. Most likely they are missing, or kube-router failed to configure them in your setup.

If you still have the setup, or are able to reproduce this scenario, would you mind sharing the output of the commands below?

ip rule list
ip route list table 77
ip route list table 78

In your case, to avoid [81635.233492] IPv4: martian source 10.104.5.122 from 10.116.4.1, on dev kube-bridge, I would expect a route in table 78, created by kube-router, that tricks the kernel into believing 10.104.5.122 is reachable on kube-bridge.

thardie commented 5 years ago

I added the following 2 lines to each worker and master's /etc/sysctl.conf:

net.ipv4.conf.default.rp_filter=0
net.ipv4.conf.all.rp_filter=0

and rebooted them all. I have been unable to reproduce the martians since then. I'm now reverting that change to see if I can reproduce the martian issue.
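For anyone checking the same settings: as far as I understand the kernel's ip-sysctl documentation, the value used for source validation on an interface is the maximum of conf/all/rp_filter and conf/<interface>/rp_filter, and conf/default only seeds interfaces created afterwards. A minimal sketch that prints the effective value per interface follows. The function name `effective_rp_filter` and its optional directory argument are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the effective rp_filter per interface.
# The kernel uses max(conf/all/rp_filter, conf/<iface>/rp_filter) for
# source validation, so a per-interface 1 still applies when all=0.
effective_rp_filter() {
    conf="${1:-/proc/sys/net/ipv4/conf}"   # optional alternate directory
    [ -r "$conf/all/rp_filter" ] || { echo "cannot read $conf/all/rp_filter" >&2; return 1; }
    all=$(cat "$conf/all/rp_filter")
    for dir in "$conf"/*/; do
        iface=$(basename "$dir")
        # "all" and "default" are not real interfaces; skip them.
        case "$iface" in all|default) continue ;; esac
        [ -r "${dir}rp_filter" ] || continue
        val=$(cat "${dir}rp_filter")
        if [ "$all" -gt "$val" ]; then eff=$all; else eff=$val; fi
        printf '%s rp_filter=%s effective=%s\n' "$iface" "$val" "$eff"
    done
}

# Read-only; prints nothing useful on systems without /proc/sys/net.
effective_rp_filter || true
```

If kube-bridge or a veth still shows effective=1 after changing only all and default at runtime, that would be consistent with the interfaces having been created before the default was changed.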

thardie commented 5 years ago

@murali-reddy I just re-read your comment. The address 10.104.5.122 is the outside client IP (where the SYN came from, and where the SYN-ACK is going back to). My k8s addresses are all in 10.116.0.0/16, so I shouldn't expect to see client (outside) addresses in table 78, should I?

I'll continue to try and reproduce and get the ip rules and table output once reproduced again.

murali-reddy commented 5 years ago

@thardie sorry, it should be 10.116.4.1 in routing tables 77 and 78.

thardie commented 5 years ago

I've been able to reproduce this issue. I checked table 78 and 77. Table 77 is empty, and 78 has:

local default dev lo scope host

I tried adding a route to table 77 (which looks like the table for reply traffic coming out of the containers), but it doesn't seem to help:

10.116.4.1 dev kube-bridge scope link

Adding it to table 78 seems wrong, since that's for incoming traffic and it would mess up the IP-in-IP encapsulation, right? In fact, I start to see ARPs for 10.116.4.1 on kube-bridge if I add the same route to table 78.

murali-reddy commented 5 years ago

@thardie sorry, I may have given you the wrong table numbers earlier. You should see the tables below (name, ID).

please see https://github.com/cloudnativelabs/kube-router/blob/v0.2.3/pkg/controllers/proxy/network_services_controller.go#L1728-L1731

    customDSRRouteTableID    = "78"
    customDSRRouteTableName  = "kube-router-dsr"
    externalIPRouteTableId   = "79"
    externalIPRouteTableName = "external_ip"

The following combination of iptables mangle rules and policy-based routing achieves DSR.

For incoming traffic towards an external IP used by a service marked for DSR, the following rules apply:

generate a unique fwmark number per service and fwmark the packets
match traffic marked with that fwmark and use routing table 78

the default route in table 78 delivers the packet locally to the host

iptables -t mangle -A PREROUTING -d externalIP -p protocol -m protocol --dport port -j MARK --set-mark generated-fwmark
ip rule add prio 32764 fwmark generated-fwmark table customDSRRouteTableID
ip route add local default dev lo table customDSRRouteTableID

On the return path of packets from the pods, the rules below apply. The second rule in particular avoids the martian packets.

ip rule add prio 32765 from all lookup externalIPRouteTableId
ip route add externalIP dev kube-bridge table externalIPRouteTableId

Please compare your setup against this description and see if anything is missing.
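To make that comparison easier, here is a dry-run sketch that only prints the rules described above with the placeholders filled in, for checking against the output of iptables -t mangle -L, ip rule list and ip route list table 78 / 79. The function name `dsr_rules` is hypothetical, the table IDs come from the constants quoted above, and PREROUTING is assumed from the mangle table posted earlier in this thread. The example values (10.116.4.1, tcp, 80, fwmark 8742) are taken from the original report.

```shell
#!/bin/sh
# Dry run: print (do not execute) the DSR rules described above for one
# service. Nothing here touches iptables or the routing tables.
dsr_rules() {
    external_ip=$1 protocol=$2 port=$3 fwmark=$4
    dsr_table=78          # customDSRRouteTableID
    external_ip_table=79  # externalIPRouteTableId
    # Inbound: mark traffic to the external IP and deliver it locally.
    echo "iptables -t mangle -A PREROUTING -d $external_ip -p $protocol -m $protocol --dport $port -j MARK --set-mark $fwmark"
    echo "ip rule add prio 32764 fwmark $fwmark table $dsr_table"
    echo "ip route add local default dev lo table $dsr_table"
    # Return path: make the external IP look reachable via kube-bridge.
    echo "ip rule add prio 32765 from all lookup $external_ip_table"
    echo "ip route add $external_ip dev kube-bridge table $external_ip_table"
}

dsr_rules 10.116.4.1 tcp 80 8742
```

If the last two entries (the prio 32765 rule and the table 79 route for the external IP) are absent on the node, that would match the missing return-path piece described above.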

icefed commented 4 years ago

Hi @thardie, did you use a LoadBalancer service to publish the service? I have the same issue when I use DSR mode with MetalLB in layer 2 mode.

LoadBalancer is not supported in the code; I added it and it tested OK. https://github.com/cloudnativelabs/kube-router/blob/4afd6d6d2ab9c94abc5985c30c56ca2605a70a3f/pkg/controllers/proxy/network_services_controller.go#L2198

Can we support LoadBalancer? Is there any risk? @murali-reddy

aauren commented 1 year ago

Closing as stale

gebi commented 1 week ago

JFYI... we had similar problems with a manual IPVS setup in DSR mode and just could not get it to work without the kernel throwing away our packets as "martian source". So we added a simple XDP/BPF program (in our case on a WireGuard interface) that prepends an Ethernet header itself and uses xdp_redirect to eth0 to get the damn packets flowing out to the network. It worked without issues.

cloudnativelabs / kube-router

Martian packets in DSR mode #511