cloudnativelabs / kube-router

Kube-router, a turnkey solution for Kubernetes networking.
https://kube-router.io
Apache License 2.0
2.33k stars 471 forks

v2.1: DSR+TCPMSS with non-ready services not set-up correctly #1671

Closed: rkojedzinszky closed this issue 6 months ago

rkojedzinszky commented 6 months ago

What happened? We have multiple services with the same externalIP, with DSR set up for them. When one of the services is not ready, the maxseg rules in nft for the other services are missing. ref: https://github.com/cloudnativelabs/kube-router/issues/1664

What did you expect to happen? Expected to work as in v1.6.

How can we reproduce the behavior you experienced? Steps to reproduce the behavior:

  1. Create a deployment+service with dsr and externalIP, e.g.:
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: ght
    spec:
      selector:
        matchLabels:
          app: ght
      template:
        metadata:
          labels:
            app: ght
        spec:
          containers:
          - image: ghcr.io/rkojedzinszky/go-http-test
            name: go-http-test
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: ght
      annotations:
        kube-router.io/service.dsr: tunnel
    spec:
      externalIPs:
      - 192.168.9.88
      ports:
      - name: http
        port: 8080
        protocol: TCP
        targetPort: 8080
      selector:
        app: ght
  2. Observe correct behavior, tcpdump on node with pod:
    # nsenter -n -t 3898354 tcpdump -pvni any host 192.168.9.88
    tcpdump: data link type LINUX_SLL2
    tcpdump: listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
    07:48:15.753384 kube-tunnel-if In  IP (tos 0x0, ttl 61, id 7374, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.10.10.58322 > 192.168.9.88.8080: Flags [S], cksum 0x653b (correct), seq 3111575680, win 64240, options [mss 1440,sackOK,TS val 1197502074 ecr 0,nop,wscale 7], length 0
    07:48:15.753490 eth0  Out IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.9.88.8080 > 192.168.10.10.58322: Flags [S.], cksum 0x94e1 (incorrect -> 0x531e), seq 1899687061, ack 3111575681, win 65160, options [mss 1460,sackOK,TS val 2093888449 ecr 1197502074,nop,wscale 7], length 0
    07:48:15.755483 kube-tunnel-if In  IP (tos 0x0, ttl 61, id 7375, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.10.10.58322 > 192.168.9.88.8080: Flags [.], cksum 0x7e7a (correct), ack 1, win 502, options [nop,nop,TS val 1197502077 ecr 2093888449], length 0

With the default MTU of 1500, 40 bytes (IP+TCP headers) plus an extra 20 bytes (IPIP encapsulation) are subtracted, so an MSS of 1440 is seen in the initial SYN packet.
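The arithmetic above can be sketched in Go (the constant names here are illustrative, not kube-router's actual identifiers):

```go
package main

import "fmt"

func main() {
	const (
		mtu          = 1500 // default interface MTU
		ipv4Header   = 20   // inner IPv4 header
		tcpHeader    = 20   // TCP header
		ipipOverhead = 20   // outer IPv4 header added by the IPIP tunnel
	)
	// MSS advertised on the tunneled DSR path: MTU minus IP+TCP headers
	// minus the IPIP encapsulation overhead.
	mss := mtu - ipv4Header - tcpHeader - ipipOverhead
	fmt.Println(mss) // prints 1440, matching the good trace above
}
```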

  3. Create a non-ready service, e.g.:

    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: ght-nr
      annotations:
        kube-router.io/service.dsr: tunnel
    spec:
      externalIPs:
      - 192.168.9.88
      ports:
      - name: http-nr
        port: 9090
        protocol: TCP
        targetPort: 9090
      selector:
        app: ght-nr
  4. Observe incorrect behavior, tcpdump on node with pod:

    # nsenter -n -t 3898354 tcpdump -pvni any host 192.168.9.88
    tcpdump: data link type LINUX_SLL2
    tcpdump: listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
    07:52:28.330275 kube-tunnel-if In  IP (tos 0x0, ttl 61, id 64827, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.10.10.57548 > 192.168.9.88.8080: Flags [S], cksum 0xe5d7 (correct), seq 3011939874, win 64240, options [mss 1460,sackOK,TS val 1197754650 ecr 0,nop,wscale 7], length 0
    07:52:28.330380 eth0  Out IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.9.88.8080 > 192.168.10.10.57548: Flags [S.], cksum 0x94e1 (incorrect -> 0x1685), seq 3856751251, ack 3011939875, win 65160, options [mss 1460,sackOK,TS val 2094141026 ecr 1197754650,nop,wscale 7], length 0
    07:52:28.332584 kube-tunnel-if In  IP (tos 0x0, ttl 61, id 64828, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.10.10.57548 > 192.168.9.88.8080: Flags [.], cksum 0x41e1 (correct), ack 1, win 502, options [nop,nop,TS val 1197754653 ecr 2094141026], length 0

You can see the incorrect MSS (1460 instead of 1440) in the initial SYN packet.

The nft rules also reflect this, as there is no maxseg rule for ip daddr 192.168.9.88 (only the fwmark rule for tcp dport 8080 remains):

table ip mangle {
        chain KUBE-IPTABLES-HINT {
        }

        chain KUBE-KUBELET-CANARY {
        }

        chain PREROUTING {
                type filter hook prerouting priority mangle; policy accept;
                ip daddr 192.168.9.1 tcp dport 1883 counter packets 5661 bytes 582074 meta mark set 0x2e69
                ip daddr 192.168.9.1 tcp flags syn / syn,rst counter packets 65 bytes 3356 tcp option maxseg size set 1440
                ip daddr 192.168.9.1 tcp dport 3100 counter packets 4996 bytes 2186703 meta mark set 0x3e6d
                ip daddr 192.168.9.0 tcp dport 53 counter packets 0 bytes 0 meta mark set 0x234
                ip daddr 192.168.9.0 tcp flags syn / syn,rst counter packets 0 bytes 0 tcp option maxseg size set 1440
                ip daddr 192.168.9.0 udp dport 53 counter packets 173 bytes 11214 meta mark set 0x1e8c
                ip daddr 192.168.9.2 tcp dport 1222 counter packets 0 bytes 0 meta mark set 0xa15
                ip daddr 192.168.9.2 tcp flags syn / syn,rst counter packets 0 bytes 0 tcp option maxseg size set 1440
                ip daddr 192.168.8.192 tcp dport 443 counter packets 0 bytes 0 meta mark set 0x2f16
                ip daddr 192.168.8.192 tcp dport 80 counter packets 0 bytes 0 meta mark set 0x17f1
                ip daddr 192.168.8.192 tcp flags syn / syn,rst counter packets 0 bytes 0 tcp option maxseg size set 1440
                ip daddr 192.168.9.88 tcp dport 8080 counter packets 17 bytes 862 meta mark set 0x18be
        }

        chain OUTPUT {
...
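A quick way to see this from the listing above (assuming shell access on the node, in the same transcript style as the dumps above) is to filter the chain for the VIP; only the fwmark rule comes back, with no maxseg rule:

```
# nft list chain ip mangle PREROUTING | grep 192.168.9.88
ip daddr 192.168.9.88 tcp dport 8080 counter packets 17 bytes 862 meta mark set 0x18be
```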


aauren commented 6 months ago

Ahh... This makes more sense. I take it this is what was going on behind #1664, after looking into it more?

Basically, the problem here is that kube-router, at its core, does not support overloading service VIPs with multiple service definitions. If it worked in past versions of kube-router, that was by happenstance and not by design, and I'm sure it broke in other scenarios that maybe didn't pertain to you.

There are multiple boundary cases in the code where going down this path won't work correctly, so rather than patching this one, I think it's just better to say that kube-router doesn't support this use case. The best path forward for you will be to create separate VIPs per service configuration.

rkojedzinszky commented 6 months ago

@aauren Thanks for the response. Unfortunately, we have built our infrastructure on this behavior. The strange thing to me is that when multiple services exist on the same VIP and one of them is not ready, only the maxseg rules are missing; the others are set up correctly. Couldn't this be supported easily? Our use case is to be conservative with public IPv4 addresses and open multiple ports on one IP, backed by different services and pods.

Until kube-router supports it, a workaround may be for me to place a global maxseg rule in a separate nft table myself, without defining ports explicitly. That should work. kube-router could also do this itself until a more sophisticated solution is implemented.
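For reference, such a global rule could be a minimal standalone nft snippet along these lines (the table and chain names are made up for illustration; the VIP and MSS values match the example above):

```
table ip mss_workaround {
	chain prerouting {
		# hook just ahead of kube-router's mangle-priority rules
		type filter hook prerouting priority mangle - 1; policy accept;
		# clamp MSS on every SYN to the shared VIP, with no port match
		ip daddr 192.168.9.88 tcp flags syn / syn,rst tcp option maxseg size set 1440
	}
}
```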

Anyway, thanks for the great project.

aauren commented 6 months ago

@rkojedzinszky - Thanks for the feedback. If you find a simple patch to kube-router that fixes this situation for you and doesn't impose too much change on the kube-router source code, you're welcome to float it in a PR and we'll take a look. However, I can't guarantee that this use case will always be provided for in kube-router, even if we're able to carry a patch this time.

It would probably be better to look into other ways you might address this use case outside of kube-router, or in addition to it. The global maxseg rule may not be a bad alternative for you in the short term.

rkojedzinszky commented 6 months ago

@aauren Does your statement that kube-router at its core does not support overloading service VIPs apply only to DSR services, or also to normal (non-DSR) services with the same externalIP? So is it still by accident that I have correct nft+ipvs rules set up for different services with the same VIP, and only the maxseg rules are missing?

rkojedzinszky commented 6 months ago

@aauren Oh, I checked the nft rules more carefully. I see that the maxseg rules are created for the IP address without a matching port specification. What if the maxseg rule also contained the port specification, so it could be added when the fwmark rules are added? Of course, that would result in more nft rules.


Does this look reasonable as a quick fix?

diff --git a/pkg/controllers/proxy/network_services_controller.go b/pkg/controllers/proxy/network_services_controller.go
index bcc77990..f69c8233 100644
--- a/pkg/controllers/proxy/network_services_controller.go
+++ b/pkg/controllers/proxy/network_services_controller.go
@@ -1831,7 +1831,8 @@ func setupMangleTableRule(ip string, protocol string, port string, fwmark string
        }

        // setup iptables rule TCPMSS for DSR mode to fix mtu problem
-       mtuArgs := []string{"-d", ip, "-m", tcpProtocol, "-p", tcpProtocol, "--tcp-flags", "SYN,RST", "SYN", "-j", "TCPMSS",
+       if protocol == tcpProtocol {
+               mtuArgs := []string{"-d", ip, "-m", tcpProtocol, "-p", tcpProtocol, "--dport", port, "--tcp-flags", "SYN,RST", "SYN", "-j", "TCPMSS",
                        "--set-mss", strconv.Itoa(tcpMSS)}
                err = iptablesCmdHandler.AppendUnique("mangle", "PREROUTING", mtuArgs...)
                if err != nil {
@@ -1842,6 +1843,7 @@ func setupMangleTableRule(ip string, protocol string, port string, fwmark string
                if err != nil {
                        return errors.New("Failed to run iptables command to set up TCPMSS due to " + err.Error())
                }
+       }
        return nil
 }

@@ -1876,7 +1878,8 @@ func (ln *linuxNetworking) cleanupMangleTableRule(ip string, protocol string, po
        }

        // cleanup iptables rule TCPMSS
-       mtuArgs := []string{"-d", ip, "-m", tcpProtocol, "-p", tcpProtocol, "--tcp-flags", "SYN,RST", "SYN", "-j", "TCPMSS",
+       if protocol == tcpProtocol {
+               mtuArgs := []string{"-d", ip, "-m", tcpProtocol, "-p", tcpProtocol, "--dport", port, "--tcp-flags", "SYN,RST", "SYN", "-j", "TCPMSS",
                        "--set-mss", strconv.Itoa(tcpMSS)}
                exists, err = iptablesCmdHandler.Exists("mangle", "PREROUTING", mtuArgs...)
                if err != nil {
@@ -1901,6 +1904,7 @@ func (ln *linuxNetworking) cleanupMangleTableRule(ip string, protocol string, po
                                return errors.New("Failed to cleanup iptables command to set up TCPMSS due to " + err.Error())
                        }
                }
+       }

        return nil
 }