flannel-io / flannel

flannel is a network fabric for containers, designed for Kubernetes
Apache License 2.0

ClusterIP services not accessible when using flannel CNI from host machines in Kubernetes #1243

Closed · nonsense closed this 1 year ago

nonsense commented 4 years ago

I am trying to access a Kubernetes service through its ClusterIP, from a pod that is attached to its host's network and has access to DNS, with:

  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet

However, the host machine has no IP routes set up for the service CIDR, for example:

➜  ~ k get services
NAME             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
kubernetes       ClusterIP   100.64.0.1      <none>        443/TCP    25m
redis-headless   ClusterIP   None            <none>        6379/TCP   19m
redis-master     ClusterIP   100.64.63.204   <none>        6379/TCP   19m
➜  ~ k get pods -o wide
NAME                       READY   STATUS      RESTARTS   AGE   IP              NODE                                             NOMINATED NODE   READINESS GATES
redis-master-0             1/1     Running     0          18m   100.96.1.3      ip-172-20-39-241.eu-central-1.compute.internal   <none>           <none>
root@ip-172-20-39-241:/home/admin# ip route
default via 172.20.32.1 dev eth0
10.32.0.0/12 dev weave proto kernel scope link src 10.46.0.0
100.96.0.0/24 via 100.96.0.0 dev flannel.11 onlink
100.96.1.0/24 dev cni0 proto kernel scope link src 100.96.1.1
100.96.2.0/24 via 100.96.2.0 dev flannel.11 onlink
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
172.20.32.0/19 dev eth0 proto kernel scope link src 172.20.39.241

Expected Behavior

I expect to be able to reach services running on Kubernetes from the host machines, but I can only access headless services - those that return a pod IP.

The pod CIDR has IP routes set up, but the service CIDR doesn't.

Current Behavior

Services are not accessible through their ClusterIPs from the host network.

Possible Solution

If I manually add an IP route to 100.64.0.0/16 via 100.96.1.1, ClusterIPs are accessible. But this route is not there by default.
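
For illustration, the manual workaround described above amounts to running something like the following on each node (a sketch using this cluster's values - service CIDR 100.64.0.0/16 and cni0 address 100.96.1.1 - which will differ elsewhere):

    # route the service CIDR via this node's cni0 bridge address
    ip route add 100.64.0.0/16 via 100.96.1.1 dev cni0
    # verify the route is installed
    ip route | grep 100.64.0.0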

Your Environment

choryuidentify commented 4 years ago

Exactly the same as my experience. My setup is Kubernetes 1.17.2 + Flannel. When 'hostNetwork: true' is set, this behavior appears.

nonsense commented 4 years ago

Our workaround is to manually add the route to DNS through a DaemonSet as soon as there is at least one pod running on all workers (so that the cni0 interface appears).

MansM commented 4 years ago

issue on kubernetes/kubernetes: https://github.com/kubernetes/kubernetes/issues/87852

Our workaround is to manually add the route to DNS through a DaemonSet as soon as there is at least one pod running on all workers (so that the cni0 interface appears).

@nonsense have an example?

MansM commented 4 years ago

Using @mikebryant's workaround did the trick for me for now: https://github.com/coreos/flannel/issues/1245#issuecomment-582612891

rdxmb commented 4 years ago

Just changed to host-gw and realized that the problem was much bigger than I supposed: there is a big routing problem with Kubernetes 1.17 and flannel with vxlan, which affects ClusterIPs, NodePorts and even LoadBalancer IPs managed by MetalLB.

Changing to host-gw fixes all of them. I wonder why this is not fixed or at least documented in a very prominent way.

Here is my report of the response time of a MinIO service (in seconds) before and after the change. The checks run on the nodes themselves.

Screenshot_20200221_085815

Screenshot_20200221_085848

rdxmb commented 4 years ago

In a second datacenter, the response time was even more than a minute. I had to increase the monitoring timeout to get these values.

Screenshot_20200221_090543

Screenshot_20200221_090631

nonsense commented 4 years ago

issue on kubernetes/kubernetes: kubernetes/kubernetes#87852

Our workaround is to manually add the route to DNS through a DaemonSet as soon as there is at least one pod running on all workers (so that the cni0 interface appears).

@nonsense have an example?

Yes, here it is: https://github.com/ipfs/testground/blob/master/infra/k8s/sidecar.yaml#L23

Note that this won't work unless you have at least one pod on every host (i.e. another DaemonSet), so that cni0 exists. I know this is a hack, but I don't have a better solution.

In our case the first pod we expect on every host is s3fs - https://github.com/ipfs/testground/blob/master/infra/k8s/kops-weave/s3bucket.yml
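
For reference, a minimal sketch of such a route-adding DaemonSet (hypothetical names; it assumes the service CIDR 100.64.0.0/16 from this cluster and mirrors the ip route ... dev cni0 workaround mentioned later in this thread, whereas the linked manifest does things slightly differently):

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: add-service-cidr-route   # hypothetical name
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          app: add-service-cidr-route
      template:
        metadata:
          labels:
            app: add-service-cidr-route
        spec:
          hostNetwork: true
          tolerations:
            - operator: Exists        # run on every node, including masters
          containers:
            - name: add-route
              image: busybox:1.31
              securityContext:
                capabilities:
                  add: ["NET_ADMIN"]  # needed to modify the host routing table
              command:
                - sh
                - -c
                # assumed service CIDR; replace with your cluster's value
                - ip route replace 100.64.0.0/16 dev cni0; while true; do sleep 3600; done

ip route replace is used instead of add so the container stays idempotent across restarts; it still fails if cni0 does not exist yet, which is why at least one regular pod must already be scheduled on the node.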

MansM commented 4 years ago

@nonsense I fixed it by changing the backend of flannel to host-gw instead of vxlan:

kubectl edit cm -n kube-system kube-flannel-cfg

maybe this works for you as well
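
For reference, in the standard kube-flannel manifest the backend type lives in the net-conf.json key of that ConfigMap, so the edit boils down to something like this (a sketch; the Network value is the default pod CIDR and may differ in your cluster):

    net-conf.json: |
      {
        "Network": "10.244.0.0/16",
        "Backend": {
          "Type": "host-gw"
        }
      }

The flannel pods only read this config at startup, so they have to be restarted afterwards (as noted below).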

MansM commented 4 years ago

I'm setting up a new cluster with flannel and am not able to get any communication to work. I tried the host-gw change:

kubectl edit cm -n kube-system kube-flannel-cfg
  • replace vxlan with host-gw
  • save
  • not sure if needed, but I did it anyway: kubectl delete pods -l app=flannel -n kube-system

maybe this works for you as well

but the issue persists. Would there be additional changes required? This is just a basic cluster setup and flannel configuration, all from scratch.

If you have issues with all network traffic, and not just with reaching services from pods with hostNetwork: true, then you have some other issue.

archever commented 4 years ago

The same problem for me. Adding a route to cni0 fixed it for me:

ip r add 10.96.0.0/16 dev cni0

tobiasvdp commented 4 years ago

The 'host-gw' option is only possible on infrastructures that support layer 2 interaction between the nodes. Most cloud providers don't.

davesargrad commented 4 years ago

Hi. It turns out that host-gw fixed my problem as well: #1268. To me this is a critical bug somewhere in the vxlan-based pipeline.

Capitrium commented 4 years ago

I had similar issues after upgrading our cluster from 1.16.x to 1.17.x (specifically https://github.com/uswitch/kiam/issues/378). Using host-gw is not an option for me as our cluster runs on AWS, but I was able to fix it by reverting kube-proxy back to 1.16.8.

I also can't reproduce this issue on our dev cluster after replacing kube-proxy with kube-router running in service-proxy mode (tested with v1.0.0-rc1).

Could this issue be caused by changes in kube-proxy?

mariusgrigoriu commented 4 years ago

Just curious, how many folks running into this issue are using hyperkube?

tkislan commented 4 years ago

I tried reverting from 1.17.3 to 1.16.8, but I was still experiencing the same problem. The only way to fix this is to have a DaemonSet running that calls ip r add 10.96.0.0/12 dev cni0 on every node to fix the routing. After that, it starts to route correctly.

LuckySB commented 4 years ago

Tried on a node and in a pod with hostNetwork: true (pod network 10.244.2.0/24); coredns is running on another node with pod network 10.244.1.0/24.

Without ip route add 10.96.0.0/16, the IP packet is sent to the coredns pod with src IP 10.244.2.0:

    IP 10.244.2.0.31782 > 10.244.1.3.domain: 38996+ [1au] A? kubernetes.default. (59)

and tcpdump does not show this packet on the other side of the vxlan tunnel (tcpdump -ni flannel.1).

With the route

10.96.0.0/16 dev cni0 scope link

the src IP is changed to the address from cni0, not the flannel.1 interface:

4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
    link/ether 0a:af:85:2e:82:f5 brd ff:ff:ff:ff:ff:ff
    inet 10.244.2.0/32 scope global flannel.1
       valid_lft forever preferred_lft forever
5: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether ee:39:df:66:22:f3 brd ff:ff:ff:ff:ff:ff
    inet 10.244.2.1/24 scope global cni0
       valid_lft forever preferred_lft forever

and access to the service network works fine.

Well, direct access to the DNS pod works (dig @10.244.1.8 kubernetes.default.svc.cluster.local, and tcpdump shows the UDP request with the 10.244.2.0 src address), but access to the cluster IP 10.96.0.10 does not!

I tried removing the iptables rule created by kube-proxy:

iptables -t nat -D POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING

and I got an answer from coredns:


;; ANSWER SECTION:
kubernetes.default.svc.cluster.local. 30 IN A   10.96.0.1

It also works with iptables -t nat -I POSTROUTING 1 -o eth0 -j ACCEPT.
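
To check whether that SNAT rule is actually being hit while reproducing the DNS lookup, the packet counters on the chain can be inspected (a generic iptables sketch, not specific to this cluster):

    # zero the counters, reproduce the lookup, then look at the per-rule packet counts
    iptables -t nat -Z KUBE-POSTROUTING
    iptables -t nat -L KUBE-POSTROUTING -n -v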

LuckySB commented 4 years ago

I don't understand it at all. If you simply insert the command

iptables -t nat -I KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE

then everything starts to work:

iptables -t nat -S

-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A POSTROUTING -s 10.244.0.0/16 -d 10.244.0.0/16 -j RETURN
-A POSTROUTING -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE
-A POSTROUTING ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE
-A KUBE-MARK-DROP -j MARK --set-xmark 0x8000/0x8000
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE

aojea commented 4 years ago

I went ahead and submitted a patch to flannel to fix/alleviate the problem. However, I can't reproduce or test it; can somebody try my patch and report back in the PR?

coreos/flannel#1282

mengmann commented 4 years ago

I'm having this issue with the vxlan backend with both flannel 0.11 and 0.12 as well. Affected Kubernetes versions: 1.16.x, 1.17.x and 1.18.x.

Finally, setting up a static route on my nodes to the service network through the cni0 interface helped me instantly: ip route add 10.96.0.0/12 dev cni0

OS: CentOS 7
Install method: kubeadm
Underlying platform: VirtualBox 6

kadisi commented 4 years ago

Executing this command on every node will fix it:

ethtool -K flannel.1 tx-checksum-ip-generic off
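
For anyone who wants to confirm that offload is the culprit before turning it off, the current setting can be inspected first (a generic ethtool sketch; note the change does not survive a reboot, so it has to be reapplied or automated):

    # show whether tx checksumming is currently offloaded on the vxlan interface
    ethtool -k flannel.1 | grep tx-checksum-ip-generic
    # the workaround above: disable the offload
    ethtool -K flannel.1 tx-checksum-ip-generic off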

Gacko commented 4 years ago

Can anyone explain why this issue does not occur on older Kubernetes releases? At least I'm not facing it in 1.16.4 with the exact same setup as in 1.17.5 and 1.18.2. Did Kubernetes disable checksum offloading in the past?

mariusgrigoriu commented 4 years ago

I would also like to know what changed because we haven't seen issues of this magnitude on 1.16, but certainly saw something (can't say if it was this) on 1.17. I can confirm that checksum offloading is enabled on our 1.16 nodes.

Gacko commented 4 years ago

Some more details about my environment:

OS: Oracle Linux 7.8
Kernel: Linux kube-1 3.10.0-1127.el7.x86_64 #1 SMP Wed Apr 1 10:20:09 PDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Docker: 18.09.9
Kubernetes: 1.16.9
Flannel: 0.12.0

OS: Oracle Linux 7.8
Kernel: Linux kube-1 3.10.0-1127.el7.x86_64 #1 SMP Wed Apr 1 10:20:09 PDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Docker: 19.03.4
Kubernetes: 1.17.5
Flannel: 0.12.0

So I'm basically always using the same virtual machines with the same operating system and the same kernel version, running on the same single ESXi 6.7U3 host using VMXNET 3 interfaces. The only things changing are the Docker version (18.09.9 <> 19.03.4) and the Kubernetes version (1.16.9 <> 1.17.5). Even the kubernetes-cni package is 0.7.5 in both cases.

The test case is quite simple: use cURL to call an nginx running on a different node via its service IP. Some tcpdump output follows:
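
The captures below were presumably produced with something along these lines (a sketch; interface names and addresses as in this environment):

    # inner packets before encapsulation, on the source host
    tcpdump -nn -v -i flannel.1 host 172.20.3.2
    # encapsulated VXLAN/UDP packets on the physical NIC of either host
    tcpdump -nn -v -i ens192 udp port 8472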

Kubernetes 1.16.9

flannel.1 interface @ kube-1 (source host)

tcpdump: listening on flannel.1, link-type EN10MB (Ethernet), capture size 262144 bytes
15:47:36.717742 IP (tos 0x0, ttl 64, id 45086, offset 0, flags [DF], proto TCP (6), length 60)
    172.20.0.0.60120 > 172.20.3.2.80: Flags [S], cksum 0x5b59 (incorrect -> 0xb990), seq 2764128936, win 28200, options [mss 1410,sackOK,TS val 389814 ecr 0,nop,wscale 7], length 0
E..<..@.@./s...........P..B.......n([Y.........
............
15:47:36.718240 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.20.3.2.80 > 172.20.0.0.60120: Flags [S.], cksum 0x4824 (correct), seq 399076314, ack 2764128937, win 27960, options [mss 1410,sackOK,TS val 388770 ecr 389814,nop,wscale 7], length 0
E..<..@.?............P....k...B...m8H$.........
............
15:47:36.718286 IP (tos 0x0, ttl 64, id 45087, offset 0, flags [DF], proto TCP (6), length 52)
    172.20.0.0.60120 > 172.20.3.2.80: Flags [.], cksum 0x5b51 (incorrect -> 0xe318), seq 1, ack 1, win 221, options [nop,nop,TS val 389815 ecr 388770], length 0
E..4..@.@./z...........P..B...k.....[Q.....
........
15:47:36.718427 IP (tos 0x0, ttl 64, id 45088, offset 0, flags [DF], proto TCP (6), length 126)
    172.20.0.0.60120 > 172.20.3.2.80: Flags [P.], cksum 0x5b9b (incorrect -> 0xc2db), seq 1:75, ack 1, win 221, options [nop,nop,TS val 389815 ecr 388770], length 74: HTTP, length: 74
    GET / HTTP/1.1
    User-Agent: curl/7.29.0
    Host: 172.20.3.2
    Accept: */*

E..~. @.@.//...........P..B...k.....[......
........GET / HTTP/1.1
User-Agent: curl/7.29.0
Host: 172.20.3.2
Accept: */*

...

ens192 interface @ kube-1 (source host, already encapsulated)

tcpdump: listening on ens192, link-type EN10MB (Ethernet), capture size 262144 bytes
15:47:36.717981 IP (tos 0x0, ttl 64, id 23819, offset 0, flags [none], proto UDP (17), length 110)
    192.168.0.11.37330 > 192.168.0.14.8472: [no cksum] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 64, id 45086, offset 0, flags [DF], proto TCP (6), length 60)
    172.20.0.0.60120 > 172.20.3.2.80: Flags [S], cksum 0xb990 (correct), seq 2764128936, win 28200, options [mss 1410,sackOK,TS val 389814 ecr 0,nop,wscale 7], length 0
E..n]...@..
..........!..Z...........f...Q........E..<..@.@./s...........P..B.......n(...........
............
15:47:36.718219 IP (tos 0x0, ttl 64, id 19035, offset 0, flags [none], proto UDP (17), length 110)
    192.168.0.14.46327 > 192.168.0.11.8472: [no cksum] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.20.3.2.80 > 172.20.0.0.60120: Flags [S.], cksum 0x4824 (correct), seq 399076314, ack 2764128937, win 27960, options [mss 1410,sackOK,TS val 388770 ecr 389814,nop,wscale 7], length 0
E..nJ[..@.............!..Z.................f...Q..E..<..@.?............P....k...B...m8H$.........
............
15:47:36.718303 IP (tos 0x0, ttl 64, id 23820, offset 0, flags [none], proto UDP (17), length 102)
    192.168.0.11.37330 > 192.168.0.14.8472: [no cksum] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 64, id 45087, offset 0, flags [DF], proto TCP (6), length 52)
    172.20.0.0.60120 > 172.20.3.2.80: Flags [.], cksum 0xe318 (correct), seq 1, ack 1, win 221, options [nop,nop,TS val 389815 ecr 388770], length 0
E..f]...@.............!..R...........f...Q........E..4..@.@./z...........P..B...k............
........
15:47:36.718438 IP (tos 0x0, ttl 64, id 23821, offset 0, flags [none], proto UDP (17), length 176)
    192.168.0.11.37330 > 192.168.0.14.8472: [no cksum] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 64, id 45088, offset 0, flags [DF], proto TCP (6), length 126)
    172.20.0.0.60120 > 172.20.3.2.80: Flags [P.], cksum 0xc2db (correct), seq 1:75, ack 1, win 221, options [nop,nop,TS val 389815 ecr 388770], length 74: HTTP, length: 74
    GET / HTTP/1.1
    User-Agent: curl/7.29.0
    Host: 172.20.3.2
    Accept: */*

E...]...@.............!..............f...Q........E..~. @.@.//...........P..B...k............
........GET / HTTP/1.1
User-Agent: curl/7.29.0
Host: 172.20.3.2
Accept: */*

Works fine.

Kubernetes 1.17.5

flannel.1 interface @ kube-1 (source host)

tcpdump: listening on flannel.1, link-type EN10MB (Ethernet), capture size 262144 bytes
16:08:11.009312 IP (tos 0x0, ttl 64, id 54000, offset 0, flags [DF], proto TCP (6), length 60)
    172.20.0.0.38646 > 172.20.3.2.80: Flags [S], cksum 0x5b59 (incorrect -> 0xa53e), seq 1361874397, win 29200, options [mss 1460,sackOK,TS val 614189 ecr 0,nop,wscale 7], length 0
E..<..@.@..............PQ,........r.[Y.........
.   _-........
16:08:12.011374 IP (tos 0x0, ttl 64, id 54001, offset 0, flags [DF], proto TCP (6), length 60)
    172.20.0.0.38646 > 172.20.3.2.80: Flags [S], cksum 0x5b59 (incorrect -> 0xa153), seq 1361874397, win 29200, options [mss 1460,sackOK,TS val 615192 ecr 0,nop,wscale 7], length 0
E..<..@.@..............PQ,........r.[Y.........
.   c.........
16:08:14.015336 IP (tos 0x0, ttl 64, id 54002, offset 0, flags [DF], proto TCP (6), length 60)
    172.20.0.0.38646 > 172.20.3.2.80: Flags [S], cksum 0x5b59 (incorrect -> 0x997f), seq 1361874397, win 29200, options [mss 1460,sackOK,TS val 617196 ecr 0,nop,wscale 7], length 0
E..<..@.@..............PQ,........r.[Y.........
.   j.........

ens192 interface @ kube-1 (source host, already encapsulated)

tcpdump: listening on ens192, link-type EN10MB (Ethernet), capture size 262144 bytes
16:08:11.009339 IP (tos 0x0, ttl 64, id 41735, offset 0, flags [none], proto UDP (17), length 110)
    192.168.0.11.20459 > 192.168.0.14.8472: [bad udp cksum 0xffff -> 0x5661!] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 64, id 54000, offset 0, flags [DF], proto TCP (6), length 60)
    172.20.0.0.38646 > 172.20.3.2.80: Flags [S], cksum 0xa53e (correct), seq 1361874397, win 29200, options [mss 1460,sackOK,TS val 614189 ecr 0,nop,wscale 7], length 0
E..n....@.V.........O.!..Z........................E..<..@.@..............PQ,........r..>.........
.   _-........
16:08:12.011414 IP (tos 0x0, ttl 64, id 42018, offset 0, flags [none], proto UDP (17), length 110)
    192.168.0.11.20459 > 192.168.0.14.8472: [bad udp cksum 0xffff -> 0x5661!] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 64, id 54001, offset 0, flags [DF], proto TCP (6), length 60)
    172.20.0.0.38646 > 172.20.3.2.80: Flags [S], cksum 0xa153 (correct), seq 1361874397, win 29200, options [mss 1460,sackOK,TS val 615192 ecr 0,nop,wscale 7], length 0
E..n."..@.T.........O.!..Z........................E..<..@.@..............PQ,........r..S.........
.   c.........
16:08:14.015351 IP (tos 0x0, ttl 64, id 43898, offset 0, flags [none], proto UDP (17), length 110)
    192.168.0.11.20459 > 192.168.0.14.8472: [bad udp cksum 0xffff -> 0x5661!] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 64, id 54002, offset 0, flags [DF], proto TCP (6), length 60)
    172.20.0.0.38646 > 172.20.3.2.80: Flags [S], cksum 0x997f (correct), seq 1361874397, win 29200, options [mss 1460,sackOK,TS val 617196 ecr 0,nop,wscale 7], length 0
E..n.z..@.M.........O.!..Z........................E..<..@.@..............PQ,........r............
.   j.........

ens192 interface @ kube-4 (target host, still encapsulated)

tcpdump: listening on ens192, link-type EN10MB (Ethernet), capture size 262144 bytes
16:08:11.009369 IP (tos 0x0, ttl 64, id 41735, offset 0, flags [none], proto UDP (17), length 110)
    192.168.0.11.20459 > 192.168.0.14.8472: [bad udp cksum 0xffff -> 0x5661!] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 64, id 54000, offset 0, flags [DF], proto TCP (6), length 60)
    172.20.0.0.38646 > 172.20.3.2.80: Flags [S], cksum 0xa53e (correct), seq 1361874397, win 29200, options [mss 1460,sackOK,TS val 614189 ecr 0,nop,wscale 7], length 0
E..n....@.V.........O.!..Z........................E..<..@.@..............PQ,........r..>.........
.   _-........
16:08:12.011488 IP (tos 0x0, ttl 64, id 42018, offset 0, flags [none], proto UDP (17), length 110)
    192.168.0.11.20459 > 192.168.0.14.8472: [bad udp cksum 0xffff -> 0x5661!] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 64, id 54001, offset 0, flags [DF], proto TCP (6), length 60)
    172.20.0.0.38646 > 172.20.3.2.80: Flags [S], cksum 0xa153 (correct), seq 1361874397, win 29200, options [mss 1460,sackOK,TS val 615192 ecr 0,nop,wscale 7], length 0
E..n."..@.T.........O.!..Z........................E..<..@.@..............PQ,........r..S.........
.   c.........
16:08:14.015389 IP (tos 0x0, ttl 64, id 43898, offset 0, flags [none], proto UDP (17), length 110)
    192.168.0.11.20459 > 192.168.0.14.8472: [bad udp cksum 0xffff -> 0x5661!] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 64, id 54002, offset 0, flags [DF], proto TCP (6), length 60)
    172.20.0.0.38646 > 172.20.3.2.80: Flags [S], cksum 0x997f (correct), seq 1361874397, win 29200, options [mss 1460,sackOK,TS val 617196 ecr 0,nop,wscale 7], length 0
E..n.z..@.M.........O.!..Z........................E..<..@.@..............PQ,........r............
.   j.........

Does not work; packets get dropped at the target host due to the bad checksum.

From my understanding, some component is adding a bad checksum to the encapsulated packets starting with Kubernetes 1.17.x.

In Kubernetes 1.16.x the same packets do not have a UDP checksum after encapsulation, which is what the RFC describes as expected behavior.

Correct me if I'm wrong, but as far as I understand, flannel is only writing the CNI config, and kubelet creates the flannel.1 interface using this CNI config and the CNI plugin binaries supplied by the kubernetes-cni package, which is version 0.7.5 in both cases.

As I'm using the same flannel version and the same flannel config in both cases, the CNI config created by flannel should be the same.

So how can it be possible that packets get wrong checksums on 1.17.x while they don't have any on 1.16.x?

jhohertz commented 4 years ago

I may have found something relevant... Out of desperation I went poking around a diff of the release-1.16 and release-1.17 branches of Kubernetes.

I think this has something to do with it: kubernetes/kubernetes#83576

The changes in that dependency specifically have aspects relating to vxlan and checksum handling.

This is speculation at this point, but it does speak to the workarounds people are finding.

Gacko commented 4 years ago

I just updated the kubelet, kubeadm and kubectl packages from 1.16.9 to 1.17.5 without updating the cluster components themselves (API server, scheduler, controller manager, kube-proxy, flannel, etcd). After a reboot my test via cURL still works. The test is a cURL from a worker node to a pod on another worker node via its service IP.

From my point of view the problem is not related to kubelet v1.17.5. Furthermore, I don't think it's related to flannel, since I'm still using the exact same version and configuration in the working cluster as in the broken clusters (see the test documentation above). The only thing left to be changed during a full update would be kube-proxy. So at the moment I think the issue has something to do with kube-proxy.

Gacko commented 4 years ago

Well, things get weird...

Just reset my whole cluster to 1.16.9. Everything works as expected. Then I did the following:

kubectl edit daemonset -n kube-system kube-proxy

... and set the image version of kube-proxy to 1.17.5. That's not really an update nor a recommended change, but: my test does not work anymore. When I roll it back to 1.16.9, everything starts to work again.
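
A roughly equivalent one-liner for that edit might be the following (a sketch; the image repository, tag and container name are the kubeadm-era defaults and may differ in your cluster):

    # switch only the kube-proxy image; set the old tag again to roll back
    kubectl -n kube-system set image daemonset/kube-proxy kube-proxy=k8s.gcr.io/kube-proxy:v1.17.5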

So it really depends on the version of kube-proxy. I totally understand that there might be some kind of kernel bug and that the changes made to kube-proxy are totally legit. But I'm still interested in what's different between those versions and the way they set up iptables, for example.

Gacko commented 4 years ago

I just compared the iptables rules created by v1.16.9 and v1.17.5. There's one difference in the NAT table:

Chain KUBE-POSTROUTING (1 references)
target     prot opt source               destination         
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000

In v1.17.5 this rule exists twice, in v1.16.9 only once. Everything else is identical; only one probability value changes minimally in the decimal places.

I don't know, but could the duplicate rule have an effect on the checksum?

aojea commented 4 years ago

In v1.17.5 this rule exists twice, in v1.16.9 only once. Everything else is identical, only one probability changes minimally in the decimal range.

I checked with v1.19 and I don't have duplicate rules. @Gacko, can you paste the output of iptables-save | grep KUBE-POSTROUTING?

Gacko commented 4 years ago

I'll check this today. Maybe the duplicate rules only occur during an update. I don't think it's really related to the issue, but I also think there shouldn't be duplicate rules in any case. They also occurred after a restart. I'll keep you updated.

s97712 commented 4 years ago

This problem doesn't happen on the master node, and I noticed that the master has two more routes than the other nodes. Is it related to this?

[root@master ~]# ip route
...
169.254.0.0/16 dev eth0 scope link metric 1002 
169.254.0.0/16 dev public scope link metric 1003 
...

Gacko commented 4 years ago

The problem occurs on master nodes, at least in my test setups. Is it possible the pod you tried to access via a service IP was running on the same node (the master)?

Regarding your output: those are actually routes created for zeroconf (link-local) networking. Maybe this kind of automatic address assignment is disabled on the other nodes.

Gacko commented 4 years ago

Sorry for being late to the party... I just installed a clean v1.17 cluster; there are no duplicate iptables rules in there. So it seems like they only occur after upgrading. Anyway, the issue persists. I'll continue investigating...

zhangguanzhang commented 4 years ago

see this https://github.com/kubernetes/kubernetes/issues/88986#issuecomment-635640143

kubealex commented 4 years ago

Just a side note: the issue doesn't happen on the node where the pods balanced by the service are deployed:

NAME                                        READY   STATUS      RESTARTS   AGE   IP           NODE                   NOMINATED NODE   READINESS GATES
ingress-nginx-admission-create-fppsm        0/1     Completed   0          26m   10.244.2.2   k8s-worker-0.k8s.lab   <none>           <none>
ingress-nginx-admission-patch-xnfcw         0/1     Completed   0          26m   10.244.2.3   k8s-worker-0.k8s.lab   <none>           <none>
ingress-nginx-controller-69fb496d7d-2k594   1/1     Running     0          26m   10.244.2.6   k8s-worker-0.k8s.lab   <none>           <none>
[kube@k8s-worker-0 ~]$ curl 10.100.76.252
<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx/1.17.10</center>
</body>
</html>
[kube@k8s-master-0 ~]$ curl 10.100.76.252
^C

malikbenkirane commented 4 years ago

Our workaround is to manually add the route to DNS through a DaemonSet as soon as there is at least one pod running on all workers (so that the cni0 interface appears).

@nonsense could you please provide another example manifest for this

Yes, here it is: https://github.com/ipfs/testground/blob/master/infra/k8s/sidecar.yaml#L23

It ends in a 404.

nonsense commented 4 years ago

@malikbenkirane change ipfs/testground to testground/infra - repo moved - https://github.com/testground/infra/blob/master/k8s/sidecar.yaml

malikbenkirane commented 4 years ago

@malikbenkirane change ipfs/testground to testground/infra - repo moved - https://github.com/testground/infra/blob/master/k8s/sidecar.yaml

Thanks, I like the idea. Though I've found that using Calico rather than flannel works for me. I just set --flannel-backend=none and followed the Calico k3s steps, changing the pod CIDR accordingly.

mohideen commented 4 years ago

I had the same issue on an HA cluster provisioned by kubeadm with RHEL7 nodes. Both options (turning off tx-checksum-ip-generic / switching from vxlan to host-gw) worked. I settled on the host-gw option.

This did not affect a RHEL8 cluster provisioned by kubeadm (though that one was not an HA cluster).

Gacko commented 4 years ago

I guess this can be closed since the related issues have been fixed in Kubernetes.

rdxmb commented 4 years ago

@Gacko could you link the issue/PR for that, please?

rafzei commented 4 years ago

@rdxmb this one: #92035 and changelog

rdxmb commented 4 years ago

@rafzei thanks :+1:

muthu31kumar commented 4 years ago

+1

immanuelfodor commented 4 years ago

I've bumped into the same issue with an RKE v1.19.3 k8s cluster running on CentOS 8 with firewalld completely disabled. The CNI plugin is Canal, which uses both Flannel and Calico. Only pods running with hostNetwork: true and ClusterFirstWithHostNet were affected: they couldn't get DNS resolution on nodes that weren't running a CoreDNS pod. As I had 3 nodes and my CoreDNS replica count was set to 2 by the autoscaler, only pods on the 3rd node were affected.

As RKE doesn't support manual CoreDNS autoscaling parameters (open issue here: https://github.com/rancher/rke/issues/2247), my solution was to explicitly set the Flannel backend to host-gw instead of the implicit vxlan in the RKE cluster.yml file. See the docs here: https://rancher.com/docs/rke/latest/en/config-options/add-ons/network-plugins/#canal-network-plug-in-options

After that, I ran rke up to apply the changes, but it did not have any effect at first, so I also needed to reboot all nodes to fix the issue. Now all pods with hostNetwork: true and ClusterFirstWithHostNet on all nodes are working fine.

 network:
   plugin: canal
-  options: {}
+  options:
+    # workaround to get hostnetworked pods DNS resolution working on nodes that don't have a CoreDNS replica running
+    # do the rke up then reboot all nodes to apply
+    # @see: https://github.com/coreos/flannel/issues/1243#issuecomment-589542796
+    # @see: https://rancher.com/docs/rke/latest/en/config-options/add-ons/network-plugins/
+    canal_flannel_backend_type: host-gw
   mtu: 0
   node_selector: {}
   update_strategy: null

Hitendraverma commented 3 years ago

I am also getting an intermittent issue while running StatefulSets in Kubernetes with hostNetwork: true and dnsPolicy: ClusterFirstWithHostNet.

I got it resolved by doing the following.

You can temporarily fix this by running your DNS pod on the same node on which your application pod is running. Schedule the DNS pod via a node selector or by making your other nodes SchedulingDisabled.

legoguy1000 commented 2 years ago

I just upgraded our bare-metal K8S cluster running on physical servers (v1.23.13) from flannel v0.17.0 to v0.20.1 and am having this issue. My pods with hostNetwork: true can't connect to any other service via ClusterIPs. I fixed it by adding the static route via the cni0 interface as suggested in https://github.com/flannel-io/flannel/issues/1243#issuecomment-596375224.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.