aws / amazon-vpc-cni-k8s

Networking plugin repository for pod networking in Kubernetes using Elastic Network Interfaces on AWS
Apache License 2.0

Upgrade CNI version broke pod-to-pod communication within the same worker node #641

Closed rimaulana closed 4 years ago

rimaulana commented 5 years ago

After upgrading the CNI version from v1.5.1-rc1 to v1.5.4, we are seeing an issue where a pod is unable to communicate with other pods on the same worker node. We have the following layout:

CoreDNS pod on eth0
Kibana pod on eth0
App1 on eth1
App2 on eth2

What we are seeing is that DNS queries from App1 and App2 fail with no server found when we try them using the dig command:

dig @CoreDNS-ip amazonaws.com

Meanwhile, executing the same command from the Kibana pod, from the worker node itself, and from a pod on a different worker node works as expected.

When collecting the logs using https://github.com/nithu0115/eks-logs-collector, we found that the CoreDNS pod IP did not appear anywhere in the output of the ip rule show command. I would expect each IP address of a pod running on the worker node to have at least this associated rule in the ip rule output:

512: from all to POD_IP lookup main

However, we do not see one for the CoreDNS pod IP. Therefore, we believe this is an issue with the CNI plugin being unable to rebuild the rule after the upgrade. There is an internal issue open for this if you want to get the collected logs.
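
As an illustrative check only (not part of the original report; it assumes kubectl access from the worker node and that the node name matches the hostname), one way to list pod IPs on a node that are missing their to-rule is to compare the pod IPs scheduled on the node against ip rule show:

# Hypothetical helper: list pod IPs on this node with no "from all to <IP> lookup main" rule.
# Host-network pods share the node IP and are not expected to have such a rule.
NODE_NAME=$(hostname)   # assumption: node name matches the hostname; adjust if it does not
for ip in $(kubectl get pods --all-namespaces --field-selector spec.nodeName="$NODE_NAME" \
    -o jsonpath='{range .items[*]}{.status.podIP}{"\n"}{end}' | sort -u); do
  ip rule show | grep -q "to $ip " || echo "missing ip rule for pod IP: $ip"
done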

MartiUK commented 5 years ago

Downgrading to v1.5.3 resolved this issue on (EKS) k8s v1.14 with CoreDNS v1.3.1. Required node reboots first.

mogren commented 5 years ago

Glad you found a work-around (rebooting the nodes), but I'll keep trying to reproduce this.

igor-pinchuk commented 5 years ago

Facing the same issue. Downgrading to v1.5.3 followed by rebooting the nodes helped.

ueokande commented 5 years ago

We encountered the issue with Kubernetes 1.13 (eks.4) and amazon-vpc-cni-k8s v1.5.4. It affects not only CoreDNS but also inter-pod communication.

It occurs immediately after the cluster is created. We repaired it by restarting the pods (which releases and reassigns each pod's IP address):

$ kubectl delete pod --all
$ kubectl delete pod -nkube-system --all
dmarkey commented 5 years ago

I've been tearing my hair out all day after upgrading a cluster. Please change https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html to suggest v1.5.3 instead of v1.5.4, so as not to break more clusters until it's verified that this bug is fixed.

mogren commented 5 years ago

@dmarkey None of the three minor changes between v1.5.3 and v1.5.4 has anything to do with routes, so I suspect there is some other existing issue that we have not been able to reproduce yet. Does rebooting the nodes without downgrading not fix the issue?

We have seen related issues with routes when using Calico, but they are the same on v1.5.3 and v1.5.4. Still investigating this.

angelichorsey commented 5 years ago

This is a sysctl fix, no?

net.bridge.bridge-nf-call-ip6tables=1
net.bridge.bridge-nf-call-iptables=1
net.bridge.bridge-nf-call-arptables=1

If you don't have these then the docker bridge can't talk back to itself.

https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/#network-plugin-requirements
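
For reference (a hedged sketch, not from the original comment; it assumes the br_netfilter module is available on the node), the settings can be checked and applied like this:

# Check the current values on the node
sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables net.bridge.bridge-nf-call-arptables

# Apply them for the running kernel (the keys only exist once br_netfilter is loaded)
sudo modprobe br_netfilter
sudo sysctl -w net.bridge.bridge-nf-call-iptables=1 net.bridge.bridge-nf-call-ip6tables=1 net.bridge.bridge-nf-call-arptables=1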

nithu0115 commented 5 years ago

@dmarkey are you seeing a missing rule in the routing policy database? Could you elaborate more on the issue you are running into?

schahal commented 5 years ago

Can we please update https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/release-1.5/config/v1.5/aws-k8s-cni.yaml to 1.5.3 until 1.5.4 is fully vetted? We are running into the same issue and want the default to be the working version.

dmarkey commented 5 years ago

The main issue was that around 10% of pods were not able to talk to other pods, like coredns, and therefore couldn't resolve and/or connect to dependent services. They could, however, connect to services on the internet.

I also noticed that, for the problematic pods, their IP was missing from the node's ifconfig output. I assume they would need an interface added that is visible on the host?

dmarkey commented 5 years ago

I have powered up the cluster twice from scratch with ~200 pods with 1.5.3 and it has come up flawlessly.

With 1.5.4, about 20% of pods couldn't find their dependencies, either by not being able to resolve their addresses (mostly services in the same namespace) or by not being able to reach the dependency at all. I must have powered up the ASG about 10 times trying to troubleshoot the situation.

mogren commented 5 years ago

@dmarkey Thanks for the update, will keep testing this. @schahal I have reverted config/v1.5/aws-k8s-cni.yaml to point to v1.5.3 for now.

mogren commented 5 years ago

@dmarkey Could you please send me log output from https://github.com/awslabs/amazon-eks-ami/tree/master/log-collector-script ? (Either mogren at amazon.com or c.m in the Kubernetes slack)

dmarkey commented 5 years ago

Do you mean with 1.5.3 or 1.5.4? I'm afraid this cluster is in active use (although not classed as "production"), so I can't easily revert without causing at least some disruption. Either way, I don't have access until Monday morning Irish time.

mogren commented 5 years ago

@dmarkey Logs from a node where you see the communication issue, so v1.5.4. If you could get that next week I'd be very thankful. Sorry to cause bother on a Friday evening! 🙂

mogren commented 5 years ago

I have still not been able to reproduce this issue, and I have not gotten any logs showing errors in the CNI, but I have seen a lot of errors in the CoreDNS logs. If anyone can reliably reproduce the issue, or find a missing route or iptable rule, I'd be happy to know more.

ayosec commented 5 years ago

We had a similar problem today, with 1.5.4.

Yesterday, we changed the configuration of the deployment to set AWS_VPC_K8S_CNI_LOGLEVEL=INFO, so the aws-node-* pods were restarted. We checked that it was able to assign IP addresses to new pods, and everything was working as expected.

Today, we updated some deployments, and then we started to see 504 Gateway Timeout errors in some requests.

After some investigation, we found that the ingress controller was not able to connect to pods on the same node. The pod (with IP 10.200.254.228) was accessible from ingress controllers on other nodes.

We ruled out a bug in the ingress controller because even a ping was not possible:

# nsenter -t 1558 -n ping -c 2 10.200.254.228
PING 10.200.254.228 (10.200.254.228) 56(84) bytes of data.

--- 10.200.254.228 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1001ms

(1558 is the PID of the ingress controller).

The ping worked from the host network.


After more investigation, we found an issue in the IP rules:

# ip rule show
0:  from all lookup local 
512:    from all to 10.200.211.143 lookup main 
512:    from all to 10.200.204.145 lookup main 
512:    from all to 10.200.212.149 lookup main 
512:    from all to 10.200.206.165 lookup main 
512:    from all to 10.200.236.131 lookup main 
512:    from all to 10.200.202.149 lookup main 
512:    from all to 10.200.220.69 lookup main 
512:    from all to 10.200.223.122 lookup main 
512:    from all to 10.200.212.190 lookup main 
512:    from all to 10.200.206.240 lookup main 
1024:   from all fwmark 0x80/0x80 lookup main 
1536:   from 10.200.222.108 to 10.200.0.0/16 lookup 2 
1536:   from 10.200.254.228 to 10.200.0.0/16 lookup 3 
1536:   from 10.200.221.230 to 10.200.0.0/16 lookup 3 
1536:   from 10.200.211.143 to 10.200.0.0/16 lookup 3 
1536:   from 10.200.204.145 to 10.200.0.0/16 lookup 3 
1536:   from 10.200.212.149 to 10.200.0.0/16 lookup 2 
1536:   from 10.200.206.165 to 10.200.0.0/16 lookup 3 
1536:   from 10.200.236.131 to 10.200.0.0/16 lookup 2 
1536:   from 10.200.202.149 to 10.200.0.0/16 lookup 2 
1536:   from 10.200.220.69 to 10.200.0.0/16 lookup 2 
1536:   from 10.200.223.122 to 10.200.0.0/16 lookup 2 
1536:   from 10.200.206.240 to 10.200.0.0/16 lookup 2 
32766:  from all lookup main 
32767:  from all lookup default 

In the previous list, you can see that 10.200.254.228 is missing from the from all to ... rules.

We added it manually:

# ip rule add from all to 10.200.254.228 lookup main

And the issue was fixed.
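
As a stopgap sketch only (derived from the rule output above, not from the original comment; run as root on the affected node), the same manual fix can be looped over every pod IP that still has its priority-1536 from-rule but lost the matching to-rule:

# Re-add the missing "from all to <IP>" rule for any pod IP that only has its "from <IP> ..." rule left
for ip in $(ip rule show | awk '$1 == "1536:" {print $3}' | sort -u); do
  ip rule show | grep -q "to $ip lookup main" || {
    echo "adding missing rule for $ip"
    ip rule add from all to "$ip" lookup main
  }
done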


We checked the logs, and the only error related to 10.200.254.228 is the following (in plugin.log):

2019-10-14T03:55:21.684Z [INFO] Received CNI del request: ContainerID(ae40e6b983f6f3cb21753559ed9eb10eb7e7a341ce3a9afe975078d65d9002ec) Netns(/proc/23768/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=staging;K8S_POD_NAME=redacted-58948849cf-bjlfb;K8S_POD_INFRA_CONTAINER_ID=ae40e6b983f6f3cb21753559ed9eb10eb7e7a341ce3a9afe975078d65d9002ec) Path(/opt/cni/bin) argsStdinData({"cniVersion":"0.3.1","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-10-14T03:55:21.688Z [ERROR]    Failed to delete toContainer rule for 10.200.254.228/32 err no such file or directory
2019-10-14T03:55:21.688Z [INFO] Delete Rule List By Src [{10.200.254.228 ffffffff}]
2019-10-14T03:55:21.688Z [INFO] Remove current list [[ip rule 1536: from 10.200.254.228/32 table 3]]
2019-10-14T03:55:21.688Z [INFO] Delete fromContainer rule for 10.200.254.228/32 in table 3
mogren commented 4 years ago

@ayosec Thanks a lot for the helpful details!

Magizhchi commented 4 years ago

We are facing the same issue: pod-to-pod communication intermittently goes down, and restarting the pods brings it back up.

We followed the suggestion above to downgrade to 1.5.3 and restart the node which worked for us.

So maybe there is some issue with v1.5.4

yydzhou commented 4 years ago

> @dmarkey Thanks for the update, will keep testing this. @schahal I have reverted config/v1.5/aws-k8s-cni.yaml to point to v1.5.3 for now.

The version in https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/release-1.5/config/v1.5/aws-k8s-cni.yaml is still 1.5.4, and we still hit the issue.

mogren commented 4 years ago

@yydzhou So far in our tests v1.5.3 and v1.5.4 behave the same, and the issues we have been able to reproduce seem to be resolved by restarting CoreDNS. If you have any logs from the CNI showing errors in setting up routes or rules, please send them to me.

owlwalks commented 4 years ago

For anyone stopping by: I tried every possible combination of CoreDNS, kube-proxy, and the EKS CNI, and ended up with Cilium. So far so good; even Alpine containers query DNS just fine.

The root cause is a conntrack race condition in the Linux kernel:

mogren commented 4 years ago

@owlwalks Thanks for the helpful comment. Cilium uses eBPF, so I guess that is how they avoid the conntrack issues.

There is a related issue in https://github.com/awslabs/amazon-eks-ami/issues/357 and the Amazon Linux team is aware of the problem.

anand99 commented 4 years ago

EKS clusters created with CNI v1.5.4 are still giving timeout issues for pod-to-pod communication on the same node, even after killing the existing CoreDNS pods. Restarting CoreDNS might take away the domain-resolution errors, but the pod-to-pod communication issue stays.

For example, one pod talks to another using gRPC and throws the following error:

msg="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.100.5.212:50051: i/o timeout\""

10.100.5.212 is the ClusterIP of the service.

If we delete/restart the pods involved in the inter-pod communication, the timeout issue gets resolved and the applications start to work as expected.

nachomillangarcia commented 4 years ago

I am also facing this on one of the nodes. By checking connections between all pods, I could see that the connectivity problem was only between pods whose IPs are on different ENIs, regardless of whether it is CoreDNS or any other container.

Then, checking the IP rules, I noticed that the rules for all IPs on the primary ENI were missing.

I couldn't find any log about that error; I'll be glad to share the logs if anyone is interested.

vmrm commented 4 years ago

Same thing with the ip rules and CNI version 1.5.4 in an EKS 1.14 cluster; downgrading to 1.5.3 helped.

jwenz723 commented 4 years ago

I just created a new EKS 1.14 cluster. I tried to install Linkerd by running linkerd install | kubectl apply -f - using the Linkerd CLI version 2.6.0. After the install completes, not all of the pods start up. I let the cluster sit for a few hours to see if it would eventually resolve; it never did.

Next I downgraded the AWS VPC CNI from v1.5.4 to v1.5.3, and suddenly all my pods started and reached a Ready state.

Definitely seems to be something wrong with v1.5.4. It seems like it may be affecting DNS or Admission controllers.

I see logs like this in multiple pods:

ERR! [ 24115.102053s] admin={bg=identity} linkerd2_proxy::app::identity Failed to certify identity: grpc-status: Unknown, grpc-message: "the request could not be dispatched in a timely fashion"
victorboissiere commented 4 years ago

Same with CNI v1.5.4 in an EKS 1.13 cluster; downgrading to 1.5.3 helped. I had to migrate all workloads to new worker nodes, and I am no longer seeing pod-to-pod communication errors on the same node. I do not see any timeout errors now.

awprice commented 4 years ago

I'm not sure if this is the cause of the issues when upgrading from 1.5.3 to 1.5.4, but I noticed that the v1.5.4 tag is missing a commit that is on v1.5.3: https://github.com/aws/amazon-vpc-cni-k8s/commit/d7bac816d0d7dec16d9d61d5cec5f401f02dc022

https://github.com/aws/amazon-vpc-cni-k8s/commits/v1.5.4 https://github.com/aws/amazon-vpc-cni-k8s/commits/v1.5.3

Was there a reason why this was done @mogren?

kuberkaul commented 4 years ago

Well, this is blocking for us as well. We have over 100 nodes and 30 clusters. The manual rollback doesn't work for us, especially since we need other features of K8s to keep upgrading and the CNI comes with the baked AMI.

Any ETA on when this will be fixed and pushed upstream?

mogren commented 4 years ago

@awprice Good catch, I'm afraid that commit wasn't merged back into the release-1.5 branch correctly. I'm testing a v1.5.5 build right now with that commit and #667, and so far I don't see any of the issues we have seen with v1.5.4.

awprice commented 4 years ago

@mogren I'm a little confused. Why is the addition of https://github.com/aws/amazon-vpc-cni-k8s/pull/667 the fix for the issues when upgrading from 1.5.3 -> 1.5.4? The only difference I can see between 1.5.3 and 1.5.4 is the missing commit https://github.com/aws/amazon-vpc-cni-k8s/commit/d7bac816d0d7dec16d9d61d5cec5f401f02dc022 I mentioned above.

Is introducing another "fix" the right approach? Does only adding https://github.com/aws/amazon-vpc-cni-k8s/commit/d7bac816d0d7dec16d9d61d5cec5f401f02dc022 to 1.5.5 fix the issues too?

mogren commented 4 years ago

This issue is caused by #623. Configs in the master and release-1.5 branches have been reverted back to v1.5.3 to avoid the ip rule issues.

I will work to get a v1.5.5 out soon with a revert of #623, but with the new instance types. (Instead of backporting #667)

AshishThakur commented 4 years ago

Facing a similar issue when we upgraded to 1.12. As per the suggestions in this issue, we downgraded the dev cluster to 1.5.3 and things started working fine, whereas on the prod cluster we are seeing slowness in DNS resolution with 1.5.3 as well.

cemo commented 4 years ago

I think it would be good to send an email about the issue and the workaround for now. @mogren Unfortunately we spent a lot of time over the last 2 weeks on this issue. It would be awesome to be notified about it. The effect of the issue is enormous in small clusters.

ueokande commented 4 years ago

Today, we created a new EKS cluster, and amazon-k8s-cni:v1.5.3 is deployed. Our cluster is now fine!

mprenditore commented 4 years ago

Faced the same issue. Upgrading from 1.5.3 to 1.5.4 started to create some problems, a lot of 504s. Reverting back to 1.5.3 wasn't enough; we needed to restart all the cluster nodes in order to get back to full functionality. A full restart with 1.5.4 could probably have worked too, based on what other people said here about there being no huge changes. But even so, the earlier upgrade from 1.2.1 to 1.5.3 didn't create any issues.

mogren commented 4 years ago

Please try the v1.5.5 release candidate if you need g4, m5dn, r5dn or Kubernetes 1.16 support.

daviddelucca commented 4 years ago

@MartiUK How did you downgrade amazon-k8s-cni? Could you show me the steps, please?

chadlwilson commented 4 years ago

@daviddelucca Replace the region below with whatever is appropriate for you:

kubectl set image daemonset.apps/aws-node \
  -n kube-system \
  aws-node=602401143452.dkr.ecr.ap-southeast-1.amazonaws.com/amazon-k8s-cni:v1.5.3

And then it seems restarting all pods, at minimum, is required. Some seem to have restarted all nodes (which would restart the pods as a side effect), but it's unclear if that's really required.
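
As a hedged follow-up (not from the original comment; the label selector assumes the stock aws-node daemonset labels), you can confirm the rollout finished and see which image each node is actually running before restarting workloads:

# Wait for the daemonset rollout to complete
kubectl rollout status daemonset/aws-node -n kube-system

# Show the CNI image running on each node's aws-node pod
kubectl get pods -n kube-system -l k8s-app=aws-node \
  -o jsonpath='{range .items[*]}{.spec.nodeName}{"\t"}{.spec.containers[0].image}{"\n"}{end}'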

daviddelucca commented 4 years ago

@chadlwilson thank you very much

mogren commented 4 years ago

v1.5.5 is released with a revert of the commit that caused issues. Resolving this issue.

wadey commented 4 years ago

Unless I'm misunderstanding, it looks like v1.6.0-rc4 also has the problematic commit. Can we get a v1.6.0-rc5 with the fix there as well?

eladazary commented 4 years ago

I've been facing this issue since yesterday with CNI 1.5.5. I've tried to downgrade to 1.5.3 and 1.5.5 but with no success. It looks like the /etc/cni/net.d/10-aws.conflist file only gets created when using CNI v1.5.1.

Errors from ipamd.log:

Starting L-IPAMD v1.5.5 ...
2019-11-26T16:31:57.105Z [INFO] Testing communication with server
2019-11-26T16:32:27.106Z [INFO] Failed to communicate with K8S Server. Please check instance security groups or http proxy setting
2019-11-26T16:32:27.106Z [ERROR] Failed to create client: error communicating with apiserver: Get https://172.20.0.1:443/version?timeout=32s: dial tcp 172.20.0.1:443: i/o timeout

I saw that after I upgraded to CNI 1.5.5 again, the file /etc/cni/10-aws.conflist got created. Maybe it is something with the path kubelet is looking in for the CNI config file?

Nodes are in Ready status but all pods are in ContainerCreating state.

Do you have any idea why does it happen?

mogren commented 4 years ago

@wadey The issue is not in v1.6.0-rc4; there we solved it in another way, see #688. This is a better solution, since if we return an error when we try to delete a pod that was never created, kubelet will retry 10 times trying to delete something that doesn't exist before giving up.

mogren commented 4 years ago

@eladazary The error you are seeing is unrelated to this issue. Starting with v1.5.3, we don't make the node active until ipamd can talk to the API server. If permissions are not correct and ipamd (aws-node pods) can't talk to the API server or to the EC2 control plane, it can't attach IPs to the nodes and then pods will never get IPs and become active.

Make sure that the worker nodes are configured correctly. The logs for ipamd should tell you what the issue is; they can be found in /var/log/aws-routed-eni/ on the node.

More about worker nodes: https://docs.aws.amazon.com/eks/latest/userguide/launch-workers.html
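
As a small illustration (the file names match those referenced elsewhere in this thread; exact names may vary by version), the logs can be inspected directly on the worker node:

# On the worker node: check the most recent ipamd and CNI plugin log entries
sudo tail -n 50 /var/log/aws-routed-eni/ipamd.log
sudo tail -n 50 /var/log/aws-routed-eni/plugin.log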

itsLucario commented 3 years ago

A similar issue came up with 1.7.5 when upgrading from 1.6.1. Around 10% of the pods are able to communicate with each other and the others are failing.

Even downgrading to 1.6.1 didn't work until we restarted the nodes. Can someone explain the cause and the status of the fix for this?

jayanthvn commented 3 years ago

Hi @itsLucario

When you upgraded, was it just an image update, or did you reapply the config (https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.7.5/config/v1.7/aws-k8s-cni.yaml)?

itsLucario commented 3 years ago

@jayanthvn I applied the exact config YAML you shared. Also, since we are using CNI custom networking, once the daemonset is updated we run:

kubectl set env daemonset aws-node -n kube-system AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true

Edit: If I set AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true in the container env while updating the CNI itself, the upgrade happens seamlessly.

I think the docs should be updated to mention that, if custom networking is configured, the manifest should be updated accordingly before upgrading.
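
A hedged way to verify that step (the jsonpath query is an assumption about the stock aws-node daemonset, not something from this thread) is to check whether the flag survived the manifest apply and re-set it if not:

# Check whether the custom-networking flag is present after applying the new manifest
kubectl get daemonset aws-node -n kube-system \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG")].value}'

# Re-set it if the output above is empty
kubectl set env daemonset/aws-node -n kube-system AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true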

jayanthvn commented 3 years ago

Hi @itsLucario

Yes that makes sense and thanks for checking. Even I was suspecting that is what is happening hence wanted to know how you upgraded. Can you please open an issue for documentation? I can take care of it.

Thanks.