Closed: bhargavamin closed this issue 2 years ago.
Hi, the CNI is only responsible for setting up routing for pod-to-pod communication, and no CNI setting would affect this behaviour, so this doesn't look like a CNI issue.
Since you already have a support case open, we will look into why kubectl is not working for the additional CIDRs and check whether the CIDR is allow-listed. Will close this issue for now.
Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.
I was able to fix the issue. It was a node AMI config issue.
When you have self-managed nodes, you need to explicitly pass some parameters to the /etc/eks/bootstrap.sh script so that it can support an IPv6 EKS cluster.
Adding `bootstrap_extra_args: "--ip-family ipv6 --service-ipv6-cidr fc00::/7"`
fixed the issue.
Ref: https://github.com/terraform-aws-modules/terraform-aws-eks/issues/1958
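For anyone hitting the same thing, here is a minimal sketch of what those flags look like when the node user data invokes the bootstrap script; the cluster name is a placeholder, and with the terraform-aws-eks module this is effectively what `bootstrap_extra_args` gets appended to:

```sh
#!/bin/bash
# Minimal user-data sketch for a self-managed node joining an IPv6 EKS cluster.
# "my-ipv6-cluster" is a placeholder cluster name; the two extra flags are the
# ones that fixed the issue above.
set -o errexit

/etc/eks/bootstrap.sh my-ipv6-cluster \
  --ip-family ipv6 \
  --service-ipv6-cidr fc00::/7
```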
I'm setting up a dual-stack VPC with multiple CIDRs, in which I created an EKS cluster using the Terraform module and self-managed nodes.
I'm facing a very peculiar issue where `kubectl exec` and `kubectl logs` commands only succeed against pods on nodes from a single CIDR range out of the 3 CIDRs attached to the VPC. I suspect that this issue could be related to iptables rules or VPC CNI settings.
CIDR ranges attached to the VPC:
The cluster and nodes are launched in private subnets. IPv6 internet traffic goes out through an egress-only internet gateway and IPv4 traffic goes out through a NAT gateway.
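For context, these are the kinds of commands that only work when the target pod lands on a node from one specific CIDR (pod name below is a placeholder):

```sh
# Placeholder pod name; both commands succeed only for pods on nodes in one
# CIDR range and fail for pods on nodes in the other two ranges.
kubectl exec -it demo-pod -- /bin/sh
kubectl logs demo-pod
```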
I have checked the following things:
- Switched the EKS cluster endpoint between public and public-and-private; both gave the same results.
- VPC, NACLs, security groups, and route tables for the EKS cluster were configured properly.
- VPC CNI configs for IPv6 are set and the IPv4 settings are disabled.
- The VPC CNI was installed by default at version 1.10.1; we updated it to v1.11.2 and still experienced the issue.
- Port 10250 is listening on the nodes in the affected CIDRs; confirmed it with telnet and netstat.
- Telnet to port 10250 from node to node works.
- Confirmed that we could retrieve logs from containers running on the nodes in the affected CIDRs using `docker logs` (see the sketch of these checks after this list).
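A rough sketch of those node-level checks, with placeholders for the node IP and container ID:

```sh
# Run on / against the affected nodes. <node-ip> and <container-id> are placeholders.
sudo netstat -tlnp | grep 10250     # kubelet is listening on 10250
telnet <node-ip> 10250              # node-to-node reachability on the kubelet port
sudo docker ps                      # list containers on the node
sudo docker logs <container-id>     # logs are retrievable directly from the runtime
```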
I ran the kubectl command in debug mode and got the error below:
While executing the command above, we were able to see the error below in CloudWatch Logs:
Versions:
- Kubernetes: v1.21
- VPC CNI: v1.11.2
- OS: default
- Kernel: default
- terraform-aws-eks module: v18.21.0
AWS Support Case ID: 10315548691
If required, I can provide the debug output of the VPC CNI troubleshooting script.
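For reference, this is how that output would typically be collected on an affected node; the script path assumes the EKS-optimized AMI:

```sh
# Collects VPC CNI / node diagnostics on an EKS-optimized AMI.
# The script bundles the results into a tarball under /var/log.
sudo bash /opt/cni/bin/aws-cni-support.sh
```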