aws / amazon-vpc-cni-k8s

Networking plugin repository for pod networking in Kubernetes using Elastic Network Interfaces on AWS
Apache License 2.0

using `amazon-vpc-cni-k8s` outside eks #2839

Open is-it-ayush opened 8 months ago

is-it-ayush commented 8 months ago

What happened:

Hi! I have an EC2 instance with containerd as the container runtime, inside a private subnet (which has outbound internet access) in ap-south-1. I initialized a new cluster with kubeadm init on this master node, and it ran successfully. I then wanted to install amazon-vpc-cni as the CNI plugin for my k8s cluster. I ran kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/master/config/master/aws-k8s-cni.yaml and checked the pods with kubectl get pods -n kube-system. One of the pods created by amazon-vpc-cni-k8s, named aws-node-xxxx, throws an error when trying to initialise. I ran kubectl describe pod aws-node-xxx -n kube-system and I get the following.

Failed to pull image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.16.4": failed to pull and unpack image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.16.4": failed to resolve reference "amazon-k8s-cni-init:v1.16.4": pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credential

I don't understand why this fails. Is it not possible to use amazon-vpc-cni outside EKS, in a self-managed cluster? I also looked around the issues here, and it seems other people have hit this before, but I was unable to resolve it myself. Here is my k8s_master_ecr policy, inside a k8s_master role which is attached to this master instance via an instance profile:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "K8sECR",
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:GetRepositoryPolicy",
                "ecr:DescribeRepositories",
                "ecr:ListImages",
                "ecr:BatchGetImage"
            ],
            "Resource": "*"
        }
    ]
}
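
As a sanity check (assuming the AWS CLI and crictl are available on the node; the account, region and image tag below are just the ones from the error message), the instance role can be tested directly by fetching an ECR password and handing it to the container runtime:

# fetch a short-lived ECR password using the instance role
PASS=$(aws ecr get-login-password --region us-west-2)

# ask the container runtime to pull the init image with explicit credentials
crictl pull --creds "AWS:${PASS}" \
  602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.16.4

If a manual pull like this succeeds, the IAM policy itself should be fine and the failure is on the kubelet/runtime credential side.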

Environment:

kwohlfahrt commented 8 months ago

We are running the AWS CNI outside of EKS. We also have the AWS credential provider installed; this allows the kubelet to use the instance credentials to pull from private ECR registries. Before Kubernetes 1.28 (I think, might be off by a version), this functionality was bundled into the kubelet itself.

is-it-ayush commented 8 months ago

That's interesting @kwohlfahrt! I've never used aws-credential-provider. After reading into it, I have a few questions:

kwohlfahrt commented 8 months ago

Should I just deploy it by applying all the files listed at github.com/kubernetes/cloud-provider-aws/tree/master/examples/existing-cluster/base with kubectl apply -f?

AFAIK, the credential provider can't be installed by applying manifests; it must be installed on your node, since you have to change the kubelet's flags to use it. The binary and configuration must be placed on disk, and then the kubelet's flags have to be modified to point to the configuration file and to the directory to search for the binary. This is documented on this page, which also includes an example config.
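
As a rough sketch (the paths, matchImages pattern and cache duration here are only examples to adapt), the pieces look something like this:

# /etc/kubernetes/image-credential-provider/config.yaml  (example path)
apiVersion: kubelet.config.k8s.io/v1
kind: CredentialProviderConfig
providers:
  - name: ecr-credential-provider          # must match the binary name in the bin dir
    matchImages:
      - "*.dkr.ecr.*.amazonaws.com"        # private ECR registries
    defaultCacheDuration: "12h"
    apiVersion: credentialprovider.kubelet.k8s.io/v1

# kubelet flags pointing at the config and the directory holding the binary,
# e.g. appended to KUBELET_KUBEADM_ARGS in /var/lib/kubelet/kubeadm-flags.env:
#   --image-credential-provider-config=/etc/kubernetes/image-credential-provider/config.yaml
#   --image-credential-provider-bin-dir=/etc/kubernetes/image-credential-provider/

After restarting the kubelet, ECR pulls should start using the node's instance role.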

Where do I get the binary aws-credential-provider?

Pre-built binaries can be found here (source)

Does it work with containerd?

Yes, we've used it with containerd in the past, though we are using CRI-O now. AFAIK, the container runtime never interacts with the credential provider directly: the credential provider is called by the kubelet, which then passes the received credentials on to your container runtime. So it shouldn't matter whether you are using containerd, CRI-O, etc.

is-it-ayush commented 8 months ago

Thank you so much @kwohlfahrt! I was able to follow through and resolve this, and all the pods are running successfully now. These are the steps I took:

github-actions[bot] commented 8 months ago

This issue is now closed. Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one.

is-it-ayush commented 8 months ago

Hey @kwohlfahrt! It seems this wasn't entirely resolved. As soon as I joined another node, I ran into trouble with the aws-node pod failing to communicate with ipamd from aws-vpc-cni, but the ipamd logs didn't indicate any errors, so I was unable to understand what's wrong. The setup hasn't changed and I only added one worker (1 master [10.0.32.163], 1 worker [10.0.32.104]). Here are a few outputs from my master node:

I did assign the ec2:CreateTags permission, which seemed to be missing, and I recreated my entire cluster. The readiness and liveness probes still throw the same x.x.x.x:xxx -> 10.x.0.x:53 errors, and coredns is unable to become ready.

kwohlfahrt commented 8 months ago

Hm, I'm not sure. My only suspicion is that you might be hitting #2840, which I reported the other day.

You can easily check by connecting to your node and seeing if /run/xtables.lock is a directory - it should be a file. If it is created as a directory, it causes kube-proxy to fail, which means the CNI cannot reach the API server.

You can see the linked PR in that issue for the fix (the volume needs to be defined with type: FileOrCreate), just make sure to SSH to the node and rmdir /run/xtables.lock after applying the fix.
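
Concretely (the exact volume name differs between manifests, so treat this as a sketch), the check on the node and the hostPath volume that mounts /run/xtables.lock should end up looking like this:

# on the node: this should report a regular (empty) file, not a directory
stat -c %F /run/xtables.lock

# hostPath volume in the affected DaemonSet manifest
volumes:
  - name: xtables-lock
    hostPath:
      path: /run/xtables.lock
      type: FileOrCreate    # per the linked issue, a missing path otherwise gets created as a directory

# after re-applying the manifest, remove the stale directory on the node:
sudo rmdir /run/xtables.lock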

is-it-ayush commented 7 months ago

Thank you @kwohlfahrt! I had some missing IAM permissions, which I added to the master node. It seems this still hasn't really resolved the problem: coredns still can't reach anything, as is apparent from the logs when running kubectl logs coredns-76f75df574-49gs5 -n kube-system. I'm not entirely sure what's causing this.

[ERROR] plugin/errors: 2 4999722014791650549.7690820414208347954. HINFO: read udp 10.0.43.148:57589->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 4999722014791650549.7690820414208347954. HINFO: read udp 10.0.43.148:38940->10.0.0.2:53: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout

is-it-ayush commented 7 months ago

Update! I was ultimately unable to resolve the coredns issues with aws-vpc-cni & aws-cloud-controller-manager. There are multiple issues:

I switched to cilium and let go of my dream to connect k8s and aws.

orsenthil commented 6 months ago

[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout

This seems like the coredns pod got an IP address, but it wasn't able to communicate with the API server, possibly due to missing permissions? The nodes/pods should be able to communicate with the API server, given the necessary permissions.

Were you able to narrow it down to any permission issue?
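
A quick way to narrow it down (the 10.96.0.1 service VIP below is taken from your logs, and the image is just an example) is to test the same path CoreDNS uses, both from the node and from a pod on the pod network:

# from the node itself - this goes through kube-proxy's rules for the kubernetes Service
curl -k https://10.96.0.1:443/healthz

# from a throwaway pod on the pod network
kubectl run apicheck --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -sk -m 5 https://10.96.0.1:443/healthz

If the node can reach it but the pod cannot, that points at kube-proxy / CNI routing rather than IAM permissions.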

is-it-ayush commented 6 months ago

[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout

This seems like the coredns pod got an IP address, but it wasn't able to communicate with the API server, possibly due to missing permissions? The nodes/pods should be able to communicate with the API server, given the necessary permissions.

Were you able to narrow it down to any permission issue?

Not really! I did all I could and scanned all of journalctl to find something. I wrote about it here, and I couldn't get aws-vpc-cni working, as far as I remember. I double-checked permissions and instance roles, but it didn't seem like they were the problem.

It seems like both of them are broken. The controller-manager fails to get the providerID from the AWS cloud for nodes in random order, even if you set the hostname to the private IPv4 DNS name and add the correct tags. It fails to initialise newly joined nodes, or even the master node itself, which leads to the worker nodes getting deleted and the master node being tainted as NotReady. The coredns pod fails to run regardless of the first issue, and there is no way to debug why. The logs collected by /opt/cni/bin/aws-cni-support.sh are not enough to debug the coredns problem.
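
For reference, the things the cloud controller generally expects, as far as I understood (values below are purely illustrative):

# node name must match the EC2 private DNS name (IMDSv1 shown; IMDSv2 needs a token)
kubeadm join <control-plane>:6443 --token <token> --discovery-token-ca-cert-hash <hash> \
  --node-name "$(curl -s http://169.254.169.254/latest/meta-data/local-hostname)"

# the instance (and its subnets/security groups) should carry the cluster tag
aws ec2 create-tags --resources i-0123456789abcdef0 \
  --tags Key=kubernetes.io/cluster/my-cluster,Value=owned

# once the controller has initialised a node, it should have a provider ID of the form
# aws:///<availability-zone>/<instance-id>
kubectl get node <node-name> -o jsonpath='{.spec.providerID}'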

terryjix commented 6 months ago

I am hitting the same issue. The pod cannot communicate with any endpoints, including

orsenthil commented 6 months ago

@terryjix - This is a question about setting up the VPC CNI on a non-EKS cluster. How did you go about this?

orsenthil commented 5 months ago

Closing this due to lack of more information.

github-actions[bot] commented 5 months ago

This issue is now closed. Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one.

wtvamp commented 1 month ago

This issue needs to be reopened - it seems to be a fairly ubiquitous issue when attempting to use the amazon-vpc-cni in a non-EKS environment.

I've also encountered it (coredns not able to communicate):

[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[WARNING] plugin/kubernetes: starting server with unsynced Kubernetes API
.:53
[INFO] plugin/reload: Running configuration SHA512 = 591cf328cccc12bc490481273e738df59329c62c0b729d94e8b61db9961c2fa5f046dd37f1cf888b953814040d180f52594972691cd6ff41be96639138a43908
CoreDNS-1.11.3 linux/amd64, go1.21.11, a6338e9
[ERROR] plugin/errors: 2 5717391959630560116.4828385316436471351. HINFO: read udp 10.0.0.75:57241->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5717391959630560116.4828385316436471351. HINFO: read udp 10.0.0.75:42295->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5717391959630560116.4828385316436471351. HINFO: read udp 10.0.0.75:33996->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5717391959630560116.4828385316436471351. HINFO: read udp 10.0.0.75:50361->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5717391959630560116.4828385316436471351. HINFO: read udp 10.0.0.75:58932->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5717391959630560116.4828385316436471351. HINFO: read udp 10.0.0.75:35147->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5717391959630560116.4828385316436471351. HINFO: read udp 10.0.0.75:47365->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5717391959630560116.4828385316436471351. HINFO: read udp 10.0.0.75:60287->10.0.0.2:53: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: failed to list v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[2115550610]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229 (16-Sep-2024 19:24:38.357) (total time: 30000ms): Trace[2115550610]: ---"Objects listed" error:Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30000ms (19:25:08.358) Trace[2115550610]: [30.000916518s] [30.000916518s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: Failed to watch v1.Namespace: failed to list v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: failed to list v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[935094613]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229 (16-Sep-2024 19:24:38.358) (total time: 30000ms): Trace[935094613]: ---"Objects listed" error:Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30000ms (19:25:08.358) Trace[935094613]: [30.000403807s] [30.000403807s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: Failed to watch v1.Service: failed to list v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: failed to list v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[1423531700]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229 (16-Sep-2024 19:24:38.358) (total time: 30000ms): Trace[1423531700]: ---"Objects listed" error:Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30000ms (19:25:08.359) Trace[1423531700]: [30.000293311s] [30.000293311s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: Failed to watch v1.EndpointSlice: failed to list v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[ERROR] plugin/errors: 2 5717391959630560116.4828385316436471351. HINFO: read udp 10.0.0.75:44224->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5717391959630560116.4828385316436471351. HINFO: read udp 10.0.0.75:60914->10.0.0.2:53: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: failed to list v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[1341126722]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229 (16-Sep-2024 19:25:09.591) (total time: 30000ms): Trace[1341126722]: ---"Objects listed" error:Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30000ms (19:25:39.592) Trace[1341126722]: [30.000759936s] [30.000759936s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: Failed to watch v1.EndpointSlice: failed to list v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: failed to list v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[1646410435]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229 (16-Sep-2024 19:25:09.695) (total time: 30001ms): Trace[1646410435]: ---"Objects listed" error:Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30001ms (19:25:39.696) Trace[1646410435]: [30.001364482s] [30.001364482s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: Failed to watch v1.Namespace: failed to list v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: failed to list v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[1072212733]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229 (16-Sep-2024 19:25:09.753) (total time: 30000ms): Trace[1072212733]: ---"Objects listed" error:Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30000ms (19:25:39.754) Trace[1072212733]: [30.000533915s] [30.000533915s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: Failed to watch v1.Service: failed to list v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout

wtvamp commented 1 month ago

Closing this due to lack of more information.

@orsenthil Why was this closed? It seems like there's plenty of information and repro steps?

orsenthil commented 1 month ago

fairly ubiquitous issue when attempting to use the amazon-vpc-cni in a non-EKS environment.

We will need to reproduce this and investigate. Re-opened.

wtvamp commented 1 month ago

Thanks!

I've got a cluster that reproduces this, and I'm willing to screen share/support as needed.

terryjix commented 1 month ago

I've fixed my issue by running vpc-cni-k8s on the EKS-optimized AMI. The vpc-cni-k8s plugin conflicts with ec2-net-utils: ec2-net-utils adds extra route rules, which broke pod-to-pod communication in my case. The EKS-optimized AMI avoids this issue.
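
If you want to stay on a stock AMI instead, the conflict shows up in the policy routing rules, and on Amazon Linux 2 the package can simply be removed (commands are illustrative, adjust to your distro):

# extra per-interface "from <ENI IP> lookup <table>" rules next to the CNI's own rules are the symptom
ip rule show
ip route show table main

# on Amazon Linux 2, removing ec2-net-utils stops it from rewriting routes for secondary ENIs
sudo yum remove -y ec2-net-utils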

wtvamp commented 1 month ago

I've fixed my issue by running vpc-cni-k8s on the EKS-optimized AMI. The vpc-cni-k8s plugin conflicts with ec2-net-utils: ec2-net-utils adds extra route rules, which broke pod-to-pod communication in my case. The EKS-optimized AMI avoids this issue.

Does this work even outside EKS? I think this bug was about running outside EKS (for example, I'm running self-managed on Ubuntu AMIs with kubeadm).

terryjix commented 1 month ago

Yes, I used kubeadm to create a Kubernetes cluster on an Amazon Linux 2 AMI and found that pods could not communicate with the outside. Some strange rules were created in the route table, which overwrote the rules vpc-cni created.

You can find optimized Ubuntu AMIs at https://cloud-images.ubuntu.com/aws-eks/. Maybe that can fix your issue. You can build your self-managed Kubernetes control plane on these AMIs. The optimized AMI has disabled some services that may affect network configuration in the OS.

wtvamp commented 1 month ago

Yes, I used kubeadm to create a Kubernetes cluster on an Amazon Linux 2 AMI and found that pods could not communicate with the outside. Some strange rules were created in the route table, which overwrote the rules vpc-cni created.

You can find optimized Ubuntu AMIs at https://cloud-images.ubuntu.com/aws-eks/. Maybe that can fix your issue. You can build your self-managed Kubernetes control plane on these AMIs. The optimized AMI has disabled some services that may affect network configuration in the OS.

It says clearly on the page: "These images are customised specifically for the EKS service, and are not intended as general OS images."