awslabs / amazon-eks-ami

Packer configuration for building a custom EKS AMI
https://awslabs.github.io/amazon-eks-ami/
MIT No Attribution
2.42k stars 1.14k forks source link

CA Certificate used to pull images from private repo not used by kubelet after v20240202 #1800

Closed dtledev closed 4 months ago

dtledev commented 4 months ago

What happened:

Sample error:

 Type     Reason                  Age                    From               Message
  ----     ------                  ----                   ----               -------
  Normal   Scheduled               4m25s                  default-scheduler  Successfully assigned aws-load-balancer-controller/aws-load-balancer-controller-6d947b5655-rnvzw to ip-10-128-26-140.ca-central-1.compute.internal
  Warning  FailedCreatePodSandBox  4m24s                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "e3584ead4e35206f99dafb85b2e5831b28c8996698ba2deb727df66c63a4c261": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:50051: connect: connection refused"
  Normal   Pulling                 2m36s (x4 over 4m13s)  kubelet            Pulling image "quay.ourprivatedomain.com/eks/aws-load-balancer-controller:v2.6.0"
  Warning  Failed                  2m36s (x4 over 4m13s)  kubelet            Failed to pull image "quay.ourprivatedomain.com/eks/aws-load-balancer-controller:v2.6.0": rpc error: code = Unknown desc = failed to pull and unpack image "quay.ourprivatedomain.com/eks/aws-load-balancer-controller:v2.6.0": failed to resolve reference "quay.prod-openshift-na.hybrid.sunlifecorp.com/eks/aws-load-balancer-controller:v2.6.0": failed to do request: Head "https://quay.ourprivatedomain.com/v2/eks/aws-load-balancer-controller/manifests/v2.6.0": tls: failed to verify certificate: x509: certificate signed by unknown authority
  Warning  Failed                  2m36s (x4 over 4m13s)  kubelet            Error: ErrImagePull
  Normal   BackOff                 2m21s (x6 over 4m12s)  kubelet            Back-off pulling image "quay.prod-openshift-na.hybrid.sunlifecorp.com/eks/aws-load-balancer-controller:v2.6.0"
  Warning  Failed                  2m21s (x6 over 4m12s)  kubelet            Error: ImagePullBackOff

What you expected to happen: Expect the kubelet to respect the trusted CAs that's on the node.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?: Troubleshooted by logging into the node and executing commands via systems manager

Environment:

BASE_AMI_ID="ami-059705a71ed021143"
BUILD_TIME="Fri Feb  2 16:56:07 UTC 2024"
BUILD_KERNEL="5.10.205-195.807.amzn2.x86_64"
ARCH="x86_64"
cartermckinnon commented 4 months ago

What does the user data log look like? /var/log/cloud-init-output.log

dtledev commented 4 months ago

Looks like it ran successfully, I can confirm that it is able to download the .crt file and run the update-ca-trust.

I can see the certificates loaded in the file /etc/ssl/certs/ca-bundle.crt Which is technically a symlink to /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem

cartermckinnon commented 4 months ago

Do you restart containerd after you update-ca-trust?

dtledev commented 4 months ago

restarting containerd after update-ca-trust seems to resolve it, thank you! This makes sense, perhaps something in the order or containerd between those releases.