awslabs / amazon-eks-ami

Packer configuration for building a custom EKS AMI
https://awslabs.github.io/amazon-eks-ami/
MIT No Attribution

Kubelet cluster-dns parameter is set incorrectly #222

Closed · drewhemm closed this issue 4 years ago

drewhemm commented 5 years ago

What happened: Potentially related to https://github.com/awslabs/amazon-eks-ami/issues/78 and https://github.com/awslabs/amazon-eks-ami/issues/197.

/etc/kubernetes/kubelet/kubelet-config.json has the wrong value for clusterDNS, which is set by bootstrap.sh. This results in pods being unable to resolve DNS names.

What you expected to happen: I would expect to see the correct nameserver in /etc/kubernetes/kubelet/kubelet-config.json and in /etc/resolv.conf on all pods running in the cluster.

How to reproduce it (as minimally and precisely as possible): Create an EKS cluster. My VPC happens to have two subnet CIDRs (10.50.112.0/26 and 10.9.0.0/18; not sure whether that is relevant). Create worker instances. Most have the correct nameserver value in /etc/kubernetes/kubelet/kubelet-config.json, but not all. On the worker nodes that have the wrong value, pods cannot resolve DNS queries.

Anything else we need to know?:

Environment:

This issue still exists with the latest AMI code. An instance I created last week from an AMI built from the master branch (https://github.com/awslabs/amazon-eks-ami/commit/6090f200669ba1f76ce68f23e6496b3df9bc588a) has the wrong nameserver address in /etc/kubernetes/kubelet/kubelet-config.json:

...
"clusterDNS": [
  "10.100.0.10"
],
...

The instance is deployed into a VPC with two subnet CIDRs: 10.50.112.0/26 and 10.9.0.0/18.

It does not happen all the time. Other instances created around the same time and from the same AMI have the correct value. It is an intermittent bug that I have encountered on numerous occasions, but I have yet to find the exact cause.

whereisaaron commented 5 years ago

Ref #220 #221

micahhausler commented 5 years ago

Can you provide a few more details about your VPC setup?

drewhemm commented 5 years ago

I am using CNI custom networking because we have a limited number of IP addresses in the 10.50.112.0/24 CIDR, as that subnet is mapped into our corporate IP space. The 10.50.112.x/26 CIDRs are used by the nodes and the 10.9.0.x/18 addresses are used by the pods.

In the instance userdata, I am attaching a secondary network interface as follows:

#!/bin/bash
set -o xtrace
# Secondary ENI for the underlay, code lifted from
# https://stackoverflow.com/questions/19836854/aws-cloudformation-networkinterfaces-in-autoscaling-launchconfig-group
export AWS_DEFAULT_REGION=eu-west-1

# Get the instance ID
INSTANCE_ID=$(curl -sS http://169.254.169.254/latest/meta-data/instance-id)

# And the AZ
AZ=$(curl -sS http://169.254.169.254/latest/meta-data/placement/availability-zone)

# Find the matching underlay subnet for this AZ
SUBNET_ID=$(aws ec2 describe-subnets --subnet-ids {{ underlay_subnets | join(" ")}} --filters Name=availabilityZone,Values=$AZ --query 'Subnets[0].SubnetId' --output text)

# Create a new network interface
ENI_ID=$(aws ec2 create-network-interface --subnet-id $SUBNET_ID --description 'Secondary ENI' --groups ${UnderlaySecurityGroup} --query 'NetworkInterface.NetworkInterfaceId' --output text)

# and tag it...
aws ec2 create-tags --resources $ENI_ID --tags 'Key=Foo,Value=Bar'

# Disable source dest check
# aws ec2 modify-network-interface-attribute --network-interface-id $ENI_ID --no-source-dest-check --output text

# Attach the interface to the instance
ATTACHMENT_ID=$(aws ec2 attach-network-interface --network-interface-id $ENI_ID --instance-id $INSTANCE_ID --device-index 1 --output text)

# Set the interface to delete upon instance termination
aws ec2 modify-network-interface-attribute --network-interface-id $ENI_ID --attachment AttachmentId=$ATTACHMENT_ID,DeleteOnTermination=true --output text

I am using Ansible and Jinja2 templating to generate the CloudFormation template, which is why there are some {{ variables }} in the code.

drewhemm commented 5 years ago

Could it be due to hitting a throttle on the metadata service?

https://serverfault.com/questions/774552/aws-ec2-instance-metadata-service-fails-to-respond

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html#instancedata-throttling

I suspect the curl requests are coming back empty. I'll be able to prove it once I deploy a custom AMI that echoes out the responses...

drewhemm commented 5 years ago

In the event that a call to the metadata service fails (I recall now that I faced this problem before on an unrelated AWS project a few years back), what would be the ideal thing to do? Exit the script with exit 1, or retry the call up to x times with exponential backoff and only exit 1 if it still fails after x retries?
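
For illustration, the retry option could look something like this (just a sketch; imds_get is a made-up helper, not something that exists in bootstrap.sh):

# Sketch: retry an instance metadata lookup with exponential backoff (imds_get is a hypothetical helper)
imds_get() {
  local path=$1
  local attempt=0 max_attempts=5 delay=1 value=""
  while [ "$attempt" -lt "$max_attempts" ]; do
    # -f makes curl fail on HTTP errors (e.g. a 404 page) instead of printing the error body
    if value=$(curl -sf "http://169.254.169.254/latest/meta-data/${path}") && [ -n "$value" ]; then
      echo "$value"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
    delay=$((delay * 2))
  done
  echo "Failed to read ${path} from the metadata service after ${max_attempts} attempts" >&2
  return 1
}

# Example usage: bail out of the script if the lookup never succeeds
INTERNAL_IP=$(imds_get local-ipv4) || exit 1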

drewhemm commented 5 years ago

I've come up with a process for churning through EC2 instances and checking them using k8s:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: clusterdns-test
  namespace: default
spec:
  replicas: 100
  selector:
    matchLabels:
      app: clusterdns-test
  template:
    metadata:
      labels:
        app: clusterdns-test
    spec:
      nodeSelector:
        kubernetes.io/role: test
      containers:
        - name: test
          command:
            - sleep
            - "3600"
          image: busybox
          readinessProbe:
            exec:
              command:
                - grep
                - "172.20.0.10"
                - /etc/resolv.conf
            failureThreshold: 1
            initialDelaySeconds: 10
            timeoutSeconds: 2
          resources:
            requests:
              cpu: "1500m" # used to ensure no more than one pod gets scheduled onto a t3.small instance, faster scheduling than using podAntiAffinity
      tolerations:
        - operator: Exists

The readinessProbe is used to identify pods on nodes with the incorrect clusterDNS value. So far, I have been able to observe one faulty instance in over 600 EC2 instances. My real-world observation rate was much higher.
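
To pull out the failing pods and the nodes they landed on, something like this works (sketch; it relies on kubectl's default wide output columns):

# Pods that are not Ready (READY column != 1/1) and the node each one is running on
kubectl get pods -l app=clusterdns-test -o wide --no-headers | awk '$2 != "1/1" {print $1, $7}'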

It's possible this would be quicker if I were to do some checking on the node using a script in the userdata rather than relying on k8s; will look into that tomorrow...

drewhemm commented 5 years ago

The first instance I spun up today had the wrong clusterDNS. Interestingly, none of the conditionals I had put in my custom bootstrap.sh failed:

ZONE=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
if [ -z "$ZONE" ]; then
    echo "ZONE is empty"
    exit  1
fi

...

MAC=$(curl -s http://169.254.169.254/latest/meta-data/network/interfaces/macs/ | head -n 1 | sed 's/\/$//')
if [ -z "$MAC" ]; then
    echo "MAC is empty"
    exit  1
fi

CIDRS=$(curl -s http://169.254.169.254/latest/meta-data/network/interfaces/macs/$MAC/vpc-ipv4-cidr-blocks)
if [ -z "$CIDRS" ]; then
    echo "CIDRS is empty"
    exit  1
fi

...

INTERNAL_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
if [ -z "$INTERNAL_IP" ]; then
    echo "INTERNAL_IP is empty"
    exit  1
fi
INSTANCE_TYPE=$(curl -s http://169.254.169.254/latest/meta-data/instance-type)
if [ -z "$INSTANCE_TYPE" ]; then
    echo "INSTANCE_TYPE is empty"
    exit  1
fi

This would suggest that the curl to get the CIDRS is not failing, or if it is failing, it is returning a non-empty value. Still more debugging required...

If I can't find a solution for this today, I'll have to run a patched AMI that forces the DNS_CLUSTER_IP to 172.20.0.10.
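
For the record, the stopgap I have in mind is just a post-bootstrap repair step in the userdata (sketch only; the sed rewrite and kubelet restart are my own workaround, not part of the AMI):

# Stopgap sketch: after bootstrap.sh has run, repair the clusterDNS value if the
# metadata lookup picked the wrong default, then restart the kubelet so it takes effect
if ! grep -q 172.20.0.10 /etc/kubernetes/kubelet/kubelet-config.json; then
  sed -i 's/10\.100\.0\.10/172.20.0.10/' /etc/kubernetes/kubelet/kubelet-config.json
  systemctl restart kubelet
fi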

drewhemm commented 5 years ago

At last!

I found that the best way to identify instances with an incorrect clusterDNS was to add the following to the instance userdata:

# Get the instance ID
INSTANCE_ID=$(curl -sS http://169.254.169.254/latest/meta-data/instance-id)
...
# Check the clusterDNS
if ! grep -q 172.20.0.10 /etc/kubernetes/kubelet/kubelet-config.json; then
  curl -sSX POST -H 'Content-type: application/json' --data "{\"text\":\"Bad clusterDNS on $INSTANCE_ID\"}" https://hooks.slack.com/services/############################
fi

This sends me a Slack notification for any instance where the clusterDNS is not as it should be. The error rate was approximately 1-2 per 50 instances.

The problem is caused by a 404 being returned by the curl to get the CIDRS. This results in the following output:

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
         "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title>404 - Not Found</title>
 </head>
 <body>
  <h1>404 - Not Found</h1>
 </body>
</html>

The call to list the MAC addresses is somehow clashing with the attachment of the secondary interface in the userdata (a requirement for having separate IP space for worker nodes and pods with custom CNI networking). Instead of returning the MAC address for eth0 (or for both eth0 and eth1), it sometimes returns only the MAC for eth1, which at that point has not always been assigned an IP address yet. I confirmed this by adding echo "MAC: $MAC" and echo "CIDRS: $CIDRS" to bootstrap.sh to see what the corresponding curl requests return.

If the goal is to get the eth0 MAC address (which the head -n 1 would suggest), it would be more reliable to get it from cat /sys/class/net/eth0/address. It doesn't make much sense to go to the API for information that is statically available inside the instance.

If ever the need arises to retrieve multiple MAC addresses, this can be done with cat /sys/class/net/*/address or cat /sys/class/net/eth*/address.
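
A sketch of what that change to bootstrap.sh could look like (assuming the intent really is the eth0 MAC, as the current head -n 1 implies; the -f flag is my addition so a 404 page becomes a hard failure):

# Read the primary interface's MAC locally instead of listing MACs via the metadata service
MAC=$(cat /sys/class/net/eth0/address)

# Only the CIDR lookup still needs the metadata service
CIDRS=$(curl -sf http://169.254.169.254/latest/meta-data/network/interfaces/macs/$MAC/vpc-ipv4-cidr-blocks)
if [ -z "$CIDRS" ]; then
    echo "CIDRS is empty"
    exit 1
fi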

I will open a new PR...

drewhemm commented 5 years ago

A workaround is to run bootstrap.sh before adding the secondary interface, but that doesn't actually fix the issue.
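
In userdata terms the workaround is just a reordering (sketch; the cluster name is a placeholder):

#!/bin/bash
set -o xtrace
# 1. Configure and start the kubelet while eth0 is still the only interface,
#    so the MAC listing in bootstrap.sh cannot pick up the half-attached eth1
/etc/eks/bootstrap.sh my-cluster

# 2. Only afterwards create and attach the secondary ENI for the underlay
#    (the create-network-interface / attach-network-interface steps shown earlier)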

mogren commented 5 years ago

Great find, thanks for investigating this issue. This might cause issues for the CNI as well...

mogren commented 4 years ago

This was resolved in #226