aws-samples / amazon-eks-custom-amis

Amazon EKS custom AMIs based on Amazon Linux
MIT No Attribution

Error with containerd and EKS 1.21 #45

Closed ichasco-heytrade closed 1 year ago

ichasco-heytrade commented 3 years ago

What happened:

With the EKS 1.21 AMI, using the containerd option fails: because of the sysctl_entry "net.ipv4.ip_forward = 0", none of the deployed pods have access to the network.

With docker there isn't any problem.

Environment:

voidlily commented 1 year ago

Are you still encountering this issue on 1.23?

voidlily commented 1 year ago

I was able to reproduce this issue on 1.23 as well

voidlily commented 1 year ago

Something I haven't yet identified is already setting net.ipv4.ip_forward=1 when bootstrap.sh runs in docker mode, while nothing does so in containerd mode. That means this control was never actually doing anything in the first place. In the meantime I'm mitigating this by commenting out the control, since it was being canceled out later on anyway when using docker.
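
One way to narrow this down is to compare what the sysctl configuration files request with the value the kernel is actually using. The sketch below assumes the standard sysctl.d search paths, and `find_ip_forward_settings` is a hypothetical helper name, not part of bootstrap.sh. If no file sets the value to 1 but the kernel still reports 1, something is changing it at runtime. The Docker daemon, for example, enables IP forwarding itself by default (its `--ip-forward` option), which may be what happens in docker mode.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of bootstrap.sh): print every uncommented
# net.ipv4.ip_forward assignment under the given files or directories,
# formatted as file:line:content.
find_ip_forward_settings() {
    grep -rHnE '^[[:space:]]*net\.ipv4\.ip_forward[[:space:]]*=' "$@" 2>/dev/null || true
}

# Usage on a node (paths assume the standard sysctl.d locations):
#   find_ip_forward_settings /etc/sysctl.conf /etc/sysctl.d /run/sysctl.d /usr/lib/sysctl.d
#   cat /proc/sys/net/ipv4/ip_forward   # value the kernel is actually using
```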

rothgar commented 1 year ago

I'm going to try to reproduce this in all our supported EKS versions. Do you have an example of how you're creating the cluster/nodes (eg eksctl config, terraform)?

voidlily commented 1 year ago

I was using terraform for the nodes. If you're able to reproduce it, you should see nodes running in docker mode

# cat /proc/sys/net/ipv4/ip_forward
1

and in containerd mode

# cat /proc/sys/net/ipv4/ip_forward
0

rothgar commented 1 year ago

I provisioned a cluster with this config using eksctl

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: container-runtime-test
  region: us-east-2

nodeGroups:
  - name: ng-1
    instanceType: m5.xlarge
    desiredCapacity: 2
    amiFamily: AmazonLinux2
    containerRuntime: containerd

and verified the version

k version --short
Flag --short has been deprecated, and will be removed in the future. The --short output will become the default.
Client Version: v1.26.1
Kustomize Version: v4.5.7
Server Version: v1.23.14-eks-ffeb93d

I SSHd to one of the nodes and checked containerd was running

sudo systemctl status containerd
● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/containerd.service.d
           └─10-compat-symlink.conf
   Active: active (running) since Thu 2023-01-19 22:02:35 UTC; 1h 34min ago
     Docs: https://containerd.io
 Main PID: 3022 (containerd)

and the OS version

NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"

I verified the launch template was using AMI ami-097b4903ba6f2b624 which is the latest EKS 1.23 AL2 AMI in us-east-2

aws ssm get-parameter --name /aws/service/eks/optimized-ami/1.23/amazon-linux-2/recommended/image_id --query "Parameter.Value" --output text
ami-097b4903ba6f2b624
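
To repeat this lookup for the other Kubernetes versions under test, the parameter name can be built from the version string. This is a minimal sketch: `eks_al2_ami_param` is a hypothetical helper, and the path layout is simply copied from the command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the SSM parameter name for the recommended
# EKS-optimized Amazon Linux 2 AMI of a given Kubernetes version.
eks_al2_ami_param() {
    local k8s_version="$1"
    printf '/aws/service/eks/optimized-ami/%s/amazon-linux-2/recommended/image_id\n' "$k8s_version"
}

# Usage (needs the AWS CLI and credentials; the region is only an example):
#   for v in 1.21 1.22 1.23; do
#       aws ssm get-parameter --region us-east-2 \
#           --name "$(eks_al2_ami_param "$v")" \
#           --query "Parameter.Value" --output text
#   done
```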

and ip_forward is set correctly

cat /proc/sys/net/ipv4/ip_forward                                                                                                                                                 
1

I'm going to try again with a 1.22 cluster and see if I get the same results.

rothgar commented 1 year ago

Verified ip_forward was set on a 1.22 cluster using AMI ami-09ae6038e08d7e8ba which is the latest 1.22 AMI in us-east-2.

Verified on a 1.21 cluster with ami-021b765d61a4b649f (latest 1.21 AL2 AMI) in us-east-1 and ip_forward was set with containerd running.

It's possible eksctl is doing something extra with the node groups but I'd have to dig into it. If you have an AMI ID and region I can test with that would be helpful to verify.

bryantbiggs commented 1 year ago

> It's possible eksctl is doing something extra with the node groups but I'd have to dig into it.

If it is, it would be in the user data most likely.

kalgopa commented 1 year ago

We had a similar issue.

Environment:
OS: AmazonLinux
OS Version: 2
EKS Version: 1.21 and 1.22

We implemented the STIG config for ip_forward using:

# Set OS to not perform packet forwarding unless system is a router, V-204625
function V204625() {
    # Regex1/Regex2: match and rewrite a commented-out ip_forward entry.
    local Regex1="^(\s*)#net.ipv4.ip_forward\s+\S+(\s*#.*)?\s*$"
    local Regex2="s/^(\s*)#net.ipv4.ip_forward\s+\S+(\s*#.*)?\s*$/\1net.ipv4.ip_forward = 0\2/"
    # Regex3/Regex4: match and rewrite an active ip_forward entry.
    local Regex3="^(\s*)net.ipv4.ip_forward\s+\S+(\s*#.*)?\s*$"
    local Regex4="s/^(\s*)net.ipv4.ip_forward\s+\S+(\s*#.*)?\s*$/\1net.ipv4.ip_forward = 0\2/"
    # Regex5: verify that the final setting is present.
    local Regex5="^(\s*)net.ipv4.ip_forward\s*=\s*0?\s*$"
    local Success="Set system to not perform packet forwarding, per V-204625."
    local Failure="Failed to set the system to not perform packet forwarding, not in compliance V-204625."

    echo
    # Rewrite an existing entry if one matches; otherwise append the setting.
    ( (grep -E -q "${Regex1}" /etc/sysctl.conf && sed -ri "${Regex2}" /etc/sysctl.conf) || (grep -E -q "${Regex3}" /etc/sysctl.conf && sed -ri "${Regex4}" /etc/sysctl.conf)) || echo "net.ipv4.ip_forward = 0" >>/etc/sysctl.conf
    (grep -E -q "${Regex5}" /etc/sysctl.conf && echo "${Success}") || {
        echo "${Failure}"
        exit 1
    }
}

With this we had:

docker mode

# cat /proc/sys/net/ipv4/ip_forward
1

containerd mode

# cat /proc/sys/net/ipv4/ip_forward
0

Resolution: We removed the ip_forward (V-204625) implementation.

Result:

docker mode

# cat /proc/sys/net/ipv4/ip_forward
1

containerd mode

# cat /proc/sys/net/ipv4/ip_forward
1

In the cis-benchmark.sh file you can comment out these lines: https://github.com/aws-samples/amazon-eks-custom-amis/blob/main/scripts/cis-benchmark.sh#L329-L331
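
Commenting out a line range in a local checkout could be sketched as below. `comment_out_lines` is a hypothetical helper, and the line numbers are only those from the link above; they may have shifted in your copy, so inspect the range before editing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prefix lines START..END of FILE with "# ".
# A FILE.bak backup of the original is written first.
comment_out_lines() {
    local file="$1" start="$2" end="$3"
    cp "$file" "$file.bak"
    sed "${start},${end}s/^/# /" "$file.bak" > "$file"
}

# Usage, assuming lines 329-331 of your checkout still hold the ip_forward control:
#   sed -n '329,331p' scripts/cis-benchmark.sh   # inspect before editing
#   comment_out_lines scripts/cis-benchmark.sh 329 331
```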

rothgar commented 1 year ago

Thanks for the extra info @kalgopa

My assumption is that the systems having this issue are using custom-built AMIs that do not perform the steps needed to enable ip_forward with non-docker container runtimes, even if they're based on the EKS-provided AMIs. If anyone on this ticket has an Amazon-published AMI that's experiencing this problem, or terraform code to create a cluster, please let me know so I can reproduce it.
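
For a custom AMI that must run containerd, one way to make the setting explicit is a sysctl.d drop-in written at build or boot time. This is a sketch only: the helper name and the file name `99-ip-forward.conf` are made up for illustration, and it deliberately overrides the V-204625 / CIS control, so it would have to be tracked as an exception to the hardening baseline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a sysctl drop-in that enables IPv4 forwarding.
# DIR defaults to /etc/sysctl.d; the file name is an illustrative choice.
write_ip_forward_dropin() {
    local dir="${1:-/etc/sysctl.d}"
    printf 'net.ipv4.ip_forward = 1\n' > "$dir/99-ip-forward.conf"
}

# Usage as root on the node or during the AMI build:
#   write_ip_forward_dropin
#   sysctl --system                      # reload all sysctl configuration files
#   cat /proc/sys/net/ipv4/ip_forward
```

Note that `sysctl --system` reads /etc/sysctl.conf last, so a `net.ipv4.ip_forward = 0` entry left in that file would still win over the drop-in and needs to be removed as well.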

bryantbiggs commented 1 year ago

closing for now - @kalgopa please let us know if there is additional info and we can take another look