aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[EKS]: Support for Arm Nodes #264

Closed tabern closed 4 years ago

tabern commented 5 years ago

Amazon EKS now supports Arm processor EC2 A1 instances as a developer preview. You can now run containers using EC2 A1 instances on a Kubernetes cluster that is managed by Amazon EKS.

Learn more and get started here: https://github.com/aws/containers-roadmap/tree/master/preview-programs/

Learn more about Amazon A1 instances: https://aws.amazon.com/ec2/instance-types/a1/

Please leave feedback and comments on the preview using this ticket.

t0ny-peng commented 5 years ago

Thank you for bringing in Arm support! I tried out the article. Here's my feedback.

  1. After step 6, I do see the Arm64 node become READY. However, it took several tries for the pod aws-node-arm-d788f to start, and even then it keeps crashing and restarting. Here are the events for that pod.

    Events:
      Type     Reason     Age               From                                                  Message
      ----     ------     ----              ----                                                  -------
      Normal   Scheduled  9m                default-scheduler                                     Successfully assigned kube-system/aws-node-arm-d788f to ip-172-31-36-179.us-west-2.compute.internal
      Normal   Created    6m (x4 over 8m)   kubelet, ip-172-31-36-179.us-west-2.compute.internal  Created container
      Normal   Started    6m (x4 over 8m)   kubelet, ip-172-31-36-179.us-west-2.compute.internal  Started container
      Normal   Pulling    5m (x5 over 9m)   kubelet, ip-172-31-36-179.us-west-2.compute.internal  pulling image "940911992744.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-arm64:v1.3.3"
      Normal   Pulled     5m (x5 over 9m)   kubelet, ip-172-31-36-179.us-west-2.compute.internal  Successfully pulled image "940911992744.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-arm64:v1.3.3"
      Warning  BackOff    4m (x12 over 7m)  kubelet, ip-172-31-36-179.us-west-2.compute.internal  Back-off restarting failed container

    I uploaded the whole description to S3: https://s3-us-west-2.amazonaws.com/public.apex.ai/aws-node-arm64-pod.txt

  2. I didn't find anything wrong with the Arm node itself. Here's the description: https://s3-us-west-2.amazonaws.com/public.apex.ai/arm64-node.txt

  3. I tried to start an Ubuntu pod on this Arm64 node (using a nodeSelector), but it failed to start. The events show this:

    Events:
      Type     Reason                  Age                From                                                  Message
      ----     ------                  ----               ----                                                  -------
      Normal   Scheduled               19m                default-scheduler                                     Successfully assigned default/ubuntu-arm64-sample-7dc4c76c4f-fkl4r to ip-172-31-36-179.us-west-2.compute.internal
      Warning  FailedCreatePodSandBox  19m                kubelet, ip-172-31-36-179.us-west-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "70b6cef67df071b8a488bdbf49cda1ffae41e809c149e6c47108d61e784f15c3" network for pod "ubuntu-arm64-sample-7dc4c76c4f-fkl4r": NetworkPlugin cni failed to set up pod "ubuntu-arm64-sample-7dc4c76c4f-fkl4r_default" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "70b6cef67df071b8a488bdbf49cda1ffae41e809c149e6c47108d61e784f15c3" network for pod "ubuntu-arm64-sample-7dc4c76c4f-fkl4r": NetworkPlugin cni failed to teardown pod "ubuntu-arm64-sample-7dc4c76c4f-fkl4r_default" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
      Normal   SandboxChanged          4m (x71 over 19m)  kubelet, ip-172-31-36-179.us-west-2.compute.internal  Pod sandbox changed, it will be killed and re-created.

    Full description can be found here: https://s3-us-west-2.amazonaws.com/public.apex.ai/aws-ubuntu.txt

pablomfc commented 5 years ago

Hello,

My node is not becoming ready. It looks like an issue with ECR.

I'm using EKS 1.12.

NAME                                                          READY   STATUS              RESTARTS   AGE
aws-node-arm-cd7x6                                            0/1     ContainerCreating   0          24m

kubectl describe pod aws-node-arm-cd7x6

... Warning FailedCreatePodSandBox 41s (x119 over 25m) kubelet, ip-xx-xx-xx-xx.us-east-2.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = Error response from daemon: pull access denied for 602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/pause-arm64, repository does not exist or may require 'docker login' ...

tabern commented 5 years ago

@pablomfc have you checked your IAM roles for the node? Node role could be missing the AmazonEC2ContainerRegistryPowerUser policy, or something similar, which allows access to ECR for pulling the image.

pablomfc commented 5 years ago

Thanks for the suggestion @tabern. Actually, I use "AmazonEC2ContainerRegistryReadOnly". This A1 instance node is running alongside other amd64 nodes and shares the same IAM role.

pablomfc commented 5 years ago

I figured out the problem!

Because I'm using an existing cluster to run the A1 instance, I forgot to include the BootstrapArguments: --pause-container-account 940911992744 for bootstrap.sh.

After that I found another problem: the repository 940911992744.dkr.ecr.us-east-2.amazonaws.com/eks/pause-arm64 does not exist in region us-east-2 (where my cluster is located).

Looking at https://github.com/awslabs/amazon-eks-ami/blob/16bd0311c069f4b70a10205211b41845e59259d7/files/bootstrap.sh#L198, I realized that I needed to update the file /etc/systemd/system/kubelet.service.d/10-kubelet-args.conf on the node from:

[Service]
Environment='KUBELET_ARGS=--node-ip=xx.xx.xx.xx --pod-infra-container-image=940911992744.dkr.ecr.us-east-2.amazonaws.com/eks/pause-arm64:3.1'

to

[Service]
Environment='KUBELET_ARGS=--node-ip=xx.xx.xx.xx --pod-infra-container-image=940911992744.dkr.ecr.us-west-2.amazonaws.com/eks/pause-arm64:3.1'

systemctl daemon-reload
systemctl restart kubelet
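
The manual edit above can be scripted. The following is a minimal sketch, not part of the original comment: the function name `rewrite_pause_region` and its arguments are hypothetical, and it assumes the drop-in file references the `eks/pause-arm64` repository exactly as shown above.

```shell
# Sketch only: rewrite_pause_region is a hypothetical helper name.
# It changes the ECR region in the pause-arm64 image reference and
# leaves the rest of the kubelet drop-in untouched.
#
# Arguments: FILE FROM_REGION TO_REGION
rewrite_pause_region() {
  rpr_file="$1"
  rpr_from="$2"
  rpr_to="$3"
  # -i.bak edits in place and keeps a backup copy next to the file.
  sed -i.bak \
    "s#\.dkr\.ecr\.${rpr_from}\.amazonaws\.com/eks/pause-arm64#.dkr.ecr.${rpr_to}.amazonaws.com/eks/pause-arm64#" \
    "$rpr_file"
}

# Example call on a node (run as root):
#   rewrite_pause_region \
#     /etc/systemd/system/kubelet.service.d/10-kubelet-args.conf \
#     us-east-2 us-west-2
```

After the file is rewritten, the kubelet still has to be reloaded and restarted with the two systemctl commands above.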

pablomfc commented 5 years ago

As a quick fix, I put this after the bootstrap.sh call in user_data:

INTERNAL_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)

cat <<EOF > /etc/systemd/system/kubelet.service.d/10-kubelet-args.conf
[Service]
Environment='KUBELET_ARGS=--node-ip=$INTERNAL_IP --pod-infra-container-image=940911992744.dkr.ecr.us-west-2.amazonaws.com/eks/pause-arm64:3.1'
EOF

systemctl daemon-reload
systemctl restart kubelet

scraton commented 5 years ago

I had the same issue as @pablomfc with a cluster in us-east-1. Applying the changes described fixed the issue for the aws-node-arm pods.

However, kube-proxy does not work, and just goes into a crash cycle, seemingly due to it not being compiled for ARM:

$ kubectl logs -f kube-proxy-glxk4
standard_init_linux.go:190: exec user process caused "exec format error"

The container image is pulling from 602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/kube-proxy:v1.12.6. I've also tried changing the account to 940911992744 but the image does not exist there.

Has anyone had any success getting kube-proxy to work?

scraton commented 5 years ago

I was able to get kube-proxy to work on the cluster by executing the following:

kubectl patch pod kube-proxy-glxk4 -p '{"spec":{"containers":[{"image":"k8s.gcr.io/kube-proxy:v1.12.6","name":"kube-proxy"}]}}'

This uses the kube-proxy image from gcr.io rather than Amazon's. The one from gcr.io has multiarch support, unlike AWS's.

This command only patches the specific pod, but you should be able to apply it to the entire DaemonSet if desired. I'm just not sure what modifications AWS has made to the kube-proxy image, so it might not be entirely safe to do that.
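
A DaemonSet-wide version of the same change could look like the sketch below. It is an illustration under assumptions: the helper name `kube_proxy_patch` is hypothetical, and the DaemonSet is assumed to be named `kube-proxy` in the `kube-system` namespace with a container also named `kube-proxy`.

```shell
# Sketch only: kube_proxy_patch is a hypothetical helper name.
# It prints a patch that sets the kube-proxy container image inside
# a DaemonSet's pod template (spec.template.spec), rather than on a
# single pod.
#
# Argument: IMAGE (for example k8s.gcr.io/kube-proxy:v1.12.6)
kube_proxy_patch() {
  printf '{"spec":{"template":{"spec":{"containers":[{"name":"kube-proxy","image":"%s"}]}}}}' "$1"
}

# Example call (assumes a DaemonSet named kube-proxy in kube-system):
#   kubectl -n kube-system patch daemonset kube-proxy \
#     -p "$(kube_proxy_patch k8s.gcr.io/kube-proxy:v1.12.6)"
```

Patching the DaemonSet means replacement pods keep the new image, whereas a patch applied to one pod is lost when that pod is recreated.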

mcrute commented 5 years ago

@scraton both coredns and kube-proxy default to the amd64 versions in a default EKS cluster so they will both need to be patched. We build our container images from the upstream code-base without local modifications so you should not be missing anything. I'm in the process of publishing updated container images for ARM as well as doc updates for the beta that should resolve this issue. I'll post back here once those are live.

Note that because ECR does not yet support multi-architecture images I'll be posting them to repos with a -arm64 suffix, much the way we distribute the CNI image.

zhouziyang commented 5 years ago

@left4taco Hi, I encountered the same issue as you. Did you resolve it?

I checked ipamd.log on the Arm64 node, and it reports "Failed to create client: error communicating with apiserver: Get https://10.100.0.1:443/version?timeout=32s".

The Amd64 node works fine (it can connect to the API server), but the Arm64 node does not. Both nodes are in the same VPC (and the same security group) but not in the same subnet (Availability Zone).

Any idea? Thanks!

t0ny-peng commented 5 years ago

@zhouziyang

I just gave it a try yesterday and still it didn’t work out.

This time I can see the node become ready, but the aws-node pod running on the aarch64 machine keeps crashing. The CNI plugin cannot assign a secondary IP to the machine.

zhouziyang commented 5 years ago

I made a DNAT rule for the API server (DNAT to its public IP), and the aws-node-arm pods worked (but this introduced network issues across nodes). I even built a CNI 1.5.0 Docker image, but it still did not work. I think (as you said) the root cause is related to the network connection between the VPC and the k8s cluster. I'm still investigating. @tabern any idea about this issue? Or maybe something was configured incorrectly during EKS setup? Thanks!

t0ny-peng commented 5 years ago

@zhouziyang I don't get it. The aarch64 node has exactly the same VPC and security group as the x86_64 node, and it runs the correct architecture of CNI 1.3.3. How come it cannot establish a connection to the control plane?

zhouziyang commented 5 years ago

@left4taco It seems the arm64 node is missing iptables rules. I restored the iptables rules from the x86_64 nodes, and that seems to have worked!
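
One way to see which rules differ between the two nodes is to compare `iptables-save` dumps. This is a rough diagnostic sketch and not what the commenter ran: the helper name `missing_rules` is hypothetical, and some differences (for example node-specific IP addresses) are expected and harmless.

```shell
# Sketch only: missing_rules is a hypothetical helper name.
# Prints every "-A ..." rule line that appears in REFERENCE_DUMP but
# not in TARGET_DUMP. Both files are expected to be iptables-save output.
#
# Arguments: REFERENCE_DUMP TARGET_DUMP
missing_rules() {
  mr_target=$(mktemp)
  # Collect the target's rule lines; "|| true" tolerates a dump with no rules.
  grep '^-A ' "$2" > "$mr_target" || true
  # -F fixed strings, -x whole-line match, -v invert: keep unmatched rules.
  grep '^-A ' "$1" | grep -Fvx -f "$mr_target" || true
  rm -f "$mr_target"
}

# Example workflow (run iptables-save as root on each node first):
#   iptables-save > x86_64-node.rules    # on the working x86_64 node
#   iptables-save > arm64-node.rules     # on the failing arm64 node
#   missing_rules x86_64-node.rules arm64-node.rules
```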

t0ny-peng commented 5 years ago

@tabern May I ask what the progress is on adding official support for ARM64 machines to EKS? Thanks. Can't wait to see this feature.

tabern commented 5 years ago

Hi @left4taco we're working on solving these issues and updating the preview so it doesn't take so much manual work to get going! Stay tuned.

tabern commented 5 years ago

Here are the issues I've identified from the conversation here. Let me know if I'm missing any @left4taco @zhouziyang @scraton @pablomfc

midN commented 5 years ago

Any news on getting an AMI for 1.13/1.14?

kazuwal commented 5 years ago

Hello

My A1 instances are stuck in a cycle: Running -> Error -> CrashLoopBackOff

kubectl get pods -n kube-system

aws-node-arm-5rbt7 0/1 CrashLoopBackOff 147 13h
aws-node-arm-pwtv4 0/1 CrashLoopBackOff 146 13h
aws-node-arm-t2pvv 1/1 Running 147 13h

Is this a known issue? I followed the current instructions exactly.

lyndon160 commented 5 years ago

Hi,

Great to see Arm support on AWS.

I've just followed the guide for using A1 EKS and have come across an issue when deploying the Redis example.

The pods get stuck on creation and kubectl events show this:

Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "b07c2b2d3ed1ecb8f554a8ec0da7081b570f055e7b8047b4a0da231cbda35dc9" network for pod "redis-slave-kxdps": NetworkPlugin cni failed to set up pod "redis-slave-kxdps_default" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "b07c2b2d3ed1ecb8f554a8ec0da7081b570f055e7b8047b4a0da231cbda35dc9" network for pod "redis-slave-kxdps": NetworkPlugin cni failed to teardown pod "redis-slave-kxdps_default" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"

Colleagues have had similar issues. Has this been tested recently and confirmed to still be working?

Thanks

tabern commented 4 years ago

Hey everyone, a few updates here.

  1. We just updated the ARM preview to include support for the AMI SSM parameters and the new M6g Graviton instances! Check it out

  2. We have a new issue template and labels specifically for the ARM preview. PLEASE use this template to create any additional issues for questions or bugs so we can track them and mark them resolved. We'll continue to keep this issue open to track our GA deliverable.

-Nate
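
For readers who want to look up the Arm AMI through SSM, a sketch follows. The parameter path is an assumption based on the pattern documented for EKS-optimized Amazon Linux 2 AMIs and should be checked against the current EKS documentation. The helper name `eks_arm64_ami_parameter` is hypothetical, and the region in the example is only a placeholder.

```shell
# Sketch only: eks_arm64_ami_parameter is a hypothetical helper name.
# Prints the SSM parameter name assumed to hold the recommended
# EKS-optimized arm64 AMI ID for a given Kubernetes version.
#
# Argument: K8S_VERSION (for example 1.15)
eks_arm64_ami_parameter() {
  printf '/aws/service/eks/optimized-ami/%s/amazon-linux-2-arm64/recommended/image_id' "$1"
}

# Example call (needs the AWS CLI and credentials; region is a placeholder):
#   aws ssm get-parameter \
#     --name "$(eks_arm64_ami_parameter 1.15)" \
#     --region us-west-2 \
#     --query 'Parameter.Value' --output text
```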

rverma-jm commented 4 years ago

Can we also use Bottlerocket OS with ARM? https://aws.amazon.com/jp/about-aws/whats-new/2020/03/announcing-bottlerocket-a-new-open-source-linux-based-operating-system-optimized-to-run-containers/

samuelkarp commented 4 years ago

@rverma-jm Not yet; we're tracking that work in https://github.com/bottlerocket-os/bottlerocket/issues/468.

max-lobur commented 4 years ago

Is 1.15 supported right now? I see official AMIs available, yet the docs still mention that 1.15 is not supported: https://docs.aws.amazon.com/eks/latest/userguide/arm-support.html

I'm going to try anyways :D

max-lobur commented 4 years ago

It actually worked. I used the 1.15 YAMLs where available (only kube-proxy for now), and 1.14 for everything else: https://github.com/aws/containers-roadmap/tree/master/preview-programs/eks-arm-preview

otterley commented 4 years ago

M6g EC2 instances powered by Graviton2 processors went GA today - it would be useful to publish EKS-optimized AMIs for Arm alongside the AMIs for x86-64. These offer significant cost-optimization opportunities for AWS customers.

tabern commented 4 years ago

@otterley today the ARM preview supports M6 instances! Check it out

We are also working to make this support generally available to all customers and will update this ticket when we launch.

otterley commented 4 years ago

@tabern Thanks for the pointer. Instructions for EKS 1.15/1.16 appear to be missing - is this because the components aren't available (I know the AMI is available), or do the details just need to be updated?

tkinz27 commented 4 years ago

Are mixed x86_64 and arm64 clusters supported?

schjan commented 4 years ago

As ECR now supports manifest lists (#505), it would be awesome to push the EKS Helper Images as multi-arch images. Then mixed clusters would be super easy.

srinivasrb commented 4 years ago

Thanks for bringing support for ARM in EKS. This works well. I am now trying to add Container Insights to monitor an ARM cluster. I followed the instructions in https://docs.aws.amazon.com/eks/latest/userguide/arm-support.html to create a new cluster, and then followed https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-EKS-quickstart.html to try and set up Container Insights.

Unfortunately, the CloudWatch Agent and the FluentD agent pods don't seem to be starting up - they get into a CrashLoopBackoff. The events in the pod seem to indicate that the image being used is amazon/cloudwatch-agent:1.231221.0. At least on Docker Hub, this seems to be compiled for x64 and not ARM. Any chance there's an image for this and Fluentd for ARM architecture?

Thanks!

ckdarby commented 4 years ago

@mikestef9 Is there a ticket tracking mixed support of x86_64 and arm64?

mikestef9 commented 4 years ago

@ckdarby this is that ticket. As part of GA support for ARM, we are leveraging the recently launched ECR multi-arch feature to allow for heterogeneous clusters.

ckdarby commented 4 years ago

@mikestef9 I should have been more specific: mixed managed node groups.

mikestef9 commented 4 years ago

@ckdarby We will be launching Arm support for managed node groups as part of the GA launch. Each node group will still be a single instance type, though, so you can have heterogeneous clusters with multiple node groups. Are you asking for a single managed node group with multiple instance types?

ckdarby commented 4 years ago

@mikestef9 Thanks for the update, excited to see GA come :)

Are you asking for a single managed node group with multiple instance types?

Nope, just multiple managed groups with some ARM and some not.

ab77 commented 4 years ago

So while we are waiting for official support, at the time of writing it is possible to create a mixed-mode EKS cluster, where some nodegroups run the amd64 architecture and some run arm64.

It's pretty straightforward to do by broadly following the AWS guide, with the exception of:

(e.g.) kube-proxy-arm64 manifest

---
kind: DaemonSet
apiVersion: apps/v1
metadata:
  labels:
    k8s-app: kube-proxy-arm64
    eks.amazonaws.com/component: kube-proxy
  name: kube-proxy-arm64
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: kube-proxy-arm64
...

Making these changes will ensure that when the Graviton arm64 nodes join the EKS cluster, they will have containers of the correct architecture deployed to them and become ready.
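
Keeping a per-architecture DaemonSet off nodes of the other architecture is usually done with a nodeSelector. The sketch below is an illustration, not the commenter's exact change. The helper names are hypothetical, and it assumes nodes carry the `kubernetes.io/arch` label (older clusters may only set `beta.kubernetes.io/arch`).

```shell
# Sketch only: both helper names below are hypothetical.

# Map a machine type as reported by `uname -m` to the value Kubernetes
# uses in its architecture node label.
k8s_arch() {
  case "$1" in
    x86_64) echo amd64 ;;
    aarch64|arm64) echo arm64 ;;
    *) echo "unsupported machine type: $1" >&2; return 1 ;;
  esac
}

# Print a patch that pins a DaemonSet's pods to one architecture.
# Assumes the kubernetes.io/arch node label is present on the nodes.
arch_selector_patch() {
  printf '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/arch":"%s"}}}}}' "$1"
}

# Example call for the kube-proxy-arm64 DaemonSet shown above:
#   kubectl -n kube-system patch daemonset kube-proxy-arm64 \
#     -p "$(arch_selector_patch arm64)"
```

The existing amd64 DaemonSets would need the matching `amd64` selector so that their pods are not scheduled onto the Graviton nodes.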

mikestef9 commented 4 years ago

Amazon EKS support for Arm-based instances is now generally available! See the launch blog and EKS documentation for more details.

Notable updates with general availability include:

abdennour commented 4 years ago

I upgraded to EKS 1.17 to take advantage of the ARM architecture support (r6g.large instances). The instance joins the cluster, but when I run "kubectl get nodes", all instances have a clear status except the ARM instance, which shows an "Unknown" status and an "Unknown" name.

shrivastavshubham34 commented 3 years ago

I'm trying to add an ARM nodegroup to an existing EKS cluster that has a non-ARM node group. After creating it, though, I get an error.

I also tried creating unmanaged nodegroups using eksctl and still faced the same error. The ARM node is in the NotReady state.