Thank you for bringing in Arm support! I tried out the article. Here's my feedback.
After step 6, I do see the Arm64 node become READY. However, it took several tries for the pod aws-node-arm-d788f
to start, and even then it keeps crashing and restarting. Here are the events of that pod:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 9m default-scheduler Successfully assigned kube-system/aws-node-arm-d788f to ip-172-31-36-179.us-west-2.compute.internal
Normal Created 6m (x4 over 8m) kubelet, ip-172-31-36-179.us-west-2.compute.internal Created container
Normal Started 6m (x4 over 8m) kubelet, ip-172-31-36-179.us-west-2.compute.internal Started container
Normal Pulling 5m (x5 over 9m) kubelet, ip-172-31-36-179.us-west-2.compute.internal pulling image "940911992744.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-arm64:v1.3.3"
Normal Pulled 5m (x5 over 9m) kubelet, ip-172-31-36-179.us-west-2.compute.internal Successfully pulled image "940911992744.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-arm64:v1.3.3"
Warning BackOff 4m (x12 over 7m) kubelet, ip-172-31-36-179.us-west-2.compute.internal Back-off restarting failed container
I uploaded the whole description to S3: https://s3-us-west-2.amazonaws.com/public.apex.ai/aws-node-arm64-pod.txt
I didn't find anything wrong with the Arm node itself. Here's the description: https://s3-us-west-2.amazonaws.com/public.apex.ai/arm64-node.txt
I tried to start an Ubuntu pod that runs on this Arm64 node (using a nodeSelector; a minimal example is sketched below), but it fails to start. The events show:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 19m default-scheduler Successfully assigned default/ubuntu-arm64-sample-7dc4c76c4f-fkl4r to ip-172-31-36-179.us-west-2.compute.internal
Warning FailedCreatePodSandBox 19m kubelet, ip-172-31-36-179.us-west-2.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "70b6cef67df071b8a488bdbf49cda1ffae41e809c149e6c47108d61e784f15c3" network for pod "ubuntu-arm64-sample-7dc4c76c4f-fkl4r": NetworkPlugin cni failed to set up pod "ubuntu-arm64-sample-7dc4c76c4f-fkl4r_default" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "70b6cef67df071b8a488bdbf49cda1ffae41e809c149e6c47108d61e784f15c3" network for pod "ubuntu-arm64-sample-7dc4c76c4f-fkl4r": NetworkPlugin cni failed to teardown pod "ubuntu-arm64-sample-7dc4c76c4f-fkl4r_default" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
Normal SandboxChanged 4m (x71 over 19m) kubelet, ip-172-31-36-179.us-west-2.compute.internal Pod sandbox changed, it will be killed and re-created.
Full description can be found here: https://s3-us-west-2.amazonaws.com/public.apex.ai/aws-ubuntu.txt
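For reference, here is a minimal sketch of the kind of nodeSelector manifest involved; the image and names are placeholders for illustration (on 1.12-era clusters the architecture label is beta.kubernetes.io/arch, kubernetes.io/arch on newer versions):

apiVersion: v1
kind: Pod
metadata:
  name: ubuntu-arm64-sample
spec:
  nodeSelector:
    beta.kubernetes.io/arch: arm64    # schedule only onto Arm64 nodes
  containers:
  - name: ubuntu
    image: arm64v8/ubuntu:18.04       # arch-specific Ubuntu image, used here just as an example
    command: ["sleep", "infinity"]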
Hello,
My node is not becoming ready. It looks like an issue with ECR.
I'm using EKS 1.12.
NAME READY STATUS RESTARTS AGE
aws-node-arm-cd7x6 0/1 ContainerCreating 0 24m
kubectl describe pod aws-node-arm-cd7x6
...
Warning FailedCreatePodSandBox 41s (x119 over 25m) kubelet, ip-xx-xx-xx-xx.us-east-2.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = Error response from daemon: pull access denied for 602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/pause-arm64, repository does not exist or may require 'docker login'
...
@pablomfc have you checked your IAM roles for the node? The node role could be missing the AmazonEC2ContainerRegistryPowerUser policy, or something similar, which allows access to ECR for pulling the image.
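For example, the attached policies can be listed with the AWS CLI (the role name below is a placeholder for the node instance role):

aws iam list-attached-role-policies \
  --role-name <node-instance-role-name> \
  --query 'AttachedPolicies[].PolicyName'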
Thanks for the suggestion @tabern; I'm actually using "AmazonEC2ContainerRegistryReadOnly". This A1 instance node is running alongside other amd64 nodes and shares the same IAM role.
I figured out the problem!
As I'm using an existing cluster to run the A1 instance, I forgot to include the BootstrapArguments --pause-container-account 940911992744 for bootstrap.sh.
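For anyone else adding an A1 node to an existing cluster, that flag goes on the bootstrap.sh call in the node user data; a rough sketch, with the cluster name as a placeholder and any other arguments you normally pass kept as-is:

#!/bin/bash
set -o xtrace
/etc/eks/bootstrap.sh <cluster-name> \
  --pause-container-account 940911992744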
After that I found another problem: the repository 940911992744.dkr.ecr.us-east-2.amazonaws.com/eks/pause-arm64 is absent in Region us-east-2 (where my cluster is located).
Looking at https://github.com/awslabs/amazon-eks-ami/blob/16bd0311c069f4b70a10205211b41845e59259d7/files/bootstrap.sh#L198 I realized that I needed to update the file /etc/systemd/system/kubelet.service.d/10-kubelet-args.conf on the node from:
[Service]
Environment='KUBELET_ARGS=--node-ip=xx.xx.xx.xx --pod-infra-container-image=940911992744.dkr.ecr.us-east-2.amazonaws.com/eks/pause-arm64:3.1'
to
[Service]
Environment='KUBELET_ARGS=--node-ip=xx.xx.xx.xx --pod-infra-container-image=940911992744.dkr.ecr.us-west-2.amazonaws.com/eks/pause-arm64:3.1'
systemctl daemon-reload
systemctl restart kubelet
For a quick fix I put this after bootstrap.sh in the user_data:
INTERNAL_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
cat <<EOF > /etc/systemd/system/kubelet.service.d/10-kubelet-args.conf
[Service]
Environment='KUBELET_ARGS=--node-ip=$INTERNAL_IP --pod-infra-container-image=940911992744.dkr.ecr.us-west-2.amazonaws.com/eks/pause-arm64:3.1'
EOF
systemctl daemon-reload
systemctl restart kubelet
I had the same issue as @pablomfc with a cluster in us-east-1. Applying the changes described fixed the issue for the aws-node-arm pods.
However, kube-proxy does not work, and just goes into a crash cycle, seemingly due to it not being compiled for ARM:
$ kubectl logs -f kube-proxy-glxk4
standard_init_linux.go:190: exec user process caused "exec format error"
The container image is being pulled from 602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/kube-proxy:v1.12.6. I've also tried changing the account to 940911992744, but the image does not exist there.
Anyone have any success getting kube-proxy to work?
I was able to get kube-proxy to work on the cluster by executing the following:
kubectl patch pod kube-proxy-glxk4 -p '{"spec":{"containers":[{"image":"k8s.gcr.io/kube-proxy:v1.12.6","name":"kube-proxy"}]}}'
This uses the kube-proxy image from gcr.io rather than Amazon's. The one from gcr.io has multiarch support, unlike AWS's.
This command only patches the specific pod, but you should be able to apply it to the entire DaemonSet if desired. I'm just not sure what modifications AWS has made to the kube-proxy image, so it might not be entirely safe to do that.
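A sketch of what the equivalent DaemonSet-wide patch could look like, with the same caveat that I haven't verified it is safe to swap the image out cluster-wide:

kubectl -n kube-system patch daemonset kube-proxy \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"kube-proxy","image":"k8s.gcr.io/kube-proxy:v1.12.6"}]}}}}'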
@scraton both CoreDNS and kube-proxy default to the amd64 versions in a default EKS cluster, so they will both need to be patched. We build our container images from the upstream codebase without local modifications, so you should not be missing anything. I'm in the process of publishing updated container images for ARM as well as doc updates for the beta that should resolve this issue. I'll post back here once those are live.
Note that because ECR does not yet support multi-architecture images, I'll be posting them to repos with a -arm64 suffix, much the way we distribute the CNI image.
@left4taco Hi, I've encountered the same issue as you. Did you resolve it?
I checked ipamd.log on the Arm64 node, and it reports "Failed to create client: error communicating with apiserver: Get https://10.100.0.1:443/version?timeout=32s".
The amd64 node works fine (it can connect to the API server), but the Arm64 node cannot. Both nodes are in the same VPC (and the same security group) but not in the same subnet (Availability Zone).
Any ideas? Thanks!
@zhouziyang
I just gave it a try yesterday and it still didn't work out.
This time I can see the node became Ready, but the aws-node pod running on the aarch64 machine keeps crashing. The CNI plugin cannot assign a secondary IP to the machine.
I made a DNAT rule for the API server (DNAT to its public IP), and the aws-node-arm pods worked (but it introduced network issues across nodes). I even built a CNI 1.5.0 Docker image, but it still didn't work out. I think (as you said) the root cause is related to the network connection between the VPC and the k8s cluster. Still investigating. @tabern any idea about this issue, or maybe an incorrect config during EKS setup? Thanks!
@zhouziyang I don't get it. The aarch64 node has exactly the same VPC and security group as the x86_64 node, and it has the correct arch build of CNI 1.3.3. How come it cannot establish a connection to the control plane?
@left4taco It seems iptables rules were missing on the arm64 node. I restored the iptables rules from an x86_64 node, and it seems to have worked!
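For anyone trying the same workaround, a rough sketch using iptables-save/iptables-restore (run as root; not necessarily the exact commands used above, and it assumes the x86_64 node's rules are a sane baseline for the arm64 node):

# on a healthy x86_64 node
iptables-save > /tmp/iptables-rules.txt
# copy the file to the arm64 node, then on the arm64 node:
iptables-restore < /tmp/iptables-rules.txt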
@tabern May I ask what the progress is on adding official support for ARM64 machines to EKS? Thanks. Can't wait to see this feature.
Hi @left4taco we're working on solving these issues and updating the preview so it doesn't take so much manual work to get going! Stay tuned.
Here are the issues I've identified from the conversation here; let me know if I'm missing any @left4taco @zhouziyang @scraton @pablomfc
Any news on getting an AMI for 1.13/1.14?
Hello
The aws-node-arm pods on my A1 instances are stuck in a cycle: Running -> Error -> CrashLoopBackOff
kubectl get pods -n kube-system
aws-node-arm-5rbt7 0/1 CrashLoopBackOff 147 13h
aws-node-arm-pwtv4 0/1 CrashLoopBackOff 146 13h
aws-node-arm-t2pvv 1/1 Running 147 13h
Is this a known issue? I followed the current instructions exactly.
Hi,
Great to see Arm support on AWS.
I've just followed the guide for using A1 instances with EKS and have come across an issue when deploying the Redis example.
The pods get stuck on creation and kubectl events show this:
Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "b07c2b2d3ed1ecb8f554a8ec0da7081b570f055e7b8047b4a0da231cbda35dc9" network for pod "redis-slave-kxdps": NetworkPlugin cni failed to set up pod "redis-slave-kxdps_default" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "b07c2b2d3ed1ecb8f554a8ec0da7081b570f055e7b8047b4a0da231cbda35dc9" network for pod "redis-slave-kxdps": NetworkPlugin cni failed to teardown pod "redis-slave-kxdps_default" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"
Colleagues have had similar issues. Has this been tested recently and confirmed to still be working?
Thanks
Hey everyone, a few updates here.
We just updated the ARM preview to include support for the AMI SSM parameters and the new M6g Graviton instances! Check it out
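For example, the Arm AMI ID should now be resolvable via SSM; a sketch assuming the standard EKS-optimized AMI parameter path (adjust the Kubernetes version and region to match your cluster):

aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.15/amazon-linux-2-arm64/recommended/image_id \
  --region us-west-2 \
  --query 'Parameter.Value' --output text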
We have a new issue template and labels specifically for the ARM preview. PLEASE use this template to create any additional issues for questions or bugs so we can track them and mark them resolved. We'll continue to keep this issue open to track our GA deliverable.
-Nate
Can we also use Bottlerocket OS with ARM? https://aws.amazon.com/jp/about-aws/whats-new/2020/03/announcing-bottlerocket-a-new-open-source-linux-based-operating-system-optimized-to-run-containers/
@rverma-jm Not yet; we're tracking that work in https://github.com/bottlerocket-os/bottlerocket/issues/468.
Is 1.15 supported right now? I see official AMIs available, but the docs still mention that 1.15 is not supported: https://docs.aws.amazon.com/eks/latest/userguide/arm-support.html
I'm going to try anyways :D
It actually worked. I used the 1.15 YAMLs where available (only kube-proxy for now), and 1.14 for everything else: https://github.com/aws/containers-roadmap/tree/master/preview-programs/eks-arm-preview
M6g EC2 instances powered by Graviton2 processors went GA today - it would be useful to publish EKS-optimized AMIs for Arm alongside the AMIs for x86-64. These offer significant cost-optimization opportunities for AWS customers.
@otterley today the ARM preview supports M6g instances! Check it out
We are also working to make this support generally available to all customers and will update this ticket when we launch.
@tabern Thanks for the pointer. Instructions for EKS 1.15/1.16 appear to be missing - is this because the components aren't available (I know the AMI is available), or do the details just need to be updated?
Are mixed x86_64 and arm64 clusters supported?
As ECR now supports manifest lists (#505), it would be awesome to push the EKS Helper Images as multi-arch images. Then mixed clusters would be super easy.
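For example, a multi-arch tag could be assembled roughly like this once the per-architecture images are pushed (the repository URI and tags are placeholders, and docker manifest requires the experimental CLI to be enabled):

REPO=<account>.dkr.ecr.<region>.amazonaws.com/my-image
docker manifest create $REPO:v1 $REPO:v1-amd64 $REPO:v1-arm64
docker manifest push $REPO:v1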
Thanks for bringing support for ARM in EKS. This works well. I am now trying to add Container Insights to monitor an ARM cluster. I followed the instructions in https://docs.aws.amazon.com/eks/latest/userguide/arm-support.html to create a new cluster, and then followed https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-EKS-quickstart.html to try and set up Container Insights.
Unfortunately, the CloudWatch Agent and the FluentD agent pods don't seem to be starting up - they get into a CrashLoopBackoff. The events in the pod seem to indicate that the image being used is amazon/cloudwatch-agent:1.231221.0. At least on Docker Hub, this seems to be compiled for x64 and not ARM. Any chance there's an image for this and Fluentd for ARM architecture?
Thanks!
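(As an aside, one quick way to see whether a published tag offers an arm64 variant is docker manifest inspect, which lists the platforms when the tag is a manifest list; a single-arch tag will show no platform entries. For example:)

docker manifest inspect amazon/cloudwatch-agent:1.231221.0 | grep -A2 '"platform"'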
@mikestef9 Is there a ticket tracking mixed support of x86_64 and arm64?
@ckdarby this is that ticket – as part of GA support for ARM, we are leveraging the recently launched ECR multi-arch feature to allow for heterogeneous clusters.
@mikestef9 I should have been more specific: mixed managed node groups.
@ckdarby We will be launching Arm support for managed node groups as part of the GA launch. Node groups will still be a single instance type though, so you can have heterogeneous clusters with multiple node groups. Are you asking for a single managed node group with multiple instance types?
@mikestef9 Thanks for the update, excited to see GA come :)
Are you asking for a single managed node group with multiple instance types?
Nope, just multiple managed groups with some ARM and some not.
So while we are waiting for official support, at the time of writing it is possible to create a mixed-mode EKS cluster, where some nodegroups run the amd64 architecture and some run arm64.
It's pretty straightforward to do by broadly following the AWS guide, with the exception of the CoreDNS, kube-proxy and aws-node manifests in the Enable ARM support section: apply those first, alongside (rather than in place of) the existing amd64 deployments/daemonsets in kube-system, under architecture-specific names, e.g. a kube-proxy-arm64 manifest:
---
kind: DaemonSet
apiVersion: apps/v1
metadata:
  labels:
    k8s-app: kube-proxy-arm64
    eks.amazonaws.com/component: kube-proxy
  name: kube-proxy-arm64
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: kube-proxy-arm64
...
Making these changes will ensure that when the Graviton arm64 nodes join the EKS cluster, they will have the correct-architecture containers deployed to them and become Ready.
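For completeness, this kind of split usually relies on an architecture nodeSelector (or node affinity) in each DaemonSet's pod template; a minimal sketch assuming the beta.kubernetes.io/arch label used on these Kubernetes versions (illustrative only, not the exact spec elided from the manifest above):

  template:
    spec:
      nodeSelector:
        beta.kubernetes.io/arch: arm64    # the matching amd64 DaemonSet would use: amd64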
Amazon EKS support for Arm-based instances is now generally available! See the launch blog and EKS documentation for more details.
Notable updates with general availability include:
I upgraded to EKS 1.17 to leverage the support for ARM architecture (r6g.large instances). The instance joins the cluster, but when I run "kubectl get nodes", all instances have a clear status except the ARM instance, which comes up with "Unknown" status and "Unknown" name.
I'm trying to add an ARM nodegroup to an existing EKS cluster that has a non-ARM node group. After creating it, though, I get this error:
I tried this by creating unmanaged nodegroups using eksctl and still faced the same error. The ARM node is in a NotReady state.
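(For triage, the node conditions and the kube-system pods scheduled onto that node are usually the first things to check; for example, with the node name as a placeholder:)

kubectl describe node <arm-node-name> | grep -A8 Conditions
kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=<arm-node-name>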
Amazon EKS now supports Arm processor EC2 A1 instances as a developer preview. You can now run containers using EC2 A1 instances on a Kubernetes cluster that is managed by Amazon EKS.
Learn more and get started here: https://github.com/aws/containers-roadmap/tree/master/preview-programs/
Learn more about Amazon A1 instances: https://aws.amazon.com/ec2/instance-types/a1/
Please leave feedback and comments on the preview using this ticket.