aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[EKS][Fargate] [request]: Transparent EKS fargate management #807

Open khacminh opened 4 years ago

khacminh commented 4 years ago


Tell us about your request

Transparent management for EKS Fargate

Which service(s) is this request for?

EKS Fargate

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Detailed documentation

It would be great if AWS could provide detailed documentation for EKS Fargate, in the same way that Kubernetes documents its resources.

No Fargate logs

Lately, I had an uncomfortable experience debugging Fargate issues after updating my cluster to version 1.15. The only thing documented about Fargate pods that cannot be scheduled is:

Pods which do not match a Fargate profile may be stuck as Pending. If a matching Fargate profile exists, you can delete pending pods that you have created to reschedule them onto Fargate

In my case, there was no issue with the Fargate profile, but no useful log or error could be found in CloudWatch for troubleshooting.

No resource visualization

There is no Fargate history and no metrics available in either the console or CloudWatch. Metrics like Lambda's would be really helpful, and with CloudWatch integration, alarms are needed as well.
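To illustrate the visibility gap, these are roughly the only diagnostics available from the client side today (pod/namespace names here are the hypothetical ones from this thread, and the control-plane log group name assumes logging is enabled on a cluster called my-cluster):

```shell
# Hypothetical pod/namespace names for illustration.
POD=test-fargate-app-6b4fb446c6-rw2pj
NS=mynamespace

# Inspect the pod's Events section -- on Fargate scheduling failures
# this is often empty, which is exactly the visibility gap described above.
kubectl -n "$NS" describe pod "$POD"

# Cluster-wide events sometimes carry more detail than the pod itself.
kubectl -n "$NS" get events --sort-by=.metadata.creationTimestamp

# The EKS control-plane logs (only if enabled on the cluster) are the
# closest thing to a server-side trace; the log group follows the
# /aws/eks/<cluster>/cluster naming convention.
aws logs describe-log-streams \
  --log-group-name /aws/eks/my-cluster/cluster \
  --order-by LastEventTime --descending --max-items 5
```

None of these surfaces Fargate's own scheduling decisions, which is the point of this request.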

nicholasgcoles commented 4 years ago

@khacminh what was the result when you described your pods? kubectl describe pod ...

khacminh commented 4 years ago

Hi @nicholasgcoles,

Here is the result of running kubectl describe pod ...

# kubectl describe pod test-fargate-app-6b4fb446c6-rw2pj

Name:               test-fargate-app-6b4fb446c6-rw2pj
Namespace:          mynamespace
Priority:           2000001000
PriorityClassName:  system-node-critical
Node:               <none>
Labels:             app=test-fargate-app
                    deploy-env=fargate
                    eks.amazonaws.com/fargate-profile=my-fargate-profile
                    pod-template-hash=6b4fb446c6
Annotations:        kubernetes.io/psp: eks.privileged
Status:             Pending
IP:
Controlled By:      ReplicaSet/test-fargate-app-6b4fb446c6
NominatedNodeName:  c62032c57f-3bf20f49a6d644568d59216432b8f221
Containers:
  test-fargate-app:
    Image:      xxxxxxxxxxxx.dkr.ecr.us-east-2.amazonaws.com/simple-app:0.1.0
    Port:       80/TCP
    Host Port:  0/TCP
    Args:
      /bin/sh
      -c
      echo hello;sleep 3600
    Limits:
      cpu:     250m
      memory:  512Mi
    Requests:
      cpu:        250m
      memory:     512Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-bpxsq (ro)
Volumes:
  default-token-bpxsq:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-bpxsq
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>

And this is the result of kubectl get pod ... -o yaml

----
# kubectl get pod test-fargate-app-6b4fb446c6-rw2pj -o yaml

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: eks.privileged
  creationTimestamp: "2020-03-27T01:25:05Z"
  generateName: test-fargate-app-6b4fb446c6-
  labels:
    app: test-fargate-app
    deploy-env: fargate
    eks.amazonaws.com/fargate-profile: my-fargate-profile
    pod-template-hash: 6b4fb446c6
  name: test-fargate-app-6b4fb446c6-rw2pj
  namespace: mynamespace
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: test-fargate-app-6b4fb446c6
    uid: a074c6a5-9753-4bf0-b169-545c58863aae
  resourceVersion: "68357633"
  selfLink: /api/v1/namespaces/mynamespace/pods/test-fargate-app-6b4fb446c6-rw2pj
  uid: 92d5485e-da18-4581-a680-c56c74b148ab
spec:
  containers:
  - args:
    - /bin/sh
    - -c
    - echo hello;sleep 3600
    image: xxxxxxxxxxxx.dkr.ecr.us-east-2.amazonaws.com/simple-app:0.1.0
    imagePullPolicy: IfNotPresent
    name: test-fargate-app
    ports:
    - containerPort: 80
      protocol: TCP
    resources:
      limits:
        cpu: 250m
        memory: 512Mi
      requests:
        cpu: 250m
        memory: 512Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-bpxsq
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  priority: 2000001000
  priorityClassName: system-node-critical
  restartPolicy: Always
  schedulerName: fargate-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: default-token-bpxsq
    secret:
      defaultMode: 420
      secretName: default-token-bpxsq
status:
  nominatedNodeName: c62032c57f-3bf20f49a6d644568d59216432b8f221
  phase: Pending
  qosClass: Guaranteed

EKS automatically adds the label eks.amazonaws.com/fargate-profile: my-fargate-profile to my pod and sets schedulerName: fargate-scheduler
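Since the pod carries the profile label but still sits in Pending, one way to double-check the matching side is to inspect the profile itself. A sketch, assuming the cluster and profile names used in this thread:

```shell
# Hypothetical cluster/profile names for illustration.
CLUSTER=my-cluster
PROFILE=my-fargate-profile

# Confirm the profile's status and its namespace/label selectors --
# a pod is only scheduled onto Fargate when its namespace and all
# selector labels match one of these selectors.
aws eks describe-fargate-profile \
  --cluster-name "$CLUSTER" \
  --fargate-profile-name "$PROFILE" \
  --query 'fargateProfile.{status:status,selectors:selectors,role:podExecutionRoleArn}'
```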

nicholasgcoles commented 4 years ago

Did you by any chance happen to change your aws-auth configmap?

I had an issue where I updated mine and some behind-the-scenes role bindings broke, which no longer allowed my pods to be scheduled

khacminh commented 4 years ago

That was the first thing I checked; it is not mentioned in the AWS documentation. While maintaining my cluster, I found out that EKS adds a new role mapping to aws-auth after a Fargate profile is created, to allow the Fargate pods to connect to the cluster.

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - groups:
      - system:bootstrappers
      - system:nodes
      rolearn: arn:aws:iam::xxxxxxxxxxxx:role/my-worker-node-role
      username: system:node:{{EC2PrivateDNSName}}
    - groups:
      - system:bootstrappers
      - system:nodes
      - system:node-proxier
      rolearn: arn:aws:iam::xxxxxxxxxxxx:role/eks-fargate-pods
      username: system:node:{{SessionName}}
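The live mapping can be compared against the expected one above with plain kubectl (the role name here just mirrors the example ConfigMap):

```shell
# Print the live aws-auth ConfigMap and check that the Fargate
# pod-execution role mapping shown above is present.
kubectl -n kube-system get configmap aws-auth -o yaml

# Or extract just the mapRoles entries:
kubectl -n kube-system get configmap aws-auth \
  -o jsonpath='{.data.mapRoles}'
```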

nicholasgcoles commented 4 years ago

@khacminh this is true. I found that there was still something else going on in the background besides the additional role mappings.

So even when I ran kubectl apply -f aws-auth.yaml with the "correct" role mappings, it still broke because of some immutable dependency (I had to open an AWS support case for them to tell me this).

Obviously not ideal, but if you remove and then re-add your fargate profile does it work?
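The remove-and-re-add cycle can be scripted roughly like this; names, subnets, and the role ARN are placeholders that would have to match your setup, and the wait subcommand assumes a recent AWS CLI with the EKS Fargate waiters:

```shell
# Hypothetical names; subnets and the role ARN must match your cluster.
CLUSTER=my-cluster
PROFILE=my-fargate-profile

aws eks delete-fargate-profile \
  --cluster-name "$CLUSTER" --fargate-profile-name "$PROFILE"

# Deletion is asynchronous; wait for it to finish before re-creating,
# otherwise the create call fails with a conflict.
aws eks wait fargate-profile-deleted \
  --cluster-name "$CLUSTER" --fargate-profile-name "$PROFILE"

aws eks create-fargate-profile \
  --cluster-name "$CLUSTER" --fargate-profile-name "$PROFILE" \
  --pod-execution-role-arn arn:aws:iam::111111111111:role/eks-fargate-pods \
  --subnets subnet-aaaa subnet-bbbb \
  --selectors namespace=mynamespace
```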

khacminh commented 4 years ago

@nicholasgcoles I did that before and tried again today, but the issue is still there.

mikestef9 commented 4 years ago

Hi @khacminh we are tracking Fargate visibility in the console as part of #640. Have you opened a support case for the issue you are facing?

khacminh commented 4 years ago

Hi @mikestef9,


About my problem: I found out that the Fargate nodes could not resolve the DNS name of STS in my region, which is why they could not join my cluster. When I checked my VPC, its configuration had not been updated for months, and I had created an STS interface endpoint in the VPC before EKS Fargate was released. A few minutes after deleting that STS interface endpoint, the Fargate nodes could join the cluster.
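For anyone hitting the same thing, the stale endpoint can be located and removed like this (VPC ID and region are placeholders for illustration):

```shell
# Hypothetical VPC ID and region for illustration.
VPC_ID=vpc-0123456789abcdef0
REGION=us-east-2

# List any STS interface endpoints in the VPC -- one created before
# EKS Fargate launched was the culprit in this thread.
aws ec2 describe-vpc-endpoints --region "$REGION" \
  --filters Name=vpc-id,Values="$VPC_ID" \
            Name=service-name,Values=com.amazonaws."$REGION".sts \
  --query 'VpcEndpoints[].{id:VpcEndpointId,state:State,dns:PrivateDnsEnabled}'

# Then remove the stale endpoint so Fargate nodes can reach STS:
# aws ec2 delete-vpc-endpoints --region "$REGION" --vpc-endpoint-ids vpce-xxxx
```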

tommy-couzens commented 4 years ago

We are having the same issue with the STS endpoint when deploying pods to fargate using EKS.

When adding an STS VPC endpoint, pods get stuck in Pending and no Fargate nodes can join the cluster.

khacminh commented 4 years ago

@tommy-couzens it took me days of troubleshooting because there are no Fargate logs

tpmurthy commented 4 years ago

@tommy-couzens it took me days of troubleshooting because there are no Fargate logs

@khacminh How did you troubleshoot your issue? My pods are in 'Pending' status. I do not seem to have an STS Endpoint interface. Could you provide any tips to troubleshoot this issue?

khacminh commented 4 years ago

@tpmurthy

I discovered the issue while trying to create a new cluster. I have an ops machine that my team uses to run kubectl commands. I gave it enough permissions to run eksctl and create a new cluster with the same configuration (VPC, subnets, IAM, ...). eksctl then threw errors that prevented it from creating the cluster, and I traced those errors to find the root cause.