aws / aws-node-termination-handler

Gracefully handle EC2 instance shutdown within Kubernetes
https://aws.amazon.com/ec2
Apache License 2.0

Container stuck in CrashLoopBackOff when deployed in ap-southeast-5 #1069

Closed ridzuan5757 closed 2 weeks ago

ridzuan5757 commented 1 month ago

Describe the bug: aws-node-termination-handler is stuck in CrashLoopBackOff when deployed in the AWS Malaysia region (ap-southeast-5).

Steps to reproduce: The Kubernetes cluster is deployed with Kops using the following commands:

kops create cluster --node-count 3 --control-plane-count 3 --control-plane-size t3.medium --node-size t3.medium --control-plane-zones ap-southeast-5a --zones ap-southeast-5a,ap-southeast-5b,ap-southeast-5c

kops update cluster --yes --admin

Expected outcome: Containers run normally, as they do when deployed in other regions.

Application Logs: This is the output from kubectl describe pod:

Name:                 aws-node-termination-handler-7d56b6d497-5qp92
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      aws-node-termination-handler
Node:                 i-06965db02543d103c/172.20.1.196
Start Time:           Sun, 15 Sep 2024 04:48:46 +0800
Labels:               app.kubernetes.io/component=deployment
                      app.kubernetes.io/instance=aws-node-termination-handler
                      app.kubernetes.io/name=aws-node-termination-handler
                      k8s-app=aws-node-termination-handler
                      kops.k8s.io/managed-by=kops
                      kops.k8s.io/nth-mode=sqs
                      kubernetes.io/os=linux
                      pod-template-hash=7d56b6d497
Annotations:          <none>
Status:               Running
IP:                   172.20.1.196
IPs:
  IP:           172.20.1.196
Controlled By:  ReplicaSet/aws-node-termination-handler-7d56b6d497
Containers:
  aws-node-termination-handler:
    Container ID:   containerd://f80173b633fd5d2d1fc1cf30efdd959b82443b3fadd567439c1bdc98940b16e0
    Image:          public.ecr.aws/aws-ec2/aws-node-termination-handler:v1.22.0@sha256:fc4b883511887a535a48c8918735be97c15f8f67c66e6aca869ec051091df6a5
    Image ID:       public.ecr.aws/aws-ec2/aws-node-termination-handler@sha256:fc4b883511887a535a48c8918735be97c15f8f67c66e6aca869ec051091df6a5
    Ports:          8080/TCP, 9092/TCP
    Host Ports:     8080/TCP, 9092/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sun, 15 Sep 2024 04:52:08 +0800
      Finished:     Sun, 15 Sep 2024 04:52:08 +0800
    Ready:          False
    Restart Count:  5
    Requests:
      cpu:     50m
      memory:  64Mi
    Liveness:  http-get http://:8080/healthz delay=5s timeout=1s period=5s #success=1 #failure=3
    Environment:
      NODE_NAME:                                 (v1:spec.nodeName)
      POD_NAME:                                 aws-node-termination-handler-7d56b6d497-5qp92 (v1:metadata.name)
      NAMESPACE:                                kube-system (v1:metadata.namespace)
      ENABLE_PROBES_SERVER:                     true
      PROBES_SERVER_PORT:                       8080
      PROBES_SERVER_ENDPOINT:                   /healthz
      LOG_LEVEL:                                info
      JSON_LOGGING:                             true
      LOG_FORMAT_VERSION:                       2
      ENABLE_PROMETHEUS_SERVER:                 false
      PROMETHEUS_SERVER_PORT:                   9092
      CHECK_TAG_BEFORE_DRAINING:                true
      MANAGED_TAG:                              aws-node-termination-handler/managed
      USE_PROVIDER_ID:                          true
      DRY_RUN:                                  false
      CORDON_ONLY:                              false
      TAINT_NODE:                               false
      EXCLUDE_FROM_LOAD_BALANCERS:              true
      DELETE_LOCAL_DATA:                        true
      IGNORE_DAEMON_SETS:                       true
      POD_TERMINATION_GRACE_PERIOD:             -1
      NODE_TERMINATION_GRACE_PERIOD:            120
      EMIT_KUBERNETES_EVENTS:                   true
      COMPLETE_LIFECYCLE_ACTION_DELAY_SECONDS:  -1
      ENABLE_SQS_TERMINATION_DRAINING:          true
      QUEUE_URL:                                https://sqs.ap-southeast-5.amazonaws.com/715841329405/monitoring-shell-ronpos-com-nth
      DELETE_SQS_MSG_IF_NODE_NOT_FOUND:         false
      WORKERS:                                  10
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-45qzm (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  kube-api-access-45qzm:
    Type:                     Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:   3607
    ConfigMapName:            kube-root-ca.crt
    ConfigMapOptional:        <nil>
    DownwardAPI:              true
QoS Class:                    Burstable
Node-Selectors:               <none>
Tolerations:                  node-role.kubernetes.io/control-plane op=Exists
                              node-role.kubernetes.io/master op=Exists
                              node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints:  kubernetes.io/hostname:DoNotSchedule when max skew 1 is exceeded for selector app.kubernetes.io/instance=aws-node-termination-handler,app.kubernetes.io/name=aws-node-termination-handler,kops.k8s.io/nth-mode=sqs
                              topology.kubernetes.io/zone:ScheduleAnyway when max skew 1 is exceeded for selector app.kubernetes.io/instance=aws-node-termination-handler,app.kubernetes.io/name=aws-node-termination-handler,kops.k8s.io/nth-mode=sqs
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  4m6s                  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
  Normal   Scheduled         3m27s                 default-scheduler  Successfully assigned kube-system/aws-node-termination-handler-7d56b6d497-5qp92 to i-06965db02543d103c
  Normal   Pulling           3m27s                 kubelet            Pulling image "public.ecr.aws/aws-ec2/aws-node-termination-handler:v1.22.0@sha256:fc4b883511887a535a48c8918735be97c15f8f67c66e6aca869ec051091df6a5"
  Normal   Pulled            3m9s                  kubelet            Successfully pulled image "public.ecr.aws/aws-ec2/aws-node-termination-handler:v1.22.0@sha256:fc4b883511887a535a48c8918735be97c15f8f67c66e6aca869ec051091df6a5" in 16.811s (17.697s including waiting). Image size: 16516861 bytes.
  Normal   Started           2m20s (x4 over 3m9s)  kubelet            Started container aws-node-termination-handler
  Warning  BackOff           108s (x10 over 3m7s)  kubelet            Back-off restarting failed container aws-node-termination-handler in pod aws-node-termination-handler-7d56b6d497-5qp92_kube-system(0441621b-8f9a-45ca-9d22-4209fd83d2b8)
  Normal   Created           96s (x5 over 3m9s)    kubelet            Created container aws-node-termination-handler
  Normal   Pulled            96s (x4 over 3m8s)    kubelet            Container image "public.ecr.aws/aws-ec2/aws-node-termination-handler:v1.22.0@sha256:fc4b883511887a535a48c8918735be97c15f8f67c66e6aca869ec051091df6a5" already present on machine

This is the output from kubectl logs:

{"level":"info","time":"2024-09-14T20:52:08Z","message":"Using log format version 2"}
{"level":"info","dry_run":false,"node_name":"i-06965db02543d103c","pod_name":"aws-node-termination-handler-7d56b6d497-5qp92","pod_namespace":"kube-system","metadata_url":"http://169.254.169.254","kubernetes_service_host":"100.64.0.1","kubernetes_service_port":"443","delete_local_data":true,"ignore_daemon_sets":true,"pod_termination_grace_period":-1,"node_termination_grace_period":120,"enable_scheduled_event_draining":true,"enable_spot_interruption_draining":true,"enable_sqs_termination_draining":true,"delete_sqs_msg_if_node_not_found":false,"enable_rebalance_monitoring":false,"enable_rebalance_draining":false,"metadata_tries":3,"cordon_only":false,"taint_node":false,"taint_effect":"NoSchedule","exclude_from_load_balancers":true,"json_logging":true,"log_level":"info","webhook_proxy":"","uptime_from_file":"","enable_prometheus_server":false,"prometheus_server_port":9092,"emit_kubernetes_events":true,"kubernetes_events_extra_annotations":"","aws_region":"","aws_endpoint":"","queue_url":"https://sqs.ap-southeast-5.amazonaws.com/715841329405/monitoring-shell-ronpos-com-nth","check_tag_before_draining":true,"ManagedTag":"aws-node-termination-handler/managed","use_provider_id":true,"time":"2024-09-14T20:52:08Z","message":"aws-node-termination-handler arguments"}
{"level":"fatal","time":"2024-09-14T20:52:08Z","message":"Unable to find the AWS region to process queue events."}


Lu-David commented 1 month ago

Hi @ridzuan5757, thank you for raising this issue. Unfortunately, NTH v1 does not support ap-southeast-5. We have a separate, unreleased NTH v2 branch that you can try; it should work in that region.

Lu-David commented 1 month ago

Starting a thread here on what the future fix would be: we would need to update aws-sdk-go to v2, which has the most up-to-date region information. The region information provided by aws-sdk-go v1 is outdated (hence the NTH failure in ap-southeast-5). We are unsure when we will be able to get to this issue, as our team has limited bandwidth, but we welcome contributions if anyone is interested in starting on a fix.
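
For context, the sketch below shows roughly what constructing an SQS client with aws-sdk-go-v2 for this region looks like; in v2 the region is taken from the loaded configuration (environment variables, shared config, or an explicit option). This is illustrative only, not the actual NTH migration; the queue URL is the one from the report above.

// v2 sketch: build an SQS client with aws-sdk-go-v2 and poll the queue
// from this report. Illustrative only; not the actual NTH change.
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/sqs"
)

func main() {
	// The region could also come from AWS_REGION or shared config;
	// WithRegion just makes it explicit for this example.
	cfg, err := config.LoadDefaultConfig(context.TODO(),
		config.WithRegion("ap-southeast-5"),
	)
	if err != nil {
		log.Fatalf("loading AWS config: %v", err)
	}

	client := sqs.NewFromConfig(cfg)

	// Queue URL taken from the issue report above.
	queueURL := "https://sqs.ap-southeast-5.amazonaws.com/715841329405/monitoring-shell-ronpos-com-nth"
	out, err := client.ReceiveMessage(context.TODO(), &sqs.ReceiveMessageInput{
		QueueUrl:            &queueURL,
		MaxNumberOfMessages: 1,
	})
	if err != nil {
		log.Fatalf("receiving from queue: %v", err)
	}
	log.Printf("received %d message(s)", len(out.Messages))
}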

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

github-actions[bot] commented 2 weeks ago

This issue was closed because it has become stale with no activity.