aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[EKS] [managed node group drain pods due to AZRebalancing]: AZRebalancing is automatically applied, so cannot stop pods from draining in MNG. #1453

Open zeelpatel8 opened 2 years ago

zeelpatel8 commented 2 years ago


Tell us about your request Feature request: allow EKS managed node groups to only cordon (rather than drain) nodes on AZ Rebalance and EC2 Capacity Rebalance events.

Which service(s) is this request for? EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? What outcome are you trying to achieve, ultimately, and why is it hard/impossible to do right now? What is the impact of not having this problem solved? The more details you can provide, the better we'll be able to understand and solve the problem.

The main problem is the inability to prevent nodes from being drained on certain notifications (AZ Rebalance and EC2 Capacity Rebalance).

UseCase: "Gitlab Runner cost reduction while maximizing throughput" Gitlab spins up bare pods for each CICD job in its Kubernetes Executor (https://docs.gitlab.com/runner/executors/kubernetes.html). Since these are bare pods, these will no survive the draining of the node on which they are scheduled resulting in a failed job in Gitlab.

Since these jobs can be restarted if necessary, we are using Spot Instances for cost reduction. We want to optimize for throughput instead of maximum availability, so nodes should only be drained when it's absolutely necessary (e.g. a Spot termination notification). Otherwise we want to leave these pods running as long as possible.

Are you currently working around this issue? How are you currently solving this problem?

  1. Update the underlying ASG out of band to disable AZ rebalancing (see the CLI sketch after this list). This is not recommended, since the ASG is managed by EKS.
  2. Create ONE nodegroup per AZ and split the original nodegroup's capacity among those nodegroups. This results in THREE nodegroups whose ASGs each operate in a single AZ, which means the rebalancing never happens.
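
A minimal sketch of the out-of-band calls behind workaround 1 (placeholder ASG name; the ASG behind a managed node group has an EKS-generated name, so look it up first):

aws autoscaling suspend-processes \
  --auto-scaling-group-name <nodegroup-asg-name> \
  --scaling-processes AZRebalance

aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name <nodegroup-asg-name> \
  --no-capacity-rebalance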

Additional context With EKS managed node groups we can't control this behavior the way we can with the aws-node-termination-handler (https://github.com/aws/aws-node-termination-handler/tree/main/config/helm/aws-node-termination-handler) via its enableRebalanceDraining option, resulting in many unnecessarily drained nodes and failed GitLab jobs. It would be nice to have this option in EKS managed node groups.
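
For comparison, a sketch of how that knob is set on self-managed nodes via the termination handler's Helm chart (chart location assumed to be the eks-charts repository; check the chart's documentation for exact value semantics):

helm repo add eks https://aws.github.io/eks-charts
helm upgrade --install aws-node-termination-handler eks/aws-node-termination-handler \
  --namespace kube-system \
  --set enableRebalanceDraining=false   # when false, NTH does not drain on a rebalance recommendation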

Attachments If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)

theintz commented 2 years ago

This is affecting us as well (and many others using long-lived deployments on managed node groups I suppose). Kinda sad to see no comments and no reactions here. Our setup is entirely based in Terraform, so I can see 2 solutions (which are essentially the ones that @zeelpatel8 proposed):

  1. Run the AWS CLI in a local-exec provisioner that makes the call to disable the rebalancing.
  2. Set up the ASGs with single AZs in the first place. This is difficult to do when they already exist, as they might need to be recreated.

It would be awesome to have a better way of achieving this.

mamoit commented 1 year ago

We're hitting this exact same issue with the exact same use case as @zeelpatel8. The GitLab executor spins up standalone pods that are completely ignored by the capacity rebalancer and by the AZ rebalancer. Our current "solution" is to have one nodegroup per AZ and manually disable capacity rebalance on the underlying ASG after the nodegroup is created. Since we're using Terraform for the creation of our infra, this becomes a really troublesome manual operation that is prone to errors.

mamoit commented 1 year ago

To whomever may find this useful: we worked around the capacity rebalance limitation in Terraform using a null_resource and a local-exec. It is far from pretty, but it's better than manually changing the ASGs every time there is a change that requires recreating a nodegroup.

The STS part was taken from this reply; you may not need it depending on how you're doing your auth.

resource "null_resource" "nodegroup_asg_" {
  count = length(aws_eks_node_group.main)

  provisioner "local-exec" {
    interpreter = ["/bin/sh", "-c"]
    environment = {
      AWS_DEFAULT_REGION = data.aws_region.current.name
    }
    command = <<EOF
set -e

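# Assume the caller's role and export the resulting temporary credentials (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN) for the CLI call below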
$(aws sts assume-role --role-arn "${data.aws_iam_session_context.current.issuer_arn}" --role-session-name terraform_asg_no_cap_rebalance --query 'Credentials.[`export#AWS_ACCESS_KEY_ID=`,AccessKeyId,`#AWS_SECRET_ACCESS_KEY=`,SecretAccessKey,`#AWS_SESSION_TOKEN=`,SessionToken]' --output text | sed $'s/\t//g' | sed 's/#/ /g')

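# Disable Capacity Rebalancing on the ASG behind this managed node group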
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name ${aws_eks_node_group.main[count.index].resources[0].autoscaling_groups[0].name} \
  --no-capacity-rebalance
EOF
  }
}

abin-tiger commented 1 year ago

We were affected by the same issue. We had to create a support ticket to understand what was really going on.

bentlema commented 1 year ago

I'd like to see a feature to disable AZRebalance for EKS managed node groups as well. We ran into this with our jenkins-operator-managed Jenkins instance unexpectedly restarting at random times.

schniedergers commented 1 year ago

Here's what worked for me in Terraform (based off the earlier answer); it needs var.cluster_name and var.aws_region set:

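# Look up the ASGs behind this cluster's node groups via their cluster-autoscaler tags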
data "aws_autoscaling_groups" "this" {
  filter {
    name   = "tag:k8s.io/cluster-autoscaler/enabled"
    values = ["true"]
  }
  filter {
    name   = "tag:k8s.io/cluster-autoscaler/${var.cluster_name}"
    values = ["owned"]
  }
}

resource "null_resource" "nodegroup_asg_azbalance_disable" {
  for_each = toset(data.aws_autoscaling_groups.this.names)

  provisioner "local-exec" {
    interpreter = ["/bin/sh", "-c"]
    command     = <<EOF
set -e
aws autoscaling suspend-processes \
  --region ${var.aws_region} \
  --auto-scaling-group-name ${each.key} \
  --scaling-processes AZRebalance
EOF
  }
}
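
If the suspension ever needs to be undone, the reverse call is resume-processes with the same arguments, e.g.:

aws autoscaling resume-processes \
  --region <aws-region> \
  --auto-scaling-group-name <asg-name> \
  --scaling-processes AZRebalance
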
greenlaw commented 1 year ago

This is affecting my team as well, as we currently use managed node groups with autoscaling to run very bursty Job workloads several times per day requiring us to scale from 0 to 100 nodes and back again.

So far I've been unsuccessful in using any of the above workarounds. While I am able to turn off the associated Auto Scaling group's AZ Rebalance feature (it shows as Off), the setting appears to have no effect. We still see undesired rebalancing behavior in the ASG activity log, like this:

At 2023-05-31T20:45:29Z instances were launched to balance instances in zones us-west-2a us-west-2b with other zones resulting in more than desired number of instances in the group. At 2023-05-31T20:45:50Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 8 to 7. At 2023-05-31T20:45:50Z instance i-xxxxxxxxxxxxxx was selected for termination.

We are considering several options, including 1) moving to self-managed node groups, 2) creating two separate single-AZ managed node groups, or 3) evaluating Karpenter as an alternative/supplement to the Cluster Autoscaler.

It would be a lot easier if managed node groups just supported disabling this feature.

carlosjgp commented 1 year ago

Same!

We are using Cluster Autoscaler and the annotation safe-to-evict: false to allow a long-running job to complete, but AZ rebalancing is killing the node anyway.
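
For reference, the annotation in question is the pod-level Cluster Autoscaler annotation cluster-autoscaler.kubernetes.io/safe-to-evict; a minimal example of applying it to a running pod:

kubectl annotate pod <pod-name> cluster-autoscaler.kubernetes.io/safe-to-evict="false"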

pranchals commented 1 year ago

AZ rebalancing also raises false alarms in cases where the AZ with fewer nodes has insufficient capacity for the specified instance type. This causes the nodegroup status to appear degraded: although the workload has a sufficient number of nodes to schedule on, the nodegroup status shows as "degraded" due to AZRebalance activity.

It would be helpful if there were an option to disable AZ rebalancing for node groups, or for its effects to be confined to the Auto Scaling group's events so the nodegroup status is left unaltered by AZ rebalancing activities.

(As per my understanding, there is currently no way to disable AZRebalance for a managed node group from the AWS console/CLI/SDKs.)

Ruben-Sh commented 10 months ago

Hi team, this item has been open since July 2021, over two years, and many EKS users are experiencing this issue, as demonstrated by the comments above. Could you please assign someone to this item and outline a plan to correct it? While this item is outstanding, could the EKS team provide a workaround?

tylerpotts commented 10 months ago

I messaged the AWS technical account manager of our company and was told this is part of the official AWS containers roadmap and known by the internal EKS team. He does not have access to the timelines and can't say when this will be fixed.

In case anyone needs to specifically pass credentials, the below worked for me

resource "null_resource" "disable_AZRebalance_on_ASGs" {
  count = local.disable_AZRebalance == true ? length(module.eks.eks_managed_node_groups) : 0

  provisioner "local-exec" {
    interpreter = ["/bin/sh", "-c"]
    environment = {
      AWS_DEFAULT_REGION = var.region
    }
    # Note that I pipe any error messages to /dev/null and write a success/failure message to tmp
    # Otherwise errors will print out your private keys to the console
    command = <<EOF
set -e

export AWS_ACCESS_KEY_ID="${local.aws_access_key}"
export AWS_SECRET_ACCESS_KEY="${local.aws_secret_key}"
export AWS_SESSION_TOKEN="${local.aws_session_token}"

aws autoscaling suspend-processes \
  --auto-scaling-group-name ${module.eks.eks_managed_node_groups[count.index].node_group_autoscaling_group_names[0]} \
  --scaling-processes AZRebalance 2> /dev/null && echo "works" > /tmp/asg_failure${count.index} || echo "disableAZRebalance_on_ASGs failed" > /tmp/asg_failure${count.index}

EOF
  }
  # Need nodegroup names to exist before we can run above
  depends_on = [
    module.eks
  ]
  # Only runs when the nodegroup names change
  triggers = {
    value = module.eks.eks_managed_node_groups[count.index].node_group_autoscaling_group_names[0]
  }

  # Throws error if bash command fails
  lifecycle {
    postcondition {
      #Used base64 of the tmp file contents because the newlines were making it difficult to do comparisons
      condition = fileexists("/tmp/asg_failure${count.index}") ? filebase64("/tmp/asg_failure${count.index}") != "ZGlzYWJsZUFaUmViYWxhbmNlX29uX0FTR3MgZmFpbGVkCg==" : true

      error_message = "ASG bash command in null_resource.disable_AZRebalance_on_ASGs[${count.index}] failed. Output of command has been masked due to sensitive variables. Manually edit the null_resource in order to see the failure."
    }
  }
}

andrewhharmon commented 9 months ago

I see that in the first post --no-capacity-rebalance is being set, but in later posts suspend-processes is being called on the AZRebalance process. I'm not sure I understand the difference; is anyone familiar enough to provide some more details on the best way to prevent the ASG from rebalancing?

bobbywatson3 commented 9 months ago

We were struggling with this for weeks. It's a shame that this is still an issue, and it's also a shame that it seems very difficult to find documentation on this unexpected EKS + cluster-autoscaler interaction.

dinukarajapaksha commented 9 months ago

We are facing the same issue with our multi-zone EKS clusters. Will this be fixed if we use the balance-similar-node-groups=true flag in the cluster autoscaler configuration?

booleanbetrayal commented 9 months ago

Just wanted to chime in here and say that local-exec workarounds in Terraform are a painful way to work around an issue that fundamentally breaks Kubernetes clusters' ability to dictate eviction policies. I believe that until the EKS API supports setting suspended processes, AZRebalance should be disabled by default in an EKS hot-patch.

DBBrowne commented 5 months ago

In our case, AZ rebalancing was causing our k8s Job nodes to be removed partway through execution. Posting for other internet denizens finding this issue in their search.

We were able to match the ASG event

        {
            "ActivityId": "7f8a081a-d009-4fbb-bed4-57ab63504429",
            "AutoScalingGroupName": "<>",
            "Description": "Terminating EC2 instance: i-04792288ff245cba2",
            "Cause": "At 2024-01-13T06:14:28Z instances were launched to balance instances in zones  eu-west-2a eu-west-2b with other zones resulting in more than desired number of instances in the group.  At 2024-01-13T06:14:57Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 13 to 12.  At 2024-01-13T06:14:57Z instance i-04792288ff245cba2 was selected for termination.",
            "StartTime": "2024-01-13T06:14:57.600000+00:00",
            "EndTime": "2024-01-13T06:17:03+00:00",
            "StatusCode": "Successful",
            "Progress": 100,
            "Details": "{\"Subnet ID\":\"<>",\"Availability Zone\":\"eu-west-2a\"}",
            "AutoScalingGroupARN": <>
        },

To our nodes in CA with some questionable Grafana/Prometheus usage. By selecting on either the node property or the provider_id, we were able to match:

<some_node_metric>{...instance="ip-10-1-12-255.eu-west-2.compute.internal", ...provider_id="aws:///eu-west-2a/i-04792288ff245cba2"...}

switching off the AZ rebalancing with:

aws autoscaling suspend-processes \
   --scaling-processes AZRebalance --auto-scaling-group-name <>

appears to have resolved this for us. I'll report back if suspending AZ rebalancing turns out to be insufficient for us.
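
To confirm the process is actually suspended on a given ASG, one way (a sketch, same placeholder group name) is:

aws autoscaling describe-auto-scaling-groups \
   --auto-scaling-group-names <> \
   --query 'AutoScalingGroups[].SuspendedProcesses'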

Related issue: https://github.com/kubernetes/autoscaler/issues/6107#issuecomment-1900901699

fcuello-fudo commented 3 months ago

I see that in the first post --no-capacity-rebalance is being set, but in later posts suspend-processes is being called on the AZRebalance process. I'm not sure I understand the difference; is anyone familiar enough to provide some more details on the best way to prevent the ASG from rebalancing?

no-capacity-rebalance:

--capacity-rebalance | --no-capacity-rebalance (boolean)
          Enables or disables Capacity Rebalancing. For more information, see
          Use  Capacity Rebalancing to handle Amazon EC2 Spot Interruptions in
          the Amazon EC2 Auto Scaling User Guide

And from https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-capacity-rebalancing.html:

"Capacity Rebalancing helps you maintain workload availability by proactively augmenting your fleet with a new Spot Instance before a running instance is interrupted by Amazon EC2. "

AZRebalance OTOH,

AZRebalance – Balances the number of EC2 instances in the group evenly across all of the specified
Availability Zones when the group becomes unbalanced
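
So the two are independent settings: --no-capacity-rebalance turns off the Spot-driven Capacity Rebalancing feature, while suspending the AZRebalance process stops the cross-AZ balancing quoted above. Whether Capacity Rebalancing is currently enabled on an ASG can be checked with (a sketch, placeholder name as above):

aws autoscaling describe-auto-scaling-groups \
   --auto-scaling-group-names <> \
   --query 'AutoScalingGroups[].CapacityRebalance'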