aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[EKS] [request]: Rolling update to change instance type for Managed Nodes #746

Closed badaldavda closed 4 years ago

badaldavda commented 4 years ago


Tell us about your request What do you want us to build? Currently, when we edit a node group via update-nodegroup-config, we can only update the scaling configuration: https://docs.aws.amazon.com/cli/latest/reference/eks/update-nodegroup-config.html
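As a point of reference, a minimal sketch of the in-place change that update-nodegroup-config supports today; the cluster and node group names are placeholders:

```sh
# The scaling configuration can be updated in place on a managed node group,
# but there is no option here to switch the instance type.
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --scaling-config minSize=2,maxSize=6,desiredSize=3
```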

Can we also change the instance type for managed nodes and have a rolling update?

Which service(s) is this request for? EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? Currently, to update a node group we need to create a new node group and then delete the older one. But in some cases, changing the instance type of the same node group would be needed, in the same way we update the node group version to the latest release.

Are you currently working around this issue? Currently, to update a node group we create a new node group and then delete the older one.
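A rough sketch of that create-then-delete workaround with the AWS CLI; the names, subnets, and role ARN below are placeholders:

```sh
# 1. Create a replacement node group with the desired instance type.
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup-v2 \
  --instance-types m5.xlarge \
  --subnets subnet-aaaa1111 subnet-bbbb2222 \
  --node-role arn:aws:iam::111122223333:role/eksNodeRole \
  --scaling-config minSize=2,maxSize=6,desiredSize=3

# 2. After workloads have been moved over, delete the old node group.
aws eks delete-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup
```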

Additional context NA

Attachments NA

cdenneen commented 4 years ago

@badaldavda I believe the proposed method for any sort of node group update is to create a new node group and then delete the old one.

mikestef9 commented 4 years ago

This will be possible with #585

8398a7 commented 4 years ago

> @badaldavda I believe the proposed method for any sort of node group update is to create a new node group and then delete the old one.

In this case, the user needs to take the following steps:

  1. Add a taint to the old node group
  2. Move the Pods to the new node group
  3. When the move is complete, delete the old node group

Currently, when I delete a node group, all of its nodes appear to start terminating at the same time. I don't want them deleted simultaneously, so I have to move the Pods myself in step 2. This is a lot of work, and I'd also like managed node groups to support rolling updates of instance types.
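To make the manual migration concrete, here is a rough per-node sketch of the steps above, assuming the old node group is named old-ng and its nodes carry the standard eks.amazonaws.com/nodegroup label; all names are placeholders:

```sh
# Step 1: keep new Pods off the old node group (cordon has the same effect as
# a NoSchedule taint for this purpose).
for node in $(kubectl get nodes -l eks.amazonaws.com/nodegroup=old-ng -o name); do
  kubectl cordon "$node"
done

# Step 2: drain the old nodes one at a time so Pods move gradually to the new
# node group; DaemonSet Pods are skipped.
for node in $(kubectl get nodes -l eks.amazonaws.com/nodegroup=old-ng -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  # Verify the evicted Pods are Running on the new node group before continuing.
done

# Step 3: once everything has been rescheduled, remove the old node group.
aws eks delete-nodegroup --cluster-name my-cluster --nodegroup-name old-ng
```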

mikestef9 commented 4 years ago

While this will be possible with #585, one thing to note is that if you switch to a smaller instance type, there is a chance you could disrupt running workloads if there are not sufficient resources available on the instances after the update. We will document this behavior, but it is something to keep in mind if you choose to leverage this functionality.
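If you do plan to move to a smaller instance type, one rough way to sanity-check the risk beforehand is to compare what Pods currently request on each node against what the smaller type would make allocatable; a minimal, purely illustrative check:

```sh
# Show requested vs. allocatable resources per node; compare the request totals
# against the capacity of the smaller instance type you intend to switch to.
kubectl describe nodes | grep -A 8 "Allocated resources:"
```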

mikestef9 commented 4 years ago

Closing as this feature request is addressed by launch template support. See #585 for details!
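For anyone landing here later, the launch template path works roughly like this: publish a new launch template version that specifies the new instance type, then roll the node group onto that version, and EKS replaces the nodes while honoring the node group's update config. The template name and version below are placeholders:

```sh
# Roll the managed node group onto launch template version 2, which specifies
# the new instance type; nodes are replaced in a rolling fashion.
aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --launch-template name=my-node-template,version=2
```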

shibataka000 commented 3 years ago

@mikestef9 I don't think this issue is completely resolved in the case of Spot managed node groups.

As you said, we can perform a rolling update to change the instance type of a managed node group when we pass the instance type through the launch template. But for Spot managed node groups, passing an instance type through the launch template is NOT recommended, as far as I understand.

https://docs.aws.amazon.com/eks/latest/userguide/managed-node-groups.html#managed-node-group-capacity-types says:

> When deploying your node group with the Spot capacity type that's using a custom launch template, use the API to pass multiple instance types instead of passing a single instance type through the launch template. For more information about deploying a node group using a launch template, see Launch template support.

We still cannot perform a rolling update to change the instance types of managed node groups that have multiple instance types.

Could you reopen this issue? Or should I create another issue?
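For reference, a sketch of the Spot pattern the docs describe, where the instance types go to the EKS API and the launch template carries the rest of the node configuration; every name, subnet, and ARN below is a placeholder, and it is exactly this API-side instance type list that currently cannot be changed without replacing the node group:

```sh
# Spot managed node group: multiple instance types are passed to the API,
# not baked into the launch template.
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name spot-ng \
  --capacity-type SPOT \
  --instance-types m5.large m5a.large m4.large \
  --subnets subnet-aaaa1111 subnet-bbbb2222 \
  --node-role arn:aws:iam::111122223333:role/eksNodeRole \
  --launch-template name=my-node-template,version=1
```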

psyhomb commented 2 years ago

> Currently, when I delete a node group, all of its nodes appear to start terminating at the same time. I don't want them deleted simultaneously, so I have to move the Pods myself in step 2. This is a lot of work, and I'd also like managed node groups to support rolling updates of instance types.

This is still an issue and not possible in 2022. A rolling replacement only works when updating the AMI with max unavailable in the update config set to 1, because that reuses the same node group. If we change the instance type, which triggers creation of a new node group, we still hit the situation where all nodes in the old node group are drained and terminated at the same time, which is definitely not acceptable in a production environment because it can cause downtime. One possible solution would be to somehow apply the update config across node groups as well, so that if max unavailable is set to 1, nodes in the old node group are drained and terminated one at a time.
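For clarity, the knob referenced here is the node group's update config, which today only throttles replacements inside a single node group (for example an AMI rollout), not a migration between node groups; names below are placeholders:

```sh
# Constrain in-place rolling updates (e.g. AMI upgrades) to one node at a time.
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --update-config maxUnavailable=1
```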

arunsisodiya commented 1 year ago

Do we have any resolution on this? Still, in 2023, we are facing this issue. In a production environment it is really difficult to change the instance type, as the replacement does not follow a zero-downtime approach.

Does anyone have a workaround for EKS managed node groups to handle this situation?

wxGold commented 1 year ago

> Do we have any resolution on this? Still, in 2023, we are facing this issue. In a production environment it is really difficult to change the instance type, as the replacement does not follow a zero-downtime approach.
>
> Does anyone have a workaround for EKS managed node groups to handle this situation?

Hey @ArunSisodiya, have you managed to figure this out? I'm facing this issue as well when using the node group module with instance_types set on the node group resource rather than via a launch_template, for example changing instance_types = ["m6a.4xlarge", "m5a.4xlarge"] to instance_types = ["m6a.4xlarge"]:

```
Terraform will perform the following actions:

  # module.node_group_tools.aws_eks_node_group.this[0] must be replaced
+/- resource "aws_eks_node_group" "this" {
      ~ ami_type               = "CUSTOM" -> (known after apply)
      ~ arn                    = "arn:aws:eks:eu-west-2:218111588114:nodegroup/example/tools-Kpb_zA/x6x398f7-x1x5-x25x-067x-xx6xxx927dxx" -> (known after apply)
      ~ capacity_type          = "ON_DEMAND" -> (known after apply)
      ~ disk_size              = 0 -> (known after apply)
      ~ id                     = "example:tools-Kpb_zA" -> (known after apply)
      ~ instance_types         = [ # forces replacement
            # (1 unchanged element hidden)
            "m5a.4xlarge",
          - "m5.4xlarge",
        ]
      + node_group_name_prefix = (known after apply)
      ~ release_version        = "ami-020622bc6d23e2c90" -> (known after apply)
      ~ resources              = [
          - {
              - autoscaling_groups              = [
                  - {
                      - name = "eks-tools-Kpb_zA-x6x398f7-x1x5-x25x-067x-xx6xxx927dxx"
                    },
                ]
              - remote_access_security_group_id = ""
            },
        ] -> (known after apply)
      ~ status                 = "ACTIVE" -> (known after apply)
        tags                   = {
            "Client"                    = "example"
            "Environment"               = "dynamic"
            "Name"                      = "tools"
            "Owner"                     = "terraform"
            "kubernetes.io/cluster/eks" = "owned"
            "workload_type"             = "tools"
        }
      ~ version                = "1.22" -> (known after apply)
        # (6 unchanged attributes hidden)

      ~ launch_template {
            id      = "lt-0fb3efe295efbd2e6"
          ~ name    = "example-tools-lt-Kpb_zA" -> (known after apply)
            # (1 unchanged attribute hidden)
        }

      - update_config {
          - max_unavailable            = 1 -> null
          - max_unavailable_percentage = 0 -> null
        }

        # (2 unchanged blocks hidden)
    }

Plan: 1 to add, 0 to change, 1 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

module.node_group_tools.aws_eks_node_group.this[0]: Creating...
╷
│ Error: error creating EKS Node Group (example:tools-Kpb_zA): ResourceInUseException: NodeGroup already exists with name tools-Kpb_zA and cluster name example
│ {
│   RespMetadata: {
│     StatusCode: 409,
│     RequestID: "9a567eab-0983-424a-aefb-1d13dbd3b857"
│   },
│   ClusterName: "example",
│   Message_: "NodeGroup already exists with name tools-Kpb_zA and cluster name example",
│   NodegroupName: "tools-Kpb_zA"
│ }
│ 
│   with module.node_group_tools.aws_eks_node_group.this[0],
│   on .terraform/modules/node_group_tools/main.tf line 322, in resource "aws_eks_node_group" "this":
│  322: resource "aws_eks_node_group" "this" {
```

Why should the node group be recreated at all if the first instance type still persists? And even if it must be recreated, why not under a new name (I have added a random suffix), the way it works when I make other changes? The lifecycle is configured with create_before_destroy.