aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[EKS] [request]: Managed Node Groups support for node taints #864

Closed mikestef9 closed 3 years ago

mikestef9 commented 4 years ago

Community Note

Tell us about your request: Add support for tainting nodes through the managed node groups API.

Which service(s) is this request for? EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? Managed node groups support adding Kubernetes labels as part of node group creation. This makes it easy for all nodes in a node group to have consistent labels. However, taints are not supported through the API.

Are you currently working around this issue? Manual kubectl commands after new nodes in the node group come up.
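For example, after a scale-up the taint has to be applied by hand to each new node (the node name and taint below are placeholders):

```shell
kubectl taint nodes ip-10-0-1-23.ec2.internal dedicated=gpu:NoSchedule
```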

TBBle commented 4 years ago

When this was raised in #585, #507 was tagged as an existing request for this feature, but I think that was confusion... #507 seems to be about Container Insights correctly monitoring tainted nodes, while what we want here (and in #585) is to support setting the taints on Managed Nodegroups as part of a rollout, e.g. with eksctl.

The comment in #585 had nine thumbs-up, on top of the three currently here.

mikestef9 commented 4 years ago

@TBBle correct, I wanted to open a separate issue to explicitly track tainting node groups through the EKS API

karstenmueller commented 4 years ago

@mikestef9 we would like to see "tainting node groups through the EKS API" progressing; the 👍 count has bumped from 12 to 37 as of now.

aviau commented 4 years ago

It looks like the bootstrap script used by EKS nodes already supports taints. My understanding is that this would be a small feature to implement, because it would only require modifying the userdata in the launch template to add extra args, just as is done for labels currently.

AlbertoPeon commented 4 years ago

We would love to have this!

jhcook-ag commented 4 years ago

"When nodes are created dynamically by the Kubernetes autoscaler, they need to be created with the proper taint and label. With EKS, the taint and label can be specified in the Kubernetes kubelet service defined in the UserData section of the AWS autoscaling group LaunchConfiguration."

https://docs.cloudbees.com/docs/cloudbees-ci/latest/cloud-admin-guide/eks-auto-scaling-nodes

TBBle commented 4 years ago

@jhcook-ag You can't specify the UserData for Managed Node Groups when you create them.

You can modify the UserData in the Launch Configuration in the AWS console after creation, but then the Managed Node Groups feature will refuse to touch your Launch Configuration again, and you're effectively now using unmanaged Node Groups, although eksctl will still try to use the Managed Node Groups API and fail.

jhcook-ag commented 4 years ago

@mhausenblas we really need this 👍

borisputerka-zz commented 4 years ago

Absolutely would love the idea.

Lincon-Freitas commented 4 years ago

It is a must-have feature!

vcucereanu commented 4 years ago

👍

martinoravsky commented 4 years ago

This is a must-have feature for us as well. We can't use managed node groups because of this. When would you expect this to be released? (just roughly) 👍

Dudssource commented 4 years ago

Hi @martinoravsky, I believe this feature is available now.

https://aws.amazon.com/blogs/containers/introducing-launch-template-and-custom-ami-support-in-amazon-eks-managed-node-groups/

We did it by customizing the userdata on the custom launch template and specifying the taints for the kubelet (using the register-with-taints argument).
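A minimal sketch of that user data, assuming the stock /etc/eks/bootstrap.sh entry point on the AMI (the cluster name and taint values are placeholders):

```shell
#!/bin/bash
# Custom-AMI user data: run the EKS bootstrap script ourselves and pass the
# taint through --kubelet-extra-args (cluster name and taint are placeholders).
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--register-with-taints=dedicated=private:NoSchedule'
```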

martinoravsky commented 4 years ago

Hi @Dudssource ,

are you using custom AMIs? I'm using launch templates with EKS-optimized AMIs, which include UserData that bootstraps the node to the cluster automatically (with --kubelet-extra-args empty). This userdata is not editable for us; we can only add our own UserData as a MIME multipart file, which has no effect on bootstrapping the cluster. I'm curious if you were able to get this to work without custom AMIs.

Dudssource commented 4 years ago

@martinoravsky, yes, unfortunately we had to use a custom AMI for this to work. But we used the same optimized AMI that EKS uses; we use Terraform, so we used a data source to get the latest AMI for our cluster version. I know that this is possible with CloudFormation and Parameter Store too.

mikestef9 commented 4 years ago

The approach that @Dudssource used here is certainly an option, but we do plan to add taints directly to the EKS API (similar to labels), so that a custom AMI is not required.
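The taints support that eventually shipped with the close of this issue takes roughly this shape in the AWS CLI (cluster, subnet, and role values below are placeholders; the valid effects are NO_SCHEDULE, PREFER_NO_SCHEDULE, and NO_EXECUTE):

```shell
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name tainted-ng \
  --subnets subnet-0123456789abcdef0 \
  --node-role arn:aws:iam::123456789012:role/eksNodeRole \
  --taints key=dedicated,value=private,effect=NO_SCHEDULE
```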

lwimmer commented 4 years ago

I've found a solution (admittedly quite hackish) to allow setting taints with the official AMIs:

Set the userdata for the Launch Template similar to this:

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==7561478f-5b81-4e9d-9db6-aec8f463d2ab=="

--==7561478f-5b81-4e9d-9db6-aec8f463d2ab==
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
sed -i '/^KUBELET_EXTRA_ARGS=/a KUBELET_EXTRA_ARGS+=" --register-with-taints=foo=bar:NoSchedule"' /etc/eks/bootstrap.sh

--==7561478f-5b81-4e9d-9db6-aec8f463d2ab==--

This script is run before the bootstrap script, which is managed by EKS, patching the /etc/eks/bootstrap.sh to inject the necessary --register-with-taints in the KUBELET_EXTRA_ARGS variable. This solution is not perfect and might break if AWS changes the bootstrap script, but it works for now and can be used until there is proper support for taints.
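To see exactly what the sed does, here is a self-contained dry run of the same expression against a mock bootstrap.sh (the real file ships in the EKS-optimized AMI; the /tmp path and the foo=bar taint are only for the demo):

```shell
#!/bin/sh
# Create a mock bootstrap.sh containing the line the sed expression anchors on.
mkdir -p /tmp/eks-demo
printf '%s\n' '#!/bin/bash' 'KUBELET_EXTRA_ARGS=$2' > /tmp/eks-demo/bootstrap.sh

# Same patch as above: append a line right after the KUBELET_EXTRA_ARGS= line.
sed -i '/^KUBELET_EXTRA_ARGS=/a KUBELET_EXTRA_ARGS+=" --register-with-taints=foo=bar:NoSchedule"' /tmp/eks-demo/bootstrap.sh

cat /tmp/eks-demo/bootstrap.sh
```

After the patch, the kubelet picks up the extra flag when bootstrap.sh later expands KUBELET_EXTRA_ARGS.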

dwilliams782 commented 4 years ago

@lwimmer That is superbly hacky! Good work.

I'm really surprised this feature is missing, and overall I'm shocked how feature incomplete node groups are.

markthebault commented 3 years ago

+1

DovAmir commented 3 years ago

+1

ArchiFleKs commented 3 years ago

Thanks @lwimmer , I tried to implement your logic here in the official terraform module: https://github.com/terraform-aws-modules/terraform-aws-eks/pull/1138

andre-lx commented 3 years ago

Thanks @lwimmer. I made it work based on your solution in Terraform. This is my full node group with custom taints:

resource "aws_eks_node_group" "some_nodegroup" {
  node_group_name = "some_nodegroup"

  cluster_name  = aws_eks_cluster.eks_cluster.name
  node_role_arn = aws_iam_role.eks_nodegroup_role.arn
  subnet_ids    = aws_subnet.public_subnet.*.id

  instance_types = [
    ...,
  ]

  scaling_config {
    desired_size = ...
    max_size     = ...
    min_size     = ...
  }

  launch_template {
    id      = aws_launch_template.some_launch_template.id
    version = aws_launch_template.some_launch_template.latest_version
  }

  labels = {
    type = "some-label"
  }

  depends_on = [
    aws_iam_role_policy_attachment.iam-eks-nodegroup-AmazonEKSWorkerNodePolicy,
    aws_iam_role_policy_attachment.iam-eks-nodegroup-AmazonEKS_CNI_Policy,
    aws_iam_role_policy_attachment.iam-eks-nodegroup-AmazonEC2ContainerRegistryReadOnly,
  ]

  lifecycle {
    ignore_changes = [scaling_config[0].desired_size]
  }
}

resource "aws_launch_template" "some_launch_template" {
  name = "some_launch_template"

  block_device_mappings {
    device_name = "/dev/xvda"

    ebs {
      volume_size = 20
      volume_type = "gp2"
    }
  }

  user_data = base64encode(<<-EOF
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==7561478f-5b81-4e9d-9db6-aec8f463d2ab=="

--==7561478f-5b81-4e9d-9db6-aec8f463d2ab==
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
sed -i '/^KUBELET_EXTRA_ARGS=/a KUBELET_EXTRA_ARGS+=" --register-with-taints=some_taint1=true:NoSchedule,some_taint2=true:NoSchedule"' /etc/eks/bootstrap.sh

--==7561478f-5b81-4e9d-9db6-aec8f463d2ab==--
  EOF
  )

  tag_specifications {
    resource_type = "instance"

    tags = {
      Name = "..."
    }
  }
}

For multiple taints:

<key1>=<value1>:<effect1>,<key2>=<value2>:<effect2>

hintofbasil commented 3 years ago

I have found that the preBootstrapCommands parameter for managed node groups is a much easier way to add a taint than using user data.

preBootstrapCommands:
- sed -i '/^KUBELET_EXTRA_ARGS=/a KUBELET_EXTRA_ARGS+=" --register-with-taints=<key1>=<value1>:<effect1>"' /etc/eks/bootstrap.sh
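For reference, preBootstrapCommands lives at the node-group level of an eksctl ClusterConfig. A sketch, with placeholder cluster metadata and an example taint swapped in for the <key>/<value>/<effect> slots:

```yaml
# Sketch only: cluster name/region, node group sizing, and the taint values
# are placeholders.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: us-east-1
managedNodeGroups:
  - name: tainted-ng
    instanceType: m5.large
    desiredCapacity: 1
    preBootstrapCommands:
      - sed -i '/^KUBELET_EXTRA_ARGS=/a KUBELET_EXTRA_ARGS+=" --register-with-taints=dedicated=private:NoSchedule"' /etc/eks/bootstrap.sh
```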

lwimmer commented 3 years ago

I have found that the preBootstrapCommands parameter for managed node groups is a much easier way to add a taint than using user data

preBootstrapCommands are specific to eksctl, and not everyone is using eksctl (see the Terraform solutions here)

TBBle commented 3 years ago

eksctl implements preBootstrapCommands by populating user data in the launch template, exactly as is being done in the examples here.

josephprem commented 3 years ago

With user-data I ended up doing this. In my case I wanted to switch to systemd:

# Systemd does not support appending environment variables. Add a new variable
sed -i 's/KUBELET_EXTRA_ARGS/KUBELET_EXTRA_ARGS $EXTENDED_KUBELET_ARGS/' /etc/systemd/system/kubelet.service

cat << EOF > /etc/systemd/system/kubelet.service.d/9999-extended-kubelet-args.conf
[Service]
Environment='EXTENDED_KUBELET_ARGS=--cgroup-driver=systemd'
EOF
systemctl daemon-reload

headyj commented 3 years ago

Hi @andre-lx,

I have almost the same code as you, but I cannot apply my launch config:

error creating EKS Node Group: InvalidParameterException: Remote access configuration cannot be specified with a launch template.

I tried with and without key_name in my launch_config, but it's still not working. Any idea?

TBBle commented 3 years ago

@josephprem

With user-data I ended up doing this. In my case I wanted to switch to systemd

If you edit your comment to put a line with ``` (that's three back-ticks) before and after your user-data, it won't parse it as markdown. That will also fix GitHub turning the comment line into a top-level heading, i.e. it will look like this, which I imagine is what you intended?

# Systemd does not support appending environment variables. Add a new variable
sed -i 's/KUBELET_EXTRA_ARGS/KUBELET_EXTRA_ARGS $EXTENDED_KUBELET_ARGS/' /etc/systemd/system/kubelet.service

cat << EOF > /etc/systemd/system/kubelet.service.d/9999-extended-kubelet-args.conf
[Service]
Environment='EXTENDED_KUBELET_ARGS=--cgroup-driver=systemd'
EOF
systemctl daemon-reload

EvertonSA commented 3 years ago

I reckon the solutions presented here are not "managed service like". Since I pay 0.10 USD per hour of running clusters, I would expect not to have to mess around with the kubelet config.

I, as a customer, would like to add a flag to my node_group so that EKS understands that all new nodes (and current ones) have taints applied automatically.

I, as a customer, would like to use Terraform to provision the taint capability.

Another comment: this issue is the second most upvoted request in the containers roadmap. I hope AWS is not spending time on some "useless" features I see on the roadmap.

Overbryd commented 3 years ago

@EvertonSA I have been after this for about two years now, and I have solved it in two ways for my customers:

matglas commented 3 years ago

We have settled on option 1, so you will be able to specify zero as a minimum when creating a node group. For an update on implementation, we have decided to do the work in Cluster Autoscaler itself to pull labels, taints, and extended resources directly from the managed node groups API, rather than tagging the underlying ASG. You can follow the progress here

kubernetes/enhancements#2561

@mikestef9 as #724 is coming to a conclusion on managed node groups scaling and mentioning taints, will this topic come to a conclusion too?

crisp2u commented 3 years ago

@lwimmer this stopped working for me since yesterday. Any new group, or new node in a group that uses a launch template, is left in a "Not ready" state with the network not initialized. Groups without launch templates are fine. Any idea?

pierluigilenoci commented 3 years ago

@EvertonSA 🥳

Yeeeee

teochenglim commented 3 years ago

Guys,

I found the same implementation in the AWS Terraform workshop: https://github.com/aws-samples/terraform-eks-code/blob/master/extra/nodeg2/user_data.tf

Having said that, I am hoping to simply pass in a flag as a Terraform parameter, rather than go the twist-and-turn way and then worry that this feature stops working again at the next release, just like the eksctl preBootstrapCommands.

I don't know how difficult it would be to define this and plug the value in, just like a "helm chart" value, implemented in Terraform or eksctl. We, the community, can twist and turn to provide a solution. But overall I think all these reasonable production-ready features should be built in, allowing all of us to extend the functionality properly. This has to be answered by AWS and committed to; if in the next release it is wiped out again, why should anyone waste time doing it?

Cheers.

ArchiFleKs commented 3 years ago

Hi, this has been merged and it seems to still work with the official AMI.

I have just tested with the following configuration:

  node_groups = {
    "default-${local.aws_region}a" = {
      ami_type = "AL2_ARM_64"
      desired_capacity = 1
      max_capacity     = 3
      min_capacity     = 1
      instance_types    = ["t4g.large"]
      subnets          = [dependency.vpc.outputs.private_subnets[0]]
      disk_size        = 20
    }

    "default-${local.aws_region}b" = {
      ami_type = "AL2_ARM_64"
      desired_capacity = 1
      max_capacity     = 3
      min_capacity     = 1
      instance_types    = ["t4g.large"]
      subnets          = [dependency.vpc.outputs.private_subnets[1]]
      disk_size        = 20
    }

    "default-${local.aws_region}c" = {
      ami_type = "AL2_ARM_64"
      create_launch_template        = true
      desired_capacity = 1
      max_capacity     = 3
      min_capacity     = 1
      instance_types    = ["t4g.large"]
      subnets          = [dependency.vpc.outputs.private_subnets[2]]
      kubelet_extra_args = "--node-labels=role=private --register-with-taints=dedicated=private:NoSchedule"
      disk_size        = 20
    }
    "taint-${local.aws_region}c" = {
      create_launch_template        = true
      desired_capacity = 1
      max_capacity     = 3
      min_capacity     = 1
      instance_types    = ["t3a.large"]
      subnets          = [dependency.vpc.outputs.private_subnets[2]]
      kubelet_extra_args = "--node-labels=role=private --register-with-taints=dedicated=private:NoSchedule"
      disk_size        = 20
    }
  }

And it is working as expected; this PR is based on the fix found here

EKami commented 3 years ago

Hi, this has been merged and it seems to still work with official AMI. I have just tested with the following configuration: [...] And it is working as expected, this PR is based on the fix found here

Still not working for me, even with create_launch_template set to true :(

EvertonSA commented 3 years ago

Hi, this has been merged and it seems to still work with official AMI. I have just tested with the following configuration: [...] And it is working as expected, this PR is based on the fix found here

Thanks for the PR, I will try it out. Your solution is very elegant. Unfortunately, some of our already-running environments are provisioned using raw Terraform resources. I have no clue how much effort it would take to migrate to the terraform-aws-eks module. I might give it a shot on our development environments in the coming weeks.

Although I strongly support your development, I still think taints should be accepted here: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/eks_node_group. What happens if I open a ticket with Mr. Bezos and he replies that they won't go further with the incident because I'm using a community module (that modifies the kubelet default behavior) instead of the official product API?

I really don't know what the implications of using community modules are, but according to the documentation, only Enterprise Business Support can have "Third-party software support". So #AWS, if I encourage my team to fully migrate to this module, will I have issues with your definition of "Third-party software support"? Does "Third-party software support" include kubelet default behavior modifications?

We eagerly wait for a response.

EvertonSA commented 3 years ago

I could not find a definition for "Third-party software support" other than:

Third-party software support – Help with Amazon Elastic Compute Cloud (Amazon EC2) instance operating systems and configuration. Also, help with the performance of the most popular third-party software components on AWS. Third-party software support isn't available for customers on Basic or Developer Support plans.

teochenglim commented 3 years ago

Hello guys,

I have been working on this for a couple of months (with and without Terraform). It will not work no matter how hard you try. The problem is that EKS managed node groups plug their own user-data in behind your user-data: AWS creates a secondary launch template on your behalf, and the user-data on the running instance comes from that new launch template.

You can verify this on your EC2 node, and you can compare the launch template in the AWS console with the one attached to the running instance.

# ssh [your_eks_node]
$ curl http://169.254.169.254/latest/user-data
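Note: if the instance enforces IMDSv2 (token-required metadata access), the same check needs a session token first:

```shell
# Request an IMDSv2 session token, then fetch the instance user-data with it.
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/user-data
```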

You can also manually view the launch templates (sorted by most recent date).

You can still do it, but the node group creation ends with status "NodeCreationFailure" after waiting 20 minutes on each try.

Cheers, Cheng Lim

ArchiFleKs commented 3 years ago

Hi, this has been merged and it seems to still work with official AMI. I have just tested with the following configuration: [...] And it is working as expected, this PR is based on the fix found here

Thanks for the PR, I will try it out. [...] We eagerly wait for a response.

I honestly do not know about the support, but using a custom launch template is supposed to be supported on AWS. If you have support and are using the official AMI, I do not see why you would lose it. I guess the same question applies to people using a custom AMI that AWS has no way to verify; do they also lose support?

jfoechsler commented 3 years ago

@teochenglim Not sure what you are referring to, but providing user data in a managed node group launch template works fine and is merged into the EKS-created launch template.

edit: What I meant to say is: yes, there is a known working workaround, but this issue is about support in the AWS API, not a support forum for Terraform etc. I also missed the "Coming Soon" status on this issue :+1:

~The fact that this can be used as a workaround to add taints by modifying the EKS bootstrap should, in my view, obviously not be considered a solution. I don't know how this is even missing in EKS when taints have been a Kubernetes feature for a long time. In Azure managed node pools it has pretty much always been supported~

teochenglim commented 3 years ago

Hi ArchiFleKs,

My 2 cents: if everything needs to be custom, why EKS? We might as well run on-prem Kubernetes.

Yes, custom launch templates are an option that is supported on AWS now, but the feature has a bug.

And to be fair, people are mixing everything together now: some are talking about the Terraform module, some about eksctl, some about custom or managed node groups, and you are talking about the official AMI?

But based on my simple troubleshooting, an extra launch template is created and your managed node group points to that. This behaviour is the same whether you use Terraform or do it manually in the AWS console. I have yet to try eksctl, but why should I, since I am no longer using it?

teochenglim commented 3 years ago

@teochenglim Not sure what you are referring to, but providing user data in managed nodegroup launch template works fine and is merged in the EKS created launch template.

The fact that this can be used as workaround to add taints by modifying eks bootstrap should in my view obviously not be considered solution. I don't know how this is even missing in EKS when taints has been Kubernetes feature since long time ago. In Azure managed node pools it has pretty much always been supported:

az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name taintnp \
    --node-count 1 \
    --node-taints sku=gpu:NoSchedule \

I tried it today and it doesn't work for me. Can you show me your working version? This is EKS; why are you showing AKS? BTW, we are creating EKS using Terraform, so we can't add node groups using eksctl.

TBBle commented 3 years ago

Given this has gone from "We're Working On It" to "Coming Soon", presumably it's mostly done, and is being tested/validated/integrated, so "AWS sucks, everyone else has had this forever" isn't really a useful contribution.

Workarounds in the meantime are a useful contribution, I think, but support questions about them do generate a bit of noise in this ticket. Is there a Terraform-specific place to debug the Terraform-based workaround instead, so this ticket can remain focused on the Managed Node Groups API for this, and maybe just catalog the workarounds (all using custom launch templates now?).

If custom launch templates aren't working correctly, that's not really a "here" thing either. #585 would be closer, but this isn't really a support forum anyway, so you may not have much luck there.

ArchiFleKs commented 3 years ago

Hi ArchiFleKs,

My 2 cents, if everything need to custom, why EKS? [...] I am yet to try eksctl but why should i try it since i am no longer using it?

You still need tools to orchestrate your infrastructure, whether it is managed or not, even if you do it by hand with the AWS console, the awscli, CloudFormation, Terraform or eksctl.

I agree that the AWS EKS managed node group API should expose a native taints option like it does for labels. Exposing the kubelet args allows people to customize the kubelet as they wish; this allows power users to do custom configuration even with managed node groups.

Even when using a managed service, you still need an AMI, either the official one (by official I mean this one: https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html) or one you build yourself.

The behavior when building your own AMI is different from the official one when using user data, as explained here: https://docs.aws.amazon.com/eks/latest/userguide/launch-templates.html#launch-template-user-data. There is a merge step involved with the official AMI that you do not have when using a custom AMI (which can prevent the pre-bootstrap user data from being used).

If you can explain your bug in more detail, maybe someone here can help. We are trying to build tools (eksctl or terraform-aws-eks) to abstract this part for the user (just like a managed service does).

Personally I'm using the terraform-aws-eks module; this feature has just been released and is working, at least with the official AMI. I have not tested with a custom AMI.

Let me know if I can help you with this.

ArchiFleKs commented 3 years ago

Hi, this has been merged and it seems to still work with official AMI. I have just tested with the following configuration: [...] And it is working as expected, this PR is based on the fix found here

Still not working for me, even with create_launch_template set to true :(

Are you using the master version of the module? The latest release with this PR is only out today: https://github.com/terraform-aws-modules/terraform-aws-eks/releases/tag/v15.2.0

EKami commented 3 years ago

Are you using the master version of the module? The latest release with this PR is only out today: https://github.com/terraform-aws-modules/terraform-aws-eks/releases/tag/v15.2.0

Oh, I thought it was included in version 15.1.0, I'll try with version 15.2.0 then, thanks! :)

EvertonSA commented 3 years ago

Hi, this has been merged and it seems to still work with official AMI. I have just tested with the following configuration:

  node_groups = {
    "default-${local.aws_region}a" = {
      ami_type         = "AL2_ARM_64"
      desired_capacity = 1
      max_capacity     = 3
      min_capacity     = 1
      instance_types   = ["t4g.large"]
      subnets          = [dependency.vpc.outputs.private_subnets[0]]
      disk_size        = 20
    }

    "default-${local.aws_region}b" = {
      ami_type         = "AL2_ARM_64"
      desired_capacity = 1
      max_capacity     = 3
      min_capacity     = 1
      instance_types   = ["t4g.large"]
      subnets          = [dependency.vpc.outputs.private_subnets[1]]
      disk_size        = 20
    }

    "default-${local.aws_region}c" = {
      ami_type               = "AL2_ARM_64"
      create_launch_template = true
      desired_capacity       = 1
      max_capacity           = 3
      min_capacity           = 1
      instance_types         = ["t4g.large"]
      subnets                = [dependency.vpc.outputs.private_subnets[2]]
      kubelet_extra_args     = "--node-labels=role=private --register-with-taints=dedicated=private:NoSchedule"
      disk_size              = 20
    }
    "taint-${local.aws_region}c" = {
      create_launch_template = true
      desired_capacity       = 1
      max_capacity           = 3
      min_capacity           = 1
      instance_types         = ["t3a.large"]
      subnets                = [dependency.vpc.outputs.private_subnets[2]]
      kubelet_extra_args     = "--node-labels=role=private --register-with-taints=dedicated=private:NoSchedule"
      disk_size              = 20
    }
  }

And it is working as expected, this PR is based on the fix found here

Thanks for the PR, I will try it out. Your solution is very elegant. Unfortunately, some of our already running environments are provisioned using raw Terraform resources, and I have no idea how much effort it would take to migrate them to the terraform-aws-eks module. I might give it a shot on our development environments in the next few weeks.

Although I strongly support your development, I still think taints should be accepted here: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/eks_node_group. What happens if I open a ticket with Mr. Bezos and he replies that they won't go further with the incident because I'm using a community module (that modifies the kubelet's default behavior) instead of the official product API? I really don't know what the implications of using community modules are, but according to the documentation, only Enterprise Business Support includes "Third-party software support". So #AWS, if I encourage my team to fully migrate to this module, will I have issues with your definition of "Third-party software support"? Does "Third-party software support" include kubelet default behavior modifications? We eagerly await a response.
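For reference, native support on the `aws_eks_node_group` resource linked above would presumably surface as a first-class `taint` block, mirroring the Kubernetes taint fields. A sketch of that requested shape (not the shipped API at the time of this comment; resource names, role ARNs, and subnet references are placeholders):

```hcl
resource "aws_eks_node_group" "private" {
  cluster_name    = aws_eks_cluster.this.name
  node_group_name = "taint-example"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = [dependency.vpc.outputs.private_subnets[2]]

  labels = {
    role = "private"
  }

  # The requested feature: taints applied through the EKS API
  # instead of via kubelet flags in launch template user data
  taint {
    key    = "dedicated"
    value  = "private"
    effect = "NO_SCHEDULE"
  }

  scaling_config {
    desired_size = 1
    max_size     = 3
    min_size     = 1
  }
}
```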

I honestly do not know about the support side, but using a custom launch template is supposed to be supported on AWS. If you have support and are using the official AMI, I do not see why you would lose it. I guess the same question applies to people using a custom AMI that AWS has no way to verify: do they also lose support?

Yes, I had my ticket dropped a few years ago.

EvertonSA commented 3 years ago

Some features took 15 days to move from Coming Soon to Shipped. Others took months. How long should I wait? Does it make sense to use community Terraform workarounds now that we are on Coming Soon?

@TBBle "so "AWS sucks, everyone else has had this forever" isn't really a useful contribution." I totally disagree. As a product owner, I think this is a REALLY useful contribution to my product.

TBBle commented 3 years ago

How long should I wait? Does it make sense to use community Terraform workarounds now that we are on Coming Soon?

That depends on your needs and priorities. If you need a Terraform deployment today, then you can't wait, so don't wait. If you are just tracking this as a blocker for migrating to Managed Node Groups, and are happy with self-managed/unmanaged node groups in the meantime, then waiting is fine. (I'm in the latter boat, but it's not the only "migration-blocking" feature I'm tracking, and it really only applies to the "next cluster" I build, since existing clusters work now.)

As for the other part: since you stripped the context of my quote, including the important part, I'll requote it:

presumably it's mostly done, and is being tested/validated/integrated, so "AWS sucks, everyone else has had this forever" isn't really a useful contribution.

Leaving aside the toxic phrasing of this feedback, "AWS sucks, everyone else has had this forever" tells a Product Owner nothing about a feature which is already in the delivery pipeline. That sort of information is more useful when deciding if and where to prioritise a feature, or if the PO has (for whatever reason) never looked at their competition's offerings.

Once it's at the stage of the pipeline I presumed it to be at, it's very unlikely that someone is going to slap their forehead and say "Oh! We should just ship that, instead of sitting on the ready-to-go feature in order to feast on the tears of our users" (or whatever reaction one expects from such comments).

This is by far the most 👍'd feature request in the Coming Soon bucket (by a factor of 5 over its next-closest), and I certainly assume that the person/people managing this backlog can count.