aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[EKS] Managed NodeGroups: Enable Group Metrics Collection for created ASG #762

Closed YaraMohammed closed 1 week ago

YaraMohammed commented 4 years ago

Request: Add an option in managed node groups to enable Group Metrics Collection for the created ASG.

Which service(s) is this request for? EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? I'm trying to collect more metrics to get a good overview of the instances in service and keep track of recreated nodes.

Are you currently working around this issue? We enable metrics collection for the groups manually after they are created.

Description: Managed node groups create an ASG that is fully managed by the node group and has Group Metrics Collection disabled by default. This request is to allow enabling it for enhanced monitoring.

amazingandyyy commented 3 years ago

I came across this proposal while trying to enable metrics for node_groups when using terraform-aws-modules/eks/aws. I really think this is a feature we badly need.

YaodanZhang commented 3 years ago

Any update on this issue?

amazingandyyy commented 3 years ago

[updates] We filed a support ticket with AWS on this. They suggested we add our voice to this thread and turn on node group metrics manually as a workaround :/

javs-perez commented 3 years ago

Same suggestion from AWS, adding a +1 here to try influencing that roadmap.

mikestef9 commented 3 years ago

The ASG backing a managed node group is meant to be more of an implementation detail. I realize there is no charge for enabling this, so it is something we could do, but I'd like also to hear more details about what problems you are trying to solve that enabling ASG metrics would help with, that can't be currently solved by more Kubernetes native metrics options like Container Insights or Prometheus.

javs-perez commented 3 years ago

Hey, @mikestef9, thanks for the quick follow-up. We just started using EKS so maybe there is a better way of doing this.

Currently, we create a NodeGroup with ScalingConfig that has MinSize and a MaxSize. We ran into an issue not too long ago, where the number of healthy nodes went below the MinSize for a few mins. If this happened in the future we wanted to alert on it. We use Datadog, and we could create an alert where if healthy nodes are less than let's say 10, alert us. We wanted to make the alert more dynamic, and get the actual MinSize of the ASG. In case we change it in the future we don't have to change the alert.

Do you think there is a better way of achieving this alert? maybe this type of alert is not very useful when we are talking about EKS?

edit: To allow DataDog to collect ASG metrics, we have to enable MetricsCollection in the ASGs we want to monitor.
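The check described above can be expressed without a hard-coded threshold by comparing the in-service count against the group's own MinSize. A minimal sketch, assuming the input has the shape of one entry in the Auto Scaling `describe_auto_scaling_groups` response; the helper names and the sample group are illustrative and not from the thread:

```python
def in_service_count(asg):
    """Count instances that are both InService and Healthy in one ASG description."""
    return sum(
        1
        for inst in asg.get("Instances", [])
        if inst.get("LifecycleState") == "InService"
        and inst.get("HealthStatus") == "Healthy"
    )


def below_min_size(asg):
    """True when healthy in-service instances have dropped under the group's MinSize."""
    return in_service_count(asg) < asg["MinSize"]


# Hypothetical sample shaped like one entry of
# describe_auto_scaling_groups()["AutoScalingGroups"].
sample = {
    "AutoScalingGroupName": "eks-example-ng",
    "MinSize": 3,
    "MaxSize": 10,
    "Instances": [
        {"LifecycleState": "InService", "HealthStatus": "Healthy"},
        {"LifecycleState": "InService", "HealthStatus": "Unhealthy"},
        {"LifecycleState": "Terminating", "HealthStatus": "Healthy"},
    ],
}
print(below_min_size(sample))
```

Once group metrics are enabled, the same comparison can be made in a monitoring tool from the `GroupInServiceInstances` and `GroupMinSize` metrics, so the alert follows any later change to MinSize.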

HenryCook commented 3 years ago

Similar to what @javs-perez has mentioned, we are using Datadog and wish to alert on capacity e.g. % of running nodes out of the max size.

We had a problem where our cluster autoscaler had scaled to max capacity set for the managed node group, so we had pending pods due to insufficient resources. We can remedy this via pending pods potentially, but having these metrics would certainly be beneficial.
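For the "percentage of max size" alert, one option (outside Datadog) is a CloudWatch alarm built on metric math over the ASG group metrics. The sketch below only builds the keyword arguments for `put_metric_alarm`; the function name, alarm name, and default threshold are assumptions for illustration, and it presumes group metrics are already enabled on the ASG:

```python
def capacity_alarm_kwargs(asg_name, threshold_percent=90.0, period=60):
    """Build put_metric_alarm arguments that fire when in-service / max size is high."""

    def metric(metric_id, metric_name):
        # One input metric for the math expression; not returned as alarm data itself.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/AutoScaling",
                    "MetricName": metric_name,
                    "Dimensions": [
                        {"Name": "AutoScalingGroupName", "Value": asg_name}
                    ],
                },
                "Period": period,
                "Stat": "Maximum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"{asg_name}-capacity-used",  # hypothetical naming scheme
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "EvaluationPeriods": 1,
        "Threshold": threshold_percent,
        "Metrics": [
            metric("in_service", "GroupInServiceInstances"),
            metric("max_size", "GroupMaxSize"),
            {
                "Id": "used_percent",
                "Expression": "100 * in_service / max_size",
                "Label": "Percent of max size in service",
                "ReturnData": True,
            },
        ],
    }


kwargs = capacity_alarm_kwargs("eks-example-ng", threshold_percent=80.0)
print(kwargs["Metrics"][2]["Expression"])
```

The result could be passed as `boto3.client("cloudwatch").put_metric_alarm(**kwargs)`, with alarm actions added as needed. Because the threshold is a ratio, it does not need updating when the node group's max size changes.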

cdalar commented 3 years ago

I have created a script to automate this in my CI/CD pipeline. It only uses awscli and jq, so someone might benefit: https://gist.github.com/cdalar/f5749040ccb7487203738a134767e3fc

Note: adjust it to your needs, e.g. the --region value.

# Gets the FIRST cluster returned by list-clusters, assuming you only have one EKS cluster.
EKS_CLUSTER_NAME=$(aws eks list-clusters --region=eu-central-1 | jq -r '.clusters[0]')
echo "$EKS_CLUSTER_NAME"
# First node group from the list.
NG=$(aws eks list-nodegroups --cluster-name "$EKS_CLUSTER_NAME" | jq -r '.nodegroups[0]')
echo "$NG"
# Name of the first Auto Scaling group backing that node group.
ASG_NAME=$(aws eks describe-nodegroup --cluster-name "$EKS_CLUSTER_NAME" --nodegroup-name "$NG" | jq -r '.nodegroup.resources.autoScalingGroups[0].name')

# Enable Auto Scaling group metrics (all metrics, since no --metrics list is given).
aws autoscaling enable-metrics-collection --auto-scaling-group-name "$ASG_NAME" --granularity "1Minute"

# --------- Extra ----------
# Get the SNS topic ARN for alarms (first topic returned).
SNS_ARN=$(aws sns list-topics | jq -r '.Topics[0].TopicArn')
# EKS Auto Scaling capacity alarm.
# Reads the max size from a CloudFormation stack named "<cluster>-eks-nodegroup" with an EksAsgMaxSize parameter.
EKS_ASG_MAX_SIZE=$(aws cloudformation describe-stacks | jq -r --arg EKS_CLUSTER_NAME "$EKS_CLUSTER_NAME" '.Stacks[] | select( .StackName == $EKS_CLUSTER_NAME+"-eks-nodegroup")' | jq -r '.Parameters[] | select(.ParameterKey == "EksAsgMaxSize") | .ParameterValue')
aws cloudwatch put-metric-alarm --alarm-name "${EKS_CLUSTER_NAME}-EKS NodeGroup EksAsgCapacityAlarm" --evaluation-periods 1 --comparison-operator GreaterThanOrEqualToThreshold --metric-name GroupTotalInstances --period 600 --namespace AWS/AutoScaling --statistic Maximum --threshold "$EKS_ASG_MAX_SIZE" --dimensions "Name=AutoScalingGroupName,Value=$ASG_NAME" --ok-actions "$SNS_ARN" --alarm-actions "$SNS_ARN"

nanasi880 commented 3 years ago

Any update on this issue? I'm hoping that AWS Managed Services will provide a seamless integration. I'm sick of manually manipulating it.

michelzanini commented 2 years ago

Hi @mikestef9,

I will describe my use case. We use Datadog to monitor Kubernetes/EKS etc... In most cases, yes, you can use other Kubernetes metrics without depending on the ASG.

But there's a case where it's really useful. Imagine you scale your ASG to zero, or delete the node group. In that case the Datadog DaemonSet (or CloudWatch Container Insights) stops running, because there are no more nodes available. You then stop receiving metrics from K8s and no longer know whether you have nodes running or not.

With ASG metrics available, we can catch this case by monitoring the ASG metrics for running instances etc... Those won't stop as they come from the AWS integration of Datadog.
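The scale-to-zero case can also be caught on the AWS side with a plain CloudWatch alarm on `GroupInServiceInstances`. A minimal sketch, assuming group metrics are enabled; the function name, alarm name, and evaluation window are illustrative choices, and treating missing data as breaching is an assumption meant to cover the case where the ASG (and its metrics) disappear entirely:

```python
def no_nodes_alarm_kwargs(asg_name, sns_topic_arn):
    """Build put_metric_alarm arguments that fire when no instances are in service."""
    return {
        "AlarmName": f"{asg_name}-no-instances-in-service",  # hypothetical name
        "Namespace": "AWS/AutoScaling",
        "MetricName": "GroupInServiceInstances",
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
        "Statistic": "Minimum",
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": 0.0,
        "ComparisonOperator": "LessThanOrEqualToThreshold",
        # If the group is deleted its metrics stop; count that as a breach too.
        "TreatMissingData": "breaching",
        "AlarmActions": [sns_topic_arn],
    }


# Placeholder ARN for illustration only.
kwargs = no_nodes_alarm_kwargs(
    "eks-example-ng", "arn:aws:sns:eu-central-1:123456789012:example-topic"
)
print(kwargs["AlarmName"])
```

The dictionary could be passed to `boto3.client("cloudwatch").put_metric_alarm(**kwargs)`. Since the alarm is evaluated by CloudWatch, it keeps working when no in-cluster agent is running.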

Also, if the ASG metrics are free, why not enable them by default? Will it cause any issue to anyone? I guess not. So maybe there's no need to provide an option to enable/disable.

Just enable it by default! :)

orirawlings commented 2 years ago

@mikestef9 in my use case we also use DataDog for monitoring and we have alerts for when the ASG is at or near max capacity. I'm not sure of a way to track that directly against the EKS Managed Node Group resources and as far as I know CloudWatch doesn't have any metrics for EKS?

mikestef9 commented 2 years ago

Moved to in progress. We are going to enable this flag for newly created managed node groups. Follow this issue for further updates.

nahum-litvin-hs commented 2 years ago

import boto3

eks = boto3.client('eks')
autoscaling = boto3.client('autoscaling')

# Walk every cluster, every managed node group in it, and every ASG backing that node group.
clusters = eks.list_clusters()['clusters']
for cluster in clusters:
    print(f'cluster: {cluster}')
    nodegroups = eks.list_nodegroups(clusterName=cluster)["nodegroups"]
    for nodegroup in nodegroups:
        print(f'*nodegroup: {nodegroup}')
        autoScalingGroups = eks.describe_nodegroup(clusterName=cluster, nodegroupName=nodegroup)["nodegroup"]["resources"]["autoScalingGroups"]
        for autoScalingGroup in autoScalingGroups:
            print(f'##autoScalingGroup: {autoScalingGroup["name"]}')
            # With no Metrics list given, all group metrics are enabled.
            metricsResult = autoscaling.enable_metrics_collection(AutoScalingGroupName=autoScalingGroup["name"], Granularity="1Minute")
            print(f'@@@metricsResult: {metricsResult["ResponseMetadata"]["HTTPStatusCode"]}')

Python script to activate metrics on all ASGs.

woernfl commented 2 years ago

Hey, any news on this issue?

samuelbaena commented 2 years ago

Any updates?

rpsadarangani commented 2 years ago

Any Updates on this ?

yasinlachiny commented 1 year ago

Any Updates on this ?

sebastian-bugajny commented 1 year ago

Is there an update on this case?

akash123-eng commented 1 year ago

Is there any update on this? Our organization also needs this feature.

vishnu-anil commented 1 year ago

We also need this. Any update?

sebas-w commented 1 year ago

+1 for this, it would be great to be able to configure this for managed node groups!

gauravkohli commented 1 year ago

hey @mikestef9, do we have any updates on the progress of the implementation of this ?

aaroniscode commented 1 year ago

There is a new blog published today to enable this functionality using EventBridge and Lambda: https://aws.amazon.com/blogs/containers/automatically-enable-group-metrics-collection-for-amazon-eks-managed-node-groups/
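To make the EventBridge and Lambda approach concrete, here is a rough sketch of what such a handler could look like. It is not taken from the blog post. It assumes the rule matches the CloudTrail `CreateNodegroup` event and that the cluster and node group names arrive under `detail.requestParameters` as `name` and `nodegroupName`; those field names, and the optional `eks`/`autoscaling` parameters used for testing, are assumptions:

```python
def asg_names_for_nodegroup(eks, cluster, nodegroup):
    """Return the names of the Auto Scaling groups backing a managed node group."""
    resp = eks.describe_nodegroup(clusterName=cluster, nodegroupName=nodegroup)
    groups = resp["nodegroup"].get("resources", {}).get("autoScalingGroups", [])
    return [g["name"] for g in groups]


def handler(event, context, eks=None, autoscaling=None):
    """Enable group metrics on every ASG of the node group named in the event."""
    if eks is None or autoscaling is None:
        # Imported lazily so the logic can be exercised with stub clients.
        import boto3

        eks = eks or boto3.client("eks")
        autoscaling = autoscaling or boto3.client("autoscaling")

    # Assumed event shape; verify against a real event before relying on it.
    params = event["detail"]["requestParameters"]
    cluster, nodegroup = params["name"], params["nodegroupName"]

    names = asg_names_for_nodegroup(eks, cluster, nodegroup)
    for name in names:
        autoscaling.enable_metrics_collection(
            AutoScalingGroupName=name, Granularity="1Minute"
        )
    return names
```

One caveat with this sketch: the ASG may not exist yet at the moment the create call is logged, in which case the handler returns an empty list. A real deployment would likely need a delay or retry.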

z0rc commented 1 year ago

In all seriousness, this is not the solution people are asking for. It is a temporary workaround that offloads feature implementation to the customer, while everyone here is expecting a managed (as in Managed Node Group) solution from AWS.

orirawlings commented 1 year ago

I agree. Ideally, enabling auto scaling group metrics would be exposed as a field in the MNG API.

lorelei-rupp-imprivata commented 1 year ago

Yeah, I would have liked this to be exposed in the AWS Terraform provider when creating an aws_eks_node_group. It feels like it should just be another attribute to pass in, true/false etc.

michelzanini commented 1 year ago

If exposing an attribute takes too long to implement, why not change the behaviour to default to true, since these metrics are free? That way you can later add an attribute for people who want to disable it. I would imagine this should be a very quick implementation.

lorelei-rupp-imprivata commented 1 year ago

If exposing an attribute takes too long to implement, why not change the behaviour to default to true, since these metrics are free? That way you can later add an attribute for people who want to disable it. I would imagine this should be a very quick implementation.

I don't think it's free. They are ingested into CloudWatch and you still pay for it. I think that is why they are not on by default in the AWS Console: because it does cost money. I could be wrong though.

michelzanini commented 1 year ago

From the docs: "When group metrics are enabled, Amazon EC2 Auto Scaling sends the following metrics to CloudWatch. The metrics are available at one-minute granularity at no additional charge, but you must enable them."

https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-cloudwatch-monitoring.html

michelzanini commented 1 year ago

I had the impression AWS was going to do this as seen on this comment:

[Screenshot 2022-11-14 at 11 20 50]

nanasi880 commented 1 year ago

https://github.com/aws/containers-roadmap/issues/762#issuecomment-1312358678

This is a very bad idea. I do not want to see this kind of case "can be done with Lambda". I am convinced that this is a problem that should be solved natively by AWS and never by us users with Lambda scripts. I'm sure you're right, "You can run Python scripts in Lambda." I have received such guidance from AWS more times than I care to count. I just want the AWS Managed Services to work together properly. My massive AWS account is full of Lambdas to do these "workarounds" and then creates a barren job of updating them all as the runtime EoL comes along.

nanasi880 commented 1 year ago

I agree that the combination of EventBridge and Lambda is useful. However, pressing users to implement such an implementation should be limited to cases where the use case cannot be generalized. A Python script on Lambda is an easy solution, but it seems to me to be an incorrect approach. Such Lambda functions are a sign of liability.

Is the use case of wanting to enable metrics for auto-scaling groups created by managed node groups that special? Am I being selfish in my thinking? What do you all think?

StefanTUI commented 1 year ago

any updates on this?

seifrajhi commented 1 year ago

We are using a workaround to enable metrics collection for our managed nodegroups.

We would like to get rid of that workaround when we upgrade the cluster. It would be great to know if there is any progress on this issue.

Thank you in advance

Phil1602 commented 11 months ago

Moved to in progress. We are going to enable this flag for newly created managed node groups. Follow this issue for further updates.

Any updates on this @mikestef9 ?

Even on newly created EKS Clusters (Kubernetes 1.27, created 07/2023) this flag is not enabled by default (and not configurable via API either).

AlbertCCheng commented 9 months ago

Any update on this issue?

abstrask commented 9 months ago

I totally agree this should be solved by AWS, sooner rather than later.

In the meantime, in the spirit of @nahum-litvin-hs, I'm sharing a shell script we currently use to workaround this:

#!/bin/bash
set -eu -o pipefail

asg_count=$(aws autoscaling describe-auto-scaling-groups --filters "Name=tag:eks:cluster-name,Values=${CLUSTER_NAME}" --query "length(AutoScalingGroups[])" --output text)

echo "Found ${asg_count} auto-scaling groups associated with cluster \"${CLUSTER_NAME}\""

metrics_disabled=()
IFS=$'\n' read -r -d '' -a metrics_disabled < <( aws autoscaling describe-auto-scaling-groups --filters "Name=tag:eks:cluster-name,Values=${CLUSTER_NAME}" --query 'AutoScalingGroups[?length(EnabledMetrics)==`0`].[AutoScalingGroupName]' --output text && printf '\0' )

echo "${#metrics_disabled[@]} auto-scaling group(s) do not have metrics enabled"

for asg in "${metrics_disabled[@]+"${metrics_disabled[@]}"}" # workaround to avoid "unbound variable" when array is empty
do
    echo "Enabling metrics on \"${asg}\""
    aws autoscaling enable-metrics-collection --granularity 1Minute --auto-scaling-group-name "${asg}"
done

This is scheduled to run every hour with a Kubernetes CronJob on all our clusters. We use IRSA with tightly-scoped permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DescribeASG",
            "Effect": "Allow",
            "Action": [
                "autoscaling:DescribeAutoScalingGroups"
            ],
            "Resource": "*"
        },
        {
            "Sid": "EnableMetrics",
            "Effect": "Allow",
            "Action": [
                "autoscaling:EnableMetricsCollection",
                "autoscaling:DisableMetricsCollection"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/eks:cluster-name": [ "${cluster_name}" ]
                }
            }
        }
    ]
}

vl-kp commented 8 months ago

Any update?

dmonagha commented 8 months ago

Any updates on this issue?

mmerickel commented 3 months ago

Moved to "coming soon" 2 years ago? Would like to see this feature - thanks!

barfle commented 3 months ago

Just ran into this as well. Reading this thread it seems AWS has abandoned the request. It would be nice to have someone from AWS write in and confirm though, as now I have to put engineers on this without knowing if AWS plan to fix it tomorrow or never.

mmerickel commented 3 months ago

It’s probably worth noting there is an “official” but very unsatisfying and fairly complicated solution in the docs for something that should be so simple. https://docs.aws.amazon.com/eks/latest/userguide/enable-asg-metrics.html

wellermann commented 3 months ago

An update on this would be really appreciated. Thanks!

pcholakov commented 1 month ago

There is a new blog published today to enable this functionality using EventBridge and Lambda: https://aws.amazon.com/blogs/containers/automatically-enable-group-metrics-collection-for-amazon-eks-managed-node-groups/

This is a poor and partial solution that saddles the customers with operational debt. We should be able to enable ASG metrics declaratively, and also easily obtain the ASG name so that they can be used, e.g. to define alarms directly in infrastructure code. Ideally the ASG name should be output by AWS::EKS::Nodegroup so that it can be wired into other IaC-provisioned resources.

Yes, this can all be done with a script or a custom resource directly through the API. But native support for features like CloudWatch metrics and alarming is exactly what we as AWS customers expect. This is not just some hidden implementation detail. By hiding it, you're just adding friction to the composability of EKS on the overall AWS platform.

ezloj commented 1 month ago

This should really just be a checkbox in the managed node group creation + an API option for terraform provider. Please implement that 🙏

ohad258 commented 2 weeks ago

Any update?

akestner commented 1 week ago

:rocket: As of July 8, 2024, every new EKS managed node group has EC2 Autoscaling Group metrics enabled. If you want to enable Autoscaling Group metrics on existing EKS managed node groups, you can do so via the EC2 Autoscaling Group APIs or Console [1], or you can create new, equivalent EKS managed node groups. Read more about EKS managed node groups and EC2 Autoscaling Groups metrics here [1], [2].

[1]: https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-metrics.html
[2]: https://docs.aws.amazon.com/eks/latest/userguide/create-managed-node-group.html
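For node groups created before that date, enabling metrics through the Auto Scaling API could look roughly like the sketch below. It reuses the `eks:cluster-name` tag that an earlier shell script in this thread filters on; the function names and the sample data are illustrative assumptions:

```python
def asgs_without_metrics(pages, cluster_name):
    """Yield names of ASGs tagged for the cluster that have no group metrics enabled."""
    for page in pages:
        for asg in page["AutoScalingGroups"]:
            tags = {t["Key"]: t["Value"] for t in asg.get("Tags", [])}
            if tags.get("eks:cluster-name") != cluster_name:
                continue
            if not asg.get("EnabledMetrics"):
                yield asg["AutoScalingGroupName"]


def enable_missing_metrics(autoscaling, cluster_name):
    """Enable all group metrics on the cluster's ASGs that currently have none."""
    paginator = autoscaling.get_paginator("describe_auto_scaling_groups")
    names = list(asgs_without_metrics(paginator.paginate(), cluster_name))
    for name in names:
        autoscaling.enable_metrics_collection(
            AutoScalingGroupName=name, Granularity="1Minute"
        )
    return names


# Hypothetical page shaped like a describe_auto_scaling_groups response.
sample_pages = [
    {
        "AutoScalingGroups": [
            {
                "AutoScalingGroupName": "eks-old-ng",
                "Tags": [{"Key": "eks:cluster-name", "Value": "prod"}],
                "EnabledMetrics": [],
            },
            {
                "AutoScalingGroupName": "eks-new-ng",
                "Tags": [{"Key": "eks:cluster-name", "Value": "prod"}],
                "EnabledMetrics": [
                    {"Metric": "GroupInServiceInstances", "Granularity": "1Minute"}
                ],
            },
        ]
    }
]
print(list(asgs_without_metrics(sample_pages, "prod")))
```

With real credentials, `enable_missing_metrics(boto3.client("autoscaling"), "prod")` would apply the change; the filtering is kept separate so it can be checked against sample data first.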