Closed YaraMohammed closed 1 week ago
I came across this proposal while trying to enable metrics for node_groups when using terraform-aws-modules/eks/aws. I really think this is a feature we very much need.
Any update on this issue?
[update] We filed a support ticket with AWS on this, and they suggested we add our voice to this thread and turn on node group metrics manually as a workaround :/
Same suggestion from AWS; adding a +1 here to try to influence that roadmap.
The ASG backing a managed node group is meant to be more of an implementation detail. I realize there is no charge for enabling this, so it is something we could do, but I'd also like to hear more details about what problems you are trying to solve that enabling ASG metrics would help with, that can't currently be solved by more Kubernetes-native metrics options like Container Insights or Prometheus.
Hey, @mikestef9, thanks for the quick follow-up. We just started using EKS so maybe there is a better way of doing this.
Currently, we create a `NodeGroup` with a `ScalingConfig` that has a `MinSize` and a `MaxSize`. We ran into an issue not too long ago where the number of healthy nodes went below the `MinSize` for a few minutes. If this happened in the future, we wanted to alert on it. We use Datadog, and we could create an alert like "if healthy nodes are fewer than, say, 10, alert us". We wanted to make the alert more dynamic and use the actual `MinSize` of the ASG, so that if we change it in the future we don't have to change the alert.
Do you think there is a better way of achieving this alert? Maybe this type of alert is not very useful when we are talking about EKS?
edit: To allow Datadog to collect ASG metrics, we have to enable `MetricsCollection` on the ASGs we want to monitor.
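For reference, once metrics collection is enabled on the ASG, a dynamic Datadog monitor can compare the in-service count against the group's own minimum instead of a hard-coded number. A sketch of such a monitor query (metric names are from Datadog's AWS Auto Scaling integration; the `autoscaling_group` tag value is a placeholder you would replace with your node group's ASG):

```
min(last_10m):avg:aws.autoscaling.group_in_service_instances{autoscaling_group:my-eks-nodegroup-asg} - avg:aws.autoscaling.group_min_size{autoscaling_group:my-eks-nodegroup-asg} < 0
```

Because the threshold is expressed relative to `group_min_size`, changing the node group's `MinSize` later does not require touching the monitor.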
Similar to what @javs-perez has mentioned, we are using Datadog and wish to alert on capacity e.g. % of running nodes out of the max size.
We had a problem where our cluster autoscaler had scaled to max capacity set for the managed node group, so we had pending pods due to insufficient resources. We can remedy this via pending pods potentially, but having these metrics would certainly be beneficial.
I have created a script to automate this in my CI/CD pipeline. It only uses the awscli and jq, so someone might benefit: https://gist.github.com/cdalar/f5749040ccb7487203738a134767e3fc
Note: adjust it to your needs, e.g. the `--region` flag.
```shell
# Gets the FIRST cluster from list-clusters. Assumes you only have 1 EKS cluster.
EKS_CLUSTER_NAME=$(aws eks list-clusters --region eu-central-1 | jq -r '.clusters[0]')
echo "$EKS_CLUSTER_NAME"

# First node group from the list.
NG=$(aws eks list-nodegroups --cluster-name "$EKS_CLUSTER_NAME" | jq -r '.nodegroups[0]')
echo "$NG"

# First Auto Scaling group name.
ASG_NAME=$(aws eks describe-nodegroup --cluster-name "$EKS_CLUSTER_NAME" --nodegroup-name "$NG" | jq -r '.nodegroup.resources.autoScalingGroups[0].name')

# Enable Auto Scaling group metrics.
aws autoscaling enable-metrics-collection --auto-scaling-group-name "$ASG_NAME" --granularity "1Minute"

# --------- Extra ----------
# Get the SNS topic ARN for alarms.
SNS_ARN=$(aws sns list-topics | jq -r '.Topics[0].TopicArn')

# EKS Auto Scaling capacity alarm.
EKS_ASG_MAX_SIZE=$(aws cloudformation describe-stacks | jq -r --arg EKS_CLUSTER_NAME "$EKS_CLUSTER_NAME" '.Stacks[] | select(.StackName == $EKS_CLUSTER_NAME + "-eks-nodegroup")' | jq -r '.Parameters[] | select(.ParameterKey == "EksAsgMaxSize") | .ParameterValue')
aws cloudwatch put-metric-alarm --alarm-name "${EKS_CLUSTER_NAME}-EKS NodeGroup EksAsgCapacityAlarm" --evaluation-periods 1 --comparison-operator GreaterThanOrEqualToThreshold --metric-name GroupTotalInstances --period 600 --namespace AWS/AutoScaling --statistic Maximum --threshold "$EKS_ASG_MAX_SIZE" --dimensions Name=AutoScalingGroupName,Value="$ASG_NAME" --ok-actions "$SNS_ARN" --alarm-actions "$SNS_ARN"
```
Any update on this issue? I'm hoping that AWS Managed Services will provide a seamless integration. I'm tired of configuring this manually.
Hi @mikestef9,
I will describe my use case. We use Datadog to monitor Kubernetes/EKS etc... In most cases, yes, you can use other Kubernetes metrics without depending on the ASG.
But there's a case where it's really useful. Imagine you scale your ASG to zero, or delete the node group. What happens in that case is that the Datadog DaemonSet (or CloudWatch Container Insights) gets uninstalled (there are no more nodes available). You then stop receiving metrics from Kubernetes and no longer know whether you have nodes running or not.
With ASG metrics available, we can catch this case by monitoring the ASG metrics for running instances etc... Those won't stop as they come from the AWS integration of Datadog.
Also, if the ASG metrics are free, why not enable them by default? Would it cause any issues for anyone? I guess not. So maybe there's no need to provide an option to enable/disable.
Just enable it by default! :)
@mikestef9 in my use case we also use DataDog for monitoring and we have alerts for when the ASG is at or near max capacity. I'm not sure of a way to track that directly against the EKS Managed Node Group resources and as far as I know CloudWatch doesn't have any metrics for EKS?
Moved to in progress. We are going to enable this flag for newly created managed node groups. Follow this issue for further updates.
```python
import boto3

eks = boto3.client('eks')
autoscaling = boto3.client('autoscaling')

clusters = eks.list_clusters()['clusters']
for cluster in clusters:
    print(f'cluster: {cluster}')
    nodegroups = eks.list_nodegroups(clusterName=cluster)['nodegroups']
    for nodegroup in nodegroups:
        print(f'* nodegroup: {nodegroup}')
        autoScalingGroups = eks.describe_nodegroup(clusterName=cluster, nodegroupName=nodegroup)['nodegroup']['resources']['autoScalingGroups']
        for autoScalingGroup in autoScalingGroups:
            print(f'## autoScalingGroup: {autoScalingGroup["name"]}')
            metricsResult = autoscaling.enable_metrics_collection(AutoScalingGroupName=autoScalingGroup['name'], Granularity='1Minute')
            print(f'@@@ metricsResult: {metricsResult["ResponseMetadata"]["HTTPStatusCode"]}')
```
Python script to activate metrics on all ASGs.
Hey, any news on this issue?
Any updates?
Any Updates on this ?
Any Updates on this ?
Is there any update on this case?
Is there any update on this? Our organization also needs this feature.
We also need this. Any update?
+1 for this, it would be great to be able to configure this for managed node groups!
hey @mikestef9, do we have any updates on the progress of the implementation of this ?
There is a new blog published today to enable this functionality using EventBridge and Lambda: https://aws.amazon.com/blogs/containers/automatically-enable-group-metrics-collection-for-amazon-eks-managed-node-groups/
In all seriousness, this is not the solution people are asking for. This is a temporary workaround that offloads feature implementation to the customer, while everyone here is expecting this to be a managed (as in Managed Node Group) solution from AWS.
I agree. Ideally, enabling auto scaling group metrics would be exposed as a field in the MNG API.
Yeah, I would have liked this to be exposed in the AWS Terraform provider when creating an `aws_eks_node_group`.
It feels like it should just be another attribute to pass in, true/false etc.
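Until something like that attribute exists, one Terraform-side workaround is to shell out to the AWS CLI after the node group is created. A minimal sketch, assuming a node group named `aws_eks_node_group.example` (the resource names here are illustrative, and the `resources` attribute is only populated once the node group is active):

```hcl
resource "null_resource" "enable_asg_metrics" {
  # Re-run whenever the node group is replaced.
  triggers = {
    nodegroup = aws_eks_node_group.example.id
  }

  provisioner "local-exec" {
    command = <<-EOT
      aws autoscaling enable-metrics-collection \
        --granularity 1Minute \
        --auto-scaling-group-name ${aws_eks_node_group.example.resources[0].autoscaling_groups[0].name}
    EOT
  }
}
```

The obvious downside is that it depends on the AWS CLI being available wherever Terraform runs, which is exactly the kind of glue a native attribute would eliminate.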
If exposing an attribute takes too long to implement, why not change the behaviour to default to `true`, since these metrics are free? That way, you can later add an attribute for people who want to disable it. I would imagine this should be a very quick implementation.
> If exposing an attribute takes too long to implement, why not change the behaviour to default to `true` since these metrics are free. That way later on you can add an attribute for people that want to disable it.... I would imagine this should be a very quick implementation to do.
I don't think it's free. The metrics are ingested into CloudWatch and you still pay for that. I think that is why they are not on by default in the AWS Console, because it does cost money. I could be wrong though.
From the docs: "When group metrics are enabled, Amazon EC2 Auto Scaling sends the following metrics to CloudWatch. The metrics are available at one-minute granularity at no additional charge, but you must enable them."
https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-cloudwatch-monitoring.html
I had the impression AWS was going to do this as seen on this comment:
https://github.com/aws/containers-roadmap/issues/762#issuecomment-1312358678
This is a very bad idea. I do not want to see this kind of "can be done with Lambda" case. I am convinced that this is a problem that should be solved natively by AWS, and never by us users with Lambda scripts. I'm sure you're right that "you can run Python scripts in Lambda"; I have received such guidance from AWS more times than I care to count. I just want AWS managed services to work together properly. My massive AWS account is full of Lambdas doing these "workarounds", which then creates a thankless job of updating them all as each runtime EoL comes along.
I agree that the combination of EventBridge and Lambda is useful. However, pushing users toward such an implementation should be limited to cases where the use case cannot be generalized. A Python script on Lambda is an easy solution, but it seems to me to be the wrong approach. Such Lambda functions are a liability.
Is the use case of wanting to enable metrics for auto-scaling groups created by managed node groups that special? Am I being selfish in my thinking? What do you all think?
any updates on this?
We are using a workaround to enable metrics collection for our managed nodegroups.
We would like to get rid of that workaround when we upgrade the cluster; it would be great to know whether there has been any progress on this issue.
Thank you in advance.
> Moved to in progress. We are going to enable this flag for newly created managed node groups. Follow this issue for further updates.
Any updates on this @mikestef9 ?
Even on newly created EKS clusters (Kubernetes 1.27, created 07/2023), this flag is not enabled by default (and is not configurable via the API either).
Any update on this issue?
I totally agree this should be solved by AWS, sooner rather than later.
In the meantime, in the spirit of @nahum-litvin-hs, I'm sharing a shell script we currently use to workaround this:
```shell
#!/bin/bash
set -eu -o pipefail

asg_count=$(aws autoscaling describe-auto-scaling-groups --filters "Name=tag:eks:cluster-name,Values=${CLUSTER_NAME}" --query "length(AutoScalingGroups[])" --output text)
echo "Found ${asg_count} auto-scaling groups associated with cluster \"${CLUSTER_NAME}\""

metrics_disabled=()
IFS=$'\n' read -r -d '' -a metrics_disabled < <( aws autoscaling describe-auto-scaling-groups --filters "Name=tag:eks:cluster-name,Values=${CLUSTER_NAME}" --query 'AutoScalingGroups[?length(EnabledMetrics)==`0`].[AutoScalingGroupName]' --output text && printf '\0' )
echo "${#metrics_disabled[@]} auto-scaling group(s) do not have metrics enabled"

for asg in "${metrics_disabled[@]+"${metrics_disabled[@]}"}" # workaround to avoid "unbound variable" when the array is empty
do
  echo "Enabling metrics on \"${asg}\""
  aws autoscaling enable-metrics-collection --granularity 1Minute --auto-scaling-group-name "${asg}"
done
```
This is scheduled to run every hour with a Kubernetes CronJob on all our clusters. We use IRSA with tightly-scoped permissions:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DescribeASG",
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups"
      ],
      "Resource": "*"
    },
    {
      "Sid": "EnableMetrics",
      "Effect": "Allow",
      "Action": [
        "autoscaling:EnableMetricsCollection",
        "autoscaling:DisableMetricsCollection"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/eks:cluster-name": [ "${cluster_name}" ]
        }
      }
    }
  ]
}
```
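The hourly schedule described above could be wired up roughly like this CronJob. This is a sketch, not our exact manifest: the names, image tag, and the ConfigMap holding the script are assumptions, and the service account is expected to carry the IRSA role annotation for the policy above:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: enable-asg-metrics
spec:
  schedule: "0 * * * *"          # every hour
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: enable-asg-metrics   # IRSA-annotated service account
          restartPolicy: Never
          containers:
            - name: enable-asg-metrics
              image: amazon/aws-cli:2.15.0         # ships bash + aws CLI
              command: ["/bin/bash", "/scripts/enable-asg-metrics.sh"]
              env:
                - name: CLUSTER_NAME
                  value: my-cluster
              volumeMounts:
                - name: scripts
                  mountPath: /scripts
          volumes:
            - name: scripts
              configMap:
                name: enable-asg-metrics           # holds the shell script above
```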
Any update?
Any updates on this issue?
Moved to "coming soon" 2 years ago? Would like to see this feature - thanks!
Just ran into this as well. Reading this thread, it seems AWS has abandoned the request. It would be nice to have someone from AWS write in and confirm, though, as now I have to put engineers on this without knowing whether AWS plans to fix it tomorrow or never.
It’s probably worth noting there is an “official” but very unsatisfying and fairly complicated solution in the docs for something that should be so simple. https://docs.aws.amazon.com/eks/latest/userguide/enable-asg-metrics.html
An update on this would be really appreciated. Thanks!
> There is a new blog published today to enable this functionality using EventBridge and Lambda: https://aws.amazon.com/blogs/containers/automatically-enable-group-metrics-collection-for-amazon-eks-managed-node-groups/
This is a poor and partial solution that saddles the customers with operational debt. We should be able to enable ASG metrics declaratively, and also easily obtain the ASG name so that they can be used, e.g. to define alarms directly in infrastructure code. Ideally the ASG name should be output by AWS::EKS::Nodegroup so that it can be wired into other IaC-provisioned resources.
Yes, this can all be done with a script or a custom resource directly through the API. But native support for features like CloudWatch metrics and alarming is exactly what we as AWS customers expect. This is not just some hidden implementation detail. By hiding it, you're just adding friction to the composability of EKS on the overall AWS platform.
This should really just be a checkbox in the managed node group creation + an API option for terraform provider. Please implement that 🙏
Any update?
:rocket: As of July 8, 2024, every new EKS managed node group has EC2 Autoscaling Group metrics enabled. If you want to enable Autoscaling Group metrics on existing EKS managed node groups, you can do so via the EC2 Autoscaling Group APIs or Console [1], or you can create new, equivalent EKS managed node groups. Read more about EKS managed node groups and EC2 Autoscaling Groups metrics here [1], [2].
[1] : https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-metrics.html [2] : https://docs.aws.amazon.com/eks/latest/userguide/create-managed-node-group.html
**Request**
Add an option in managed node groups to enable Group Metrics Collection for the created ASG.
**Which service(s) is this request for?**
EKS
**Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?**
I'm trying to collect more metrics to have a good overview of the instances in service and keep track of recreated nodes.
**Are you currently working around this issue?**
We enable metrics collection for the groups manually after they are created.
**Description**
Managed node groups create an ASG which is fully managed by the node group and has Group Metrics Collection disabled by default. This request is to enable more enhanced monitoring.