[EKS] [request]: EKS Control Plane Metrics Available In CloudWatch

crhuber commented 4 years ago

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request

In some scenarios it is useful for Kubernetes operators to know the health of the EKS control plane. Some applications or pods may overload the control plane and it can be helpful to know this. Having control plane metrics in cloudwatch such as:

apiserverRequestCount
apiserverRequestErrCount
apiserverLatencyBucket
kubeNodes
kubePods

can help customers diagnosing slowness or unresponsiveness to the control plane

Which service(s) is this request for? EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? Sometimes if the control plane is slow we would like to know if there has been a spike in requests to the API, is there a spike in amount of errors. Did we have a spike in new pods .

Are you currently working around this issue? Scraping the /metrics endpoint on the Kubernetes service

mchene commented 4 years ago

Hey everyone, I’m a Product Manager for CloudWatch. We are looking for people to join our beta program to provide feedback and test Prometheus metric monitoring in CloudWatch. The beta program will allow you to test the collection of the EKS Control Plane Metrics exposed as Prometheus metrics. Email us if interested, containerinsightsbetafeedback@amazon.com.

starchx commented 3 years ago

Can we include the cluster component status into the CloudWatch as well, for example:

kube controller manager
scheduler (http://localhost:8001/api/v1/componentstatuses/scheduler)

These can be used to set up CloudWatch alarm when a custom webhook breaks the component, for example, a newly installed ValidatingWebhook that breaks the Scheduler renew lease calls.

tpsk-hub commented 3 years ago

@starchx - https://aws.amazon.com/about-aws/whats-new/2020/09/amazon-cloudwatch-monitors-prometheus-metrics-container-environments/. You can use CloudWatch Prometheus agent to support above use case. In the first phase (already available), we encourage you to configure the agent to consume Control plane metrics for EKS and leverage CloudWatch alarms. In the second phase, we will also build an automated and out of the box dashboard for EKS Control plane. Check out this workshop to learn more: https://observability.workshop.aws/en/containerinsights/eks/_prometheusmonitoring.html

kespinola commented 3 years ago

I don't mind scraping the endpoints myself since I use Datadog for monitoring but not having access to the schedulers or control plane manger metrics endpoint is tough. For example, without access to the kube scheduler my team and I are unable to track "time to schedule a pod" which is a key service level indicator for us.

https://github.com/DataDog/integrations-core/blob/master/kube_scheduler/datadog_checks/kube_scheduler/kube_scheduler.py#L41

rohitkothari commented 2 years ago

Since it's been 8 months since @kespinola asked about kube scheduler metrics, checking back on the same here with AWS. Are there any plans on exposing kube-scheduler metrics?

It looks like Container Insights Metrics and the Control Plane Metrics for EKS do not yet expose metrics from kube-scheduler.

frimik commented 2 years ago

One of the most important metrics of them all, e2e_scheduling_duration_seconds, is not available. Can we please somehow get access to the scheduler metrics?

PrayagS commented 2 years ago

My team is also trying to fetch and perform analysis on the metrics being reported by kube-scheduler. Can you folks please update us with the proposed timeline for this feature? At least add this component in the feature request since a lot of folks need this as evident from the comments.

sumanthkumarc commented 2 years ago

+1 for exposing important metrics of control plane, especially kube-scheduler helps us understand the overall scheduling latency, useful especially when we have nodegroups mixed with on-demand, spot instances. the metrics scheduler_pod_scheduling_duration_seconds would be useful in these use cases.

yuvraj9 commented 2 years ago

We would love to have kube scheduler metrics available so we can scrape via Prometheus.

mikestef9 commented 2 years ago

We are looking into this, are there any other ones of interest besides the ones mentioned already?

scheduler_pod_scheduling_duration_seconds e2e_scheduling_duration_seconds

kr3cj commented 2 years ago

I'm not sure if this is in scope for the EKS control plane metrics. But we currently get all the kube_apiserver.* metrics from EKS into Datadog via a custom helm chart post-install-hook (hook runs a kubectl patch svc/kubernetes command to add datadog annotations which allows our datadog-clusterchecks deployment to grab the metrics from apiserver). But it'd be nice if we could get them natively from cloudwatch instead.

elementalvoid commented 2 years ago

Perhaps also:

scheduling_attempt_duration_seconds
pod_scheduling_attempts
pending_pods
scheduling_algorithm_duration_seconds

It may be worth noting that e2e_scheduling_duration_seconds has been replaced by scheduling_attempt_duration_seconds. The former is marked as Alpha status while the latter is considered Stable.

In a slightly different approach than listing individual metric names. Might I suggest that all STABLE metrics be made available?

I'm not sure if Alpha level metrics should have the same treatment. All of the requested metrics except scheduling_algorithm_duration_seconds (which only I have mentioned above) are Stable metrics.

Here is the current list of all Stable metrics:

framework_extension_point_duration_seconds
pending_pods
pod_scheduling_attempts
pod_scheduling_duration_seconds
preemption_attempts_total
preemption_victims
queue_incoming_pods_total
schedule_attempts_total
scheduling_attempt_duration_seconds

And all Alpha metrics:

e2e_scheduling_duration_seconds
permit_wait_duration_seconds
plugin_execution_duration_seconds
scheduler_cache_size
scheduler_goroutines
scheduling_algorithm_duration_seconds

djmcgreal-cc commented 1 year ago

@vipin-mohan or others any progress on this? Specifically we need kube scheduler metrics, specifically kube_pod_resource_request

PS k8s docks talk about the scheduler metrics being available at an api endpoint /metrics/resources. EKS talks about /metrcs. Could EKS just expose the scheduler metrics through something similar, via a raw K8s API?

aws / containers-roadmap

[EKS] [request]: EKS Control Plane Metrics Available In CloudWatch #800

Community Note