Closed themish95 closed 2 years ago
I've seen the blog post about the fluentd sidecar, but what are people doing about the pod metrics?
We are working towards supporting Container Insights with EKS/Fargate, but in the meantime we have also documented how to configure Prometheus/Grafana to monitor EKS/Fargate in this blog post.
The doc looks interesting, but we have a Fargate-only EKS cluster which we would like to monitor. I don't really want to spin up EC2 worker nodes to monitor Fargate nodes. What can we do to monitor a Fargate-only cluster?
@smailc thanks for the heads up. This blog needs an update because the part "The cluster will need a worker node backed by EC2 since Prometheus requires a persistent volume to store data and EKS on Fargate currently doesn’t support persistent storage" is no longer true. We have recently announced EKS/Fargate support for EFS, which can be used as an alternative "local storage" for Prometheus (or what Prometheus considers to be "local storage"). A proper architecture based on "local storage" backup and/or Prometheus remote storage should probably be considered, regardless of the storage type being used.
OK, great, thanks, I'll give that a try! I'll also keep a lookout for the blog updates. Thanks
I also tried that, but then Prometheus becomes unhealthy due to consistency issues, ref.: https://github.com/prometheus/prometheus/issues/5617
Thanks for the heads up. Yes, NFS support in Prometheus is a bit vague....
I am in the middle of deciding whether to pursue Fargate or switch to EKS node groups. For the future proposed solution, would we need to maintain our own VMs, or would it be an out-of-the-box solution that automatically integrates with CloudWatch?
The workaround mentions a log forwarder. Log forwarding, as we saw, was solved in a recent update; for us the remaining missing component is metrics: how many nodes there are and what their CPU/memory usage is. Being able to easily get that information into CloudWatch would be ideal.
If your pods are instrumented to export Prometheus metrics, then you can use "Container Insights Prometheus Metrics Monitoring". According to the CloudWatch docs:
For Amazon ECS and Amazon EKS clusters, both the EC2 and Fargate launch types are supported.
You can adopt Prometheus as an open-source and open-standard method to ingest custom metrics in CloudWatch. The CloudWatch agent with Prometheus support discovers and collects Prometheus metrics to monitor, troubleshoot, and alarm on application performance degradation and failures faster.
You can add container names to a config map (prometheus-cwagentconfig) that the CloudWatch agent uses. It scrapes some default metrics out of the box.
Are there any recent updates from AWS on metrics monitoring for EKS Fargate using CloudWatch Container Insights? What is the best option AWS can suggest in that case?
Any update on this, please? When can we expect Container Insights support for EKS/Fargate? We mainly want to monitor Fargate pod health (CPU/memory/disk/network) and be able to set alarms on it.
Big +1 to this addition.
Any update on it?
+1 we also need this feature
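Here is an example based on the EKS Fargate deployment manifest mentioned by @sandan.
This will scrape the cAdvisor metrics and push them to CloudWatch.
Be aware of the optional metric_relabel_configs, which, as the diff shows, drop all metrics from the amazon-cloudwatch and kube-system namespaces and strip the generated suffix from the pod label: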
@@ -170,6 +170,23 @@
             "metric_selectors": [
               "^jvm_memory_pool_bytes_used$"
             ]
+          },
+          {
+            "source_labels": ["job"],
+            "label_matcher": "^kubernetes-nodes-cadvisor$",
+            "dimensions": [["ClusterName","namespace","pod","container"]],
+            "metric_selectors": [
+              "^container_cpu_usage_seconds_total$",
+              "^container_memory_usage_bytes$"
+            ]
+          },
+          {
+            "source_labels": ["job"],
+            "label_matcher": "^kubernetes-nodes-cadvisor$",
+            "dimensions": [["ClusterName","namespace","pod"]],
+            "metric_selectors": [
+              "^container_network_(receive|transmit)_bytes_total$"
+            ]
           }
         ]
       }
@@ -193,6 +210,32 @@
       scrape_interval: 1m
       scrape_timeout: 10s
     scrape_configs:
+    - job_name: kubernetes-nodes-cadvisor
+      sample_limit: 10000
+      kubernetes_sd_configs:
+      - role: node
+      relabel_configs:
+      - replacement: kubernetes.default.svc:443
+        target_label: __address__
+      - source_labels: [__meta_kubernetes_node_name]
+        regex: (.+)
+        replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
+        target_label: __metrics_path__
+      - action: labelmap
+        regex: __meta_kubernetes_node_label_(.+)
+      metric_relabel_configs:
+      - source_labels: [namespace]
+        action: drop
+        regex: ^(amazon-cloudwatch|kube-system)$
+      - source_labels: [pod]
+        regex: (.+)(-\w+-\w+)|(.+)(-\w+)|(.+)
+        replacement: ${1}${3}${5}
+        target_label: pod
+      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
+      scheme: https
+      tls_config:
+        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
+        insecure_skip_verify: true
     - job_name: 'kubernetes-pod-appmesh-envoy'
       sample_limit: 10000
       metrics_path: /stats/prometheus
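To see what the pod relabel rule in the diff above does to typical pod names, here is a minimal Python sketch. It assumes that Prometheus fully anchors the relabel regex and expands unmatched capture groups to an empty string. The helper `relabel_pod` and the sample pod names are hypothetical; they only illustrate the rule and are not part of the manifest.

```python
import re

# Same pattern as the `regex:` line in the diff; fullmatch() mimics the
# anchoring that Prometheus applies to relabel regexes.
POD_SUFFIX_RE = re.compile(r"(.+)(-\w+-\w+)|(.+)(-\w+)|(.+)")


def relabel_pod(pod: str) -> str:
    """Emulate `replacement: ${1}${3}${5}` for a single pod label value."""
    match = POD_SUFFIX_RE.fullmatch(pod)
    if match is None:
        # No match (e.g. an empty label): the label is left unchanged.
        return pod
    # Only one alternative matches, so at most one of these groups is set.
    return "".join(match.group(i) or "" for i in (1, 3, 5))


if __name__ == "__main__":
    for name in ("web-5d4f6b7c8-abcde", "my-app-5d4f6b7c8-abcde", "db-0", "standalone"):
        print(name, "->", relabel_pod(name))
```

Note that a bare pod whose own name contains a hyphen (for example `my-pod`) is shortened as well, so the rule fits best when pods are created by Deployments, StatefulSets, or similar controllers.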
Any recent update on this? Now that native logging support for EKS on Fargate is solved properly, this is the last hurdle for one of my major customers to adopt EKS Fargate!
@visit1985 @sandan I was able to configure "prometheus-eks-fargate.yaml" (CloudWatch agent: https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/service/cwagent-prometheus/prometheus-k8s.yaml).
However, those metrics show up under the Metrics section in CloudWatch, whereas we expected them in the Container Insights section (performance monitoring), which has the actual visualization/monitoring. I have seen all the expected information in Container Insights, but only when the nodes/pods run on EC2.
We should see everything in the "Container Insights" tab of CloudWatch instead of the "Metrics" tab for EKS Fargate.
We also prefer Fargate and don’t want to break the whole purpose of PaaS which means we should not have to manage servers (IaaS).
It isn't CloudWatch Container Insights, but I managed to get a Fargate-only EKS cluster to ship to Amazon Managed Service for Prometheus (AMP) in the last update to our AWS EKS Quick Start using CDK. I have a short retention (1hr) Prometheus like in the blog post above - but changed it to not use an EBS persistent volume but instead use an ephemeral emptyDir so it works on Fargate. This should be safe as it immediately ships to the AMP managed service outside of the cluster so any lost metrics should be minimal. https://github.com/aws-quickstart/quickstart-eks-cdk-python/blob/main/cluster-bootstrap/eks_cluster.py#L1487
The easiest option seems to be deploying cwagent-prometheus to scrape Prometheus metrics from the Fargate cAdvisor endpoint (as mentioned by @visit1985) and stream them to CloudWatch. Once the metrics are available in CloudWatch, it is easy to create custom dashboards from them. Deploying kube-state-metrics provides additional metrics such as kube_pod_info, kube_pod_container_status_restarts_total, kube_pod_container_resource_requests, etc. If control plane metrics are required, they can be configured in cwagent too. I think that is sufficient to construct some basic, useful dashboards until out-of-the-box Container Insights support is available. The Prometheus/Grafana approach is good too, but it requires managing many things (Prometheus stack, storage, external access, etc.) and a worker node is required to run them. It is much easier to achieve basic monitoring through the cwagent-prometheus approach.
After struggling on this subject, we have adopted another alternative based on Prometheus Federation.
We own a main Prometheus instance, deployed on another infrastructure. This instance is able to scrape CloudWatch logs (RDS, S3, MQ, etc), but EKS metrics are missing. Our original need was to also scrape the EKS/Pods/Containers metrics from CloudWatch. However, pushing EKS Fargate metrics to CloudWatch is a real pain.
Instead of consuming EKS/Pods/Containers metrics from CloudWatch, we have decided to:
- Deploy 1 pod of prometheus-server and 1 pod of prometheus-kube-state-metrics on each EKS cluster (using the prometheus-community/prometheus Helm charts). The other Prometheus services are not required. This server is responsible for gathering metrics from cAdvisor and prometheus-kube-state-metrics.
- Expose the prometheus-server service to make it available to the main Prometheus instance.

This main Prometheus now gets all the metrics from the EKS cluster.
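For readers who have not used federation before, the main instance pulls from each child through the child's `/federate` endpoint, passing one or more `match[]` selectors. The following Python sketch only builds the request that such a scrape would make, including an optional basic-auth header. The host name, the selector, and the credentials are placeholders, and the helper names are hypothetical.

```python
import base64
from urllib.parse import urlencode


def federate_url(base_url: str, selectors: list[str]) -> str:
    """Build the /federate URL the main Prometheus would scrape on a child."""
    # Each selector is sent as a separate `match[]` query parameter.
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


def basic_auth_header(user: str, password: str) -> dict[str, str]:
    """Return the Authorization header for HTTP basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


if __name__ == "__main__":
    # Placeholder endpoint and selector: every series that has a `job` label.
    print(federate_url("http://child-prometheus.example.internal", ['{job=~".+"}']))
    print(basic_auth_header("user", "pass"))
```

In a real setup the same information lives in a scrape job on the main Prometheus (metrics path `/federate` plus `match[]` params); the sketch is only meant to make the resulting HTTP request visible.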
The Helm chart prometheus-community/prometheus deploys the whole set of basic services. This can be overridden with this configuration:
alertmanager:
  enabled: false
pushgateway:
  enabled: false
[...]
Deploying a stateful prometheus-server on EKS Fargate is not trivial. We have decided to deploy a stateless Prometheus server. Restarting the server means losing data that has not been federated yet.
Using the Helm chart, this can be achieved by overriding this value:
server:
  persistentVolume:
    enabled: false
Deploying prometheus-server on EKS Fargate without specifying the required resources always leads to a 0.25vCPU,0.5GB instance, which makes the server fail after a few minutes (the node gets NotReady, for some reason). Overriding the Helm values with this configuration is sufficient for us:
server:
  resources:
    limits:
      cpu: 0.5
      memory: 700Mi
    requests:
      cpu: 0.5
      memory: 700Mi
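The jump from the default size to a larger one follows from how Fargate sizes pods based on their requests. The sketch below is a rough illustration under several assumptions: the size table and the 256 MB per-pod overhead reflect the EKS Fargate documentation as understood at the time of writing and should be re-checked, MB/GB are treated as MiB/GiB for simplicity, the "smallest vCPU tier that fits" rule is a simplification, and `fargate_size` is a hypothetical helper.

```python
# Assumed Fargate pod sizes: (vCPU, allowed memory values in MiB).
# Larger tiers exist but are left out of this sketch.
FARGATE_SIZES = [
    (0.25, [512, 1024, 2048]),
    (0.5, [1024 * gib for gib in range(1, 5)]),
    (1.0, [1024 * gib for gib in range(2, 9)]),
    (2.0, [1024 * gib for gib in range(4, 17)]),
    (4.0, [1024 * gib for gib in range(8, 31)]),
]

# Assumed memory added per pod for kubelet, kube-proxy and containerd.
KUBERNETES_OVERHEAD_MIB = 256


def fargate_size(cpu_request: float, memory_request_mib: int) -> tuple[float, int]:
    """Return the smallest (vCPU, MiB) combination that covers the requests."""
    needed_mib = memory_request_mib + KUBERNETES_OVERHEAD_MIB
    for vcpu, memory_options in FARGATE_SIZES:
        if vcpu < cpu_request:
            continue
        for memory_mib in memory_options:
            if memory_mib >= needed_mib:
                return vcpu, memory_mib
    raise ValueError("requests exceed the sizes listed in this sketch")


if __name__ == "__main__":
    print(fargate_size(0, 0))      # no requests set
    print(fargate_size(0.5, 700))  # the Helm values shown above
```

Under these assumptions, the values above (0.5 CPU, 700Mi) would land on a 0.5 vCPU / 1 GB pod, while a pod without requests gets the smallest size mentioned in the comment.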
This data federation should be secured, for instance with:
- basic_auth on the child prometheus-server (note that you must also configure the liveness/readiness checks to use an Authorization header, otherwise the pod will restart over and over again)

Today, Amazon CloudWatch Container Insights adds metric collection support for your applications running on Amazon Elastic Kubernetes Service (EKS) with AWS Fargate using AWS Distro for OpenTelemetry (ADOT). ADOT is a secure, AWS-supported distribution of the OpenTelemetry project. Customers can now easily collect EKS Fargate metrics, such as CPU, memory, disk, and network, and analyze them along with other container metrics in Amazon CloudWatch. This helps customers observe the performance and resource utilization of their applications directly in the CloudWatch Container Insights console.
https://aws.amazon.com/blogs/containers/introducing-amazon-cloudwatch-container-insights-for-amazon-eks-fargate-using-aws-distro-for-opentelemetry/
https://aws.amazon.com/about-aws/whats-new/2022/02/amazon-cloudwatch-eks-fargate-distro-opentelemetry/
@vaibhavkhunger Thanks to you and the team for adding this functionality!
I would really like to see support added for pod_number_of_container_restarts as it is one of the most important metrics to monitor for crash looping pods.
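Until such a metric is available, one stopgap mentioned earlier in this thread is kube-state-metrics, which exposes kube_pod_container_status_restarts_total. The Python sketch below compares two scrapes of that counter and reports containers whose restart count went up. It assumes the scrapes are available in the Prometheus text format; the parser is deliberately simplified (no escaped quotes in label values) and the function names are hypothetical.

```python
import re

METRIC = "kube_pod_container_status_restarts_total"
SAMPLE_RE = re.compile(r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)")
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_restarts(text: str) -> dict[tuple[str, str, str], float]:
    """Map (namespace, pod, container) to the restart counter value."""
    counts = {}
    for line in text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match is None or match.group("name") != METRIC:
            continue  # comments, blank lines and other metrics are skipped
        labels = dict(LABEL_RE.findall(match.group("labels")))
        key = (labels.get("namespace", ""), labels.get("pod", ""), labels.get("container", ""))
        counts[key] = float(match.group("value"))
    return counts


def restarted_since(previous: dict, current: dict) -> dict:
    """Return the containers whose restart count increased, with the delta."""
    return {
        key: value - previous.get(key, 0.0)
        for key, value in current.items()
        if value > previous.get(key, 0.0)
    }


if __name__ == "__main__":
    before = METRIC + '{namespace="default",pod="web-0",container="app"} 1\n'
    after = METRIC + '{namespace="default",pod="web-0",container="app"} 4\n'
    print(restarted_since(parse_restarts(before), parse_restarts(after)))
```

A steadily growing delta between scrapes is a reasonable hint of a crash loop, and the same comparison could feed an alarm once the counter is shipped to CloudWatch.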
Tell us about your request Making container insights available to EKS Fargate pods
Which service(s) is this request for? EKS Fargate
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? I have had multiple requests from customers to have a way of viewing metrics and logs for their Fargate EKS pods.
Are you currently working around this issue? By adding a log forwarder sidecar container (e.g. fluentd) that forwards logs to the required destination (i.e. CloudWatch). For that, one needs to modify the existing pod manifest and add a log forwarder container.