aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[EKS] [Fargate][Container Insights]: Make container insights available to EKS Fargate clusters #920

Closed: themish95 closed this issue 2 years ago

themish95 commented 4 years ago

Community Note

Tell us about your request Making container insights available to EKS Fargate pods

Which service(s) is this request for? EKS Fargate

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? I have had multiple requests from customers for a way of viewing metrics and logs for their EKS Fargate pods.

Are you currently working around this issue? Adding a log-forwarder sidecar container (e.g. fluentd) that forwards logs to the required destination (i.e. CloudWatch). For that, one needs to modify the existing pod manifest to add the log-forwarder container.
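A rough sketch of what such a sidecar can look like, assuming a Fluent Bit forwarder next to a hypothetical application container (image names, the log path, and the forwarder configuration are placeholders, not part of this issue):

apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-forwarder          # hypothetical name
spec:
  containers:
  - name: app                           # the existing application container
    image: my-app:latest                # placeholder image
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app           # assumption: the app writes its logs here
  - name: log-forwarder                 # the sidecar added by the workaround
    image: public.ecr.aws/aws-observability/aws-for-fluent-bit:stable
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
      readOnly: true
    # Fluent Bit is configured (e.g. via a mounted ConfigMap) to tail
    # /var/log/app/* and forward to CloudWatch via its cloudwatch_logs output.
  volumes:
  - name: app-logs
    emptyDir: {}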

Attachments If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)

casret commented 4 years ago

I've seen the blog post about the fluentd sidecar, but what are people doing about the pod metrics?

mreferre commented 4 years ago

I've seen the blog post about the fluentd sidecar, but what are people doing about the pod metrics?

We are working towards supporting Container Insights with EKS/Fargate but in the meanwhile we have also documented how to configure Prometheus/Grafana to monitor EKS/Fargate in this blog post.

smailc commented 4 years ago

I've seen the blog post about the fluentd sidecar, but what are people doing about the pod metrics?

We are working towards supporting Container Insights with EKS/Fargate but in the meanwhile we have also documented how to configure Prometheus/Grafana to monitor EKS/Fargate in this blog post.

The doc looks interesting, but we have a Fargate-only EKS cluster which we would like to monitor. I don't really want to spin up EC2 worker nodes to monitor Fargate nodes. What can we do to monitor a Fargate-only cluster?

mreferre commented 4 years ago

@smailc thanks for the heads up. This blog needs an update because the part "The cluster will need a worker node backed by EC2 since Prometheus requires a persistent volume to store data and EKS on Fargate currently doesn’t support persistent storage" is no longer true. We have recently announced EKS/Fargate support for EFS, which can be used as an alternative "local storage" for Prometheus (or what Prometheus considers "local storage"). A proper architecture based on "local storage" backups and/or Prometheus remote storage should probably be considered (regardless of the storage type being used).
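For reference, a minimal sketch of wiring that up with the EFS CSI driver, using static provisioning, which is the mode available on Fargate (the file system ID, names, and sizes are placeholders):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc                          # hypothetical storage class name
provisioner: efs.csi.aws.com
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-efs-pv
spec:
  capacity:
    storage: 20Gi                       # nominal; EFS is elastic
  accessModes: ["ReadWriteMany"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-0123456789abcdef0  # placeholder: your EFS file system ID
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data                 # point the Prometheus server volume at this claim
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: efs-sc
  resources:
    requests:
      storage: 20Gi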

smailc commented 4 years ago

@smailc thanks for the heads up. This blog needs an update because the part "The cluster will need a worker node backed by EC2 since Prometheus requires a persistent volume to store data and EKS on Fargate currently doesn’t support persistent storage" is no longer true. We have recently announced EKS/Fargate support for EFS, which can be used as an alternative "local storage" for Prometheus (or what Prometheus considers "local storage"). A proper architecture based on "local storage" backups and/or Prometheus remote storage should probably be considered (regardless of the storage type being used).

Ok great, thanks, I'll give that a try! I'll also keep a lookout for the blog updates. Thanks

dennis-menge commented 3 years ago

@smailc thanks for the heads up. This blog needs an update because the part "The cluster will need a worker node backed by EC2 since Prometheus requires a persistent volume to store data and EKS on Fargate currently doesn’t support persistent storage" is no longer true. We have recently announced EKS/Fargate support for EFS, which can be used as an alternative "local storage" for Prometheus (or what Prometheus considers "local storage"). A proper architecture based on "local storage" backups and/or Prometheus remote storage should probably be considered (regardless of the storage type being used).

Ok great, thanks, I'll give that a try! I'll also keep a lookout for the blog updates. Thanks

I also tried that but then Prometheus gets unhealthy due to consistency issues, ref.: https://github.com/prometheus/prometheus/issues/5617

mreferre commented 3 years ago

Thanks for the heads up. Yes, NFS support in Prometheus is a bit vague... Oh, and I have just seen they added a note specific to this (as of 12/8/2020 this is what it says):

[image: screenshot of the Prometheus documentation note on local storage and NFS]

sukrit007 commented 3 years ago

I've seen the blog post about the fluentd sidecar, but what are people doing about the pod metrics?

We are working towards supporting Container Insights with EKS/Fargate but in the meanwhile we have also documented how to configure Prometheus/Grafana to monitor EKS/Fargate in this blog post.

I am in the middle of a decision-making process on whether to pursue Fargate or switch to EKS node groups. For the proposed future solution, would there be a need to maintain our own VMs, or would it be an out-of-the-box solution that automatically integrates with CloudWatch?

brgaulin commented 3 years ago

The workaround mentions a log forwarder. Log forwarding, as far as we can see, was solved in a recent update; for us the remaining missing component is the metrics: how many nodes are there, and what is their CPU/memory usage? Being able to easily get that information into CloudWatch would be ideal.

sandan commented 3 years ago

If your pods are instrumented to export prometheus metrics, then you can use "Container Insights Prometheus Metrics Monitoring". According to the Cloudwatch docs:

For Amazon ECS and Amazon EKS clusters, both the EC2 and Fargate launch types are supported.

You can adopt Prometheus as an open-source and open-standard method to ingest custom metrics in CloudWatch. The CloudWatch agent with Prometheus support discovers and collects Prometheus metrics to monitor, troubleshoot, and alarm on application performance degradation and failures faster.

You can add container names to a config map (prometheus-cwagentconfig) that the cloudwatch agent uses. It has some default metrics that it scrapes for.
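For example, a metric declaration for a hypothetical application job, added to the emf_processor section of prometheus-cwagentconfig, could look roughly like this (the job name, dimensions, and metric selector are assumptions):

{
  "source_labels": ["job"],
  "label_matcher": "^my-app$",
  "dimensions": [["ClusterName","namespace","pod"]],
  "metric_selectors": [
    "^http_requests_total$"
  ]
}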

vijaynikam2211 commented 3 years ago

Are there any recent updates from AWS on metrics monitoring for EKS Fargate using CloudWatch Container Insights? What's the best option AWS can suggest in that case?

vijaynikam2211 commented 3 years ago

I've seen the blog post about the fluentd sidecar, but what are people doing about the pod metrics?

We are working towards supporting Container Insights with EKS/Fargate but in the meanwhile we have also documented how to configure Prometheus/Grafana to monitor EKS/Fargate in this blog post.

Any update on this, please? When can we expect Container Insights support for EKS/Fargate? We mainly want to monitor Fargate pod health (CPU/memory/disk/network) and be able to set alarms on it.

pvieito commented 3 years ago

Big +1 to this addition.

Any update on it?

lennartt commented 3 years ago

+1 we also need this feature

visit1985 commented 3 years ago

If your pods are instrumented to export prometheus metrics, then you can use "Container Insights Prometheus Metrics Monitoring". According to the Cloudwatch docs:

For Amazon ECS and Amazon EKS clusters, both the EC2 and Fargate launch types are supported. You can adopt Prometheus as an open-source and open-standard method to ingest custom metrics in CloudWatch. The CloudWatch agent with Prometheus support discovers and collects Prometheus metrics to monitor, troubleshoot, and alarm on application performance degradation and failures faster.

You can add container names to a config map (prometheus-cwagentconfig) that the cloudwatch agent uses. It has some default metrics that it scrapes for.

Here is an example based on the EKS Fargate deployment manifest mentioned by @sandan.

This will scrape the cAdvisor metrics and push them to CloudWatch.

Be aware of the optional metric_relabel_configs, which drop metrics from the amazon-cloudwatch and kube-system namespaces and rewrite the pod label to strip the generated ReplicaSet/pod suffix:

@@ -170,6 +170,23 @@
                   "metric_selectors": [
                     "^jvm_memory_pool_bytes_used$"
                   ]
+                },
+                {
+                  "source_labels": ["job"],
+                  "label_matcher": "^kubernetes-nodes-cadvisor$",
+                  "dimensions": [["ClusterName","namespace","pod","container"]],
+                  "metric_selectors": [
+                    "^container_cpu_usage_seconds_total$",
+                    "^container_memory_usage_bytes$"
+                  ]
+                },
+                {
+                  "source_labels": ["job"],
+                  "label_matcher": "^kubernetes-nodes-cadvisor$",
+                  "dimensions": [["ClusterName","namespace","pod"]],
+                  "metric_selectors": [
+                    "^container_network_(receive|transmit)_bytes_total$"
+                  ]
                 }
               ]
             }
@@ -193,6 +210,32 @@
       scrape_interval: 1m
       scrape_timeout: 10s
     scrape_configs:
+    - job_name: kubernetes-nodes-cadvisor
+      sample_limit: 10000
+      kubernetes_sd_configs:
+      - role: node
+      relabel_configs:
+      - replacement: kubernetes.default.svc:443
+        target_label: __address__
+      - source_labels: [__meta_kubernetes_node_name]
+        regex: (.+)
+        replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
+        target_label: __metrics_path__
+      - action: labelmap
+        regex: __meta_kubernetes_node_label_(.+)
+      metric_relabel_configs:
+      - source_labels: [namespace]
+        action: drop
+        regex: ^(amazon-cloudwatch|kube-system)$
+      - source_labels: [pod]
+        regex: (.+)(-\w+-\w+)|(.+)(-\w+)|(.+)
+        replacement: ${1}${3}${5}
+        target_label: pod
+      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
+      scheme: https
+      tls_config:
+        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
+        insecure_skip_verify: true
     - job_name: 'kubernetes-pod-appmesh-envoy'
       sample_limit: 10000
       metrics_path: /stats/prometheus
JasperW01 commented 3 years ago

Any recent update on this? Now that native EKS + Fargate logging support has been solved properly, this is the last hurdle for one of my major customers to adopt EKS Fargate!

vijaynikam2211 commented 3 years ago

@visit1985 @sandan I was able to configure the CloudWatch agent using "prometheus-eks-fargate.yaml" (https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/service/cwagent-prometheus/prometheus-k8s.yaml).

However, those metrics appear under the Metrics section in CloudWatch, while our expectation is for them to appear in the Container Insights section (performance monitoring), which has the actual visualization/monitoring. Also, I have seen all of the expected info in Container Insights, but only when the nodes/pods are running on EC2.

For EKS Fargate, we should see everything in the "Container Insights" tab of CloudWatch instead of only in the "Metrics" tab.

pdoshi2265 commented 3 years ago

We also prefer Fargate and don't want to defeat the whole purpose of PaaS, which is that we should not have to manage servers (IaaS).

jasonumiker commented 2 years ago

It isn't CloudWatch Container Insights, but I managed to get a Fargate-only EKS cluster to ship to Amazon Managed Service for Prometheus (AMP) in the last update to our AWS EKS Quick Start using CDK. I have a short-retention (1hr) Prometheus like in the blog post above, but changed it to not use an EBS persistent volume and instead use an ephemeral emptyDir so it works on Fargate. This should be safe, as it immediately ships to the AMP managed service outside of the cluster, so any lost metrics should be minimal. https://github.com/aws-quickstart/quickstart-eks-cdk-python/blob/main/cluster-bootstrap/eks_cluster.py#L1487
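Roughly, with the prometheus-community/prometheus Helm chart, the relevant overrides look something like this (untested sketch; the AMP workspace URL, region, and IRSA role ARN are placeholders, and the sigv4 remote_write assumes a reasonably recent Prometheus version):

server:
  retention: 1h                     # short retention; AMP is the durable store
  persistentVolume:
    enabled: false                  # fall back to emptyDir so it schedules on Fargate
  remoteWrite:
    - url: https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-EXAMPLE/api/v1/remote_write
      sigv4:
        region: us-east-1
serviceAccounts:
  server:
    annotations:
      # placeholder IRSA role with permission to write to the AMP workspace
      eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/amp-ingest-role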

kingsleykumar commented 2 years ago

The easiest option seems to be deploying cwagent-prometheus to scrape Prometheus metrics from the Fargate cAdvisor (as mentioned by @visit1985) and streaming them to CloudWatch. Once the metrics are available in CloudWatch, it is easy to create custom dashboards out of them. Deploying kube-state-metrics provides additional metrics such as kube_pod_info, kube_pod_container_status_restarts_total, kube_pod_container_resource_requests, etc. If control-plane metrics are required, they can be configured in cwagent too. I think that is sufficient to construct some basic, useful dashboards until out-of-the-box Container Insights support is available.

The Prometheus/Grafana approach is good too, but it requires managing many things (the Prometheus stack, storage, external access, etc.) and a worker node is required to run those. It is much easier to achieve basic monitoring through the cwagent-prometheus approach.
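If useful, a minimal scrape job for kube-state-metrics in the cwagent Prometheus config could look like this (the namespace and service name assume the chart defaults and may differ in your cluster):

- job_name: kube-state-metrics
  sample_limit: 10000
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  # keep only the kube-state-metrics service endpoints
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name]
    action: keep
    regex: kube-system;kube-state-metrics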

looorent commented 2 years ago

After struggling with this subject, we have adopted another alternative based on Prometheus federation.

Original problem

We own a main Prometheus instance, deployed on separate infrastructure. This instance is able to scrape CloudWatch metrics (RDS, S3, MQ, etc.), but EKS metrics are missing. Our original need was to also scrape the EKS/Pods/Containers metrics from CloudWatch. However, pushing EKS Fargate metrics to CloudWatch is a real pain.

Solution

Instead of consuming EKS/Pods/Containers metrics from CloudWatch, we have decided to:

- deploy a lightweight Prometheus server inside the EKS Fargate cluster (using the prometheus-community/prometheus Helm chart), and
- have our main Prometheus instance federate the cluster metrics from that server's /federate endpoint.

This main Prometheus now gets all the metrics from the EKS cluster.
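On the main Prometheus side, the federation job is standard configuration, along these lines (the target address and match[] selector are placeholders for our setup):

scrape_configs:
- job_name: eks-fargate-federation
  honor_labels: true
  metrics_path: /federate
  params:
    'match[]':
    - '{job!=""}'                            # pull everything the in-cluster server scraped
  static_configs:
  - targets:
    - prometheus.eks.example.internal:9090   # placeholder: however the in-cluster server is exposed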

Disable useless services

The Helm chart prometheus-community/prometheus deploys the whole set of basic services. This can be overridden with this configuration:

alertmanager:
  enabled: false

pushgateway:
  enabled: false

[...]

Storage

Deploying a stateful prometheus-server on EKS Fargate is not trivial. We have decided to deploy a stateless Prometheus server. Restarting the server means losing any data that has not been federated yet.

Using the Helm chart, this can be achieved by overriding this value:

server:
  persistentVolume:
    enabled: false

Resources

Deploying prometheus-server on EKS Fargate without specifying the required resources always leads to a 0.25 vCPU / 0.5 GB instance, which makes the server fail after a few minutes (the node becomes NotReady, for some reason). Overriding the Helm values with this configuration is sufficient for us:

server:
  resources:
    limits:
      cpu: 0.5
      memory: 700Mi
    requests:
      cpu: 0.5
      memory: 700Mi

Security

This data federation should be secured, for instance with:

vaibhavkhunger commented 2 years ago

Today, Amazon CloudWatch Container Insights adds metric collection support for your applications running on Amazon Elastic Kubernetes Service (EKS) with AWS Fargate using AWS Distro for OpenTelemetry (ADOT). ADOT is a secure, AWS-supported distribution of the OpenTelemetry project. Customers can now easily collect EKS Fargate metrics, such as CPU, memory, disk, and network, and analyze them along with other container metrics in Amazon CloudWatch. This helps customers observe the performance and resource utilization of their applications directly in the CloudWatch Container Insights console.

https://aws.amazon.com/blogs/containers/introducing-amazon-cloudwatch-container-insights-for-amazon-eks-fargate-using-aws-distro-for-opentelemetry/ https://aws.amazon.com/about-aws/whats-new/2022/02/amazon-cloudwatch-eks-fargate-distro-opentelemetry/
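At a high level, the ADOT Collector runs inside the cluster and pairs a metrics receiver (here sketched as a Prometheus-style scrape of the kubelet cAdvisor endpoints) with the awsemf exporter that writes Container Insights performance log events. A heavily trimmed, illustrative sketch, not the manifest from the blog (receiver details, region, and log group are assumptions; see the blog for the full manifest):

receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: kubelets-cadvisor-metrics
        scheme: https
        metrics_path: /metrics/cadvisor
        kubernetes_sd_configs:
        - role: node
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true

exporters:
  awsemf:
    namespace: ContainerInsights
    log_group_name: '/aws/containerinsights/{ClusterName}/performance'
    region: us-east-1                        # placeholder region

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [awsemf]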

carsondoesbusiness commented 2 years ago

@vaibhavkhunger Thanks to you and the team for adding this functionality!

I would really like to see support added for pod_number_of_container_restarts as it is one of the most important metrics to monitor for crash looping pods.