aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[ECS] [Add Container-level Metrics]: Add Container-Level CPU & Memory metrics #885

Open rehevkor5 opened 4 years ago

rehevkor5 commented 4 years ago


Tell us about your request
CloudWatch metrics should include container-level metrics for CPU and memory use (for each replica). Ideally this would be queryable by Service, and visualized with one graph line for each container+replica.

Which service(s) is this request for? ECS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
When tuning CPU & memory requests, or diagnosing issues with containers being killed for exceeding their memory limits, it's necessary to determine which of the several containers in a Task is having issues by understanding the CPU & memory use of each container.

Currently, it's impossible to tell how much CPU & memory a specific container is using, so it's impossible to tell which container in a Task might be going over its memory limit, or which container might benefit from more CPU. CloudWatch only shows statistics generated at the Task+replica level, and only queryable as a summary by Service. The summary metric is misleading: it might show that only 50% of memory is being used (max per instant across replicas) when in actuality one container is using >100% of its memory while another is using <10%.

Are you currently working around this issue?
Trial by fire and experimentation: launch the service, observe if possible (including SSHing into the specific EC2 instance to look at the output of docker ps and docker stats, which is not very reliable since the container is often already killed by the time I've logged in), make a guessed adjustment to the ECS configuration, launch it again, and repeat until things happen to work.

Additional context None.

Attachments None.

sharanyad commented 4 years ago

@rehevkor5 ECS container-level metrics are available as part of the Container Insights feature: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-ECS.html. Drilling further into task-level metrics shows the CPU and memory consumed by individual containers. Is there anything else that you're looking for?
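
For reference, a minimal boto3 sketch (not from this thread; it assumes Container Insights is enabled on the cluster and that AWS credentials/region are configured) that lists which dimension combinations the ECS/ContainerInsights namespace actually publishes for a metric such as CpuUtilized:

# Sketch: list the dimension combinations Container Insights publishes for an ECS
# metric, to check what granularity (cluster/service/task definition) is available.
import boto3

cloudwatch = boto3.client("cloudwatch")

paginator = cloudwatch.get_paginator("list_metrics")
for page in paginator.paginate(Namespace="ECS/ContainerInsights", MetricName="CpuUtilized"):
    for metric in page["Metrics"]:
        dims = {d["Name"]: d["Value"] for d in metric["Dimensions"]}
        print(dims)  # e.g. {'ClusterName': ..., 'ServiceName': ...}; no ContainerName dimension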

tomelliff commented 4 years ago

@sharanyad none of those seem to have a dimension based on task like EKS Container Insights has around pods (e.g. pod_cpu_utilization in https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-EKS.html).

I do notice that I can see a list of containers in the "Container performance" section of the CloudWatch Container Insights performance monitoring view for ECS tasks. Unfortunately it's hard to link that back to a specific task (there's no task ID that I can see there), and it only shows the average CPU and memory usage across the duration of that task instead of a graph of resource usage over time.

Am I missing something there? It feels very close to showing what I need but not quite there.

We're considering using Metricbeat to ship the data from docker stats (https://www.elastic.co/guide/en/beats/metricbeat/current/exported-fields-docker.html) but would prefer a more out-of-the-box experience.

raags commented 4 years ago

ECS metrics out of the box are not there yet, but detailed metrics are available via the ECS metadata stats URL (the ECS_CONTAINER_METADATA_URI env var), which provides stats for Fargate tasks as well. We built a sidecar to export these metrics: https://github.com/Spaced-Out/ecs-container-exporter
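
A rough sketch of reading that endpoint directly (assuming the v3 metadata endpoint; field names follow the Docker stats API and some, such as system_cpu_usage, may be absent in certain environments, so treat this as illustrative rather than exhaustive):

import json
import os
import urllib.request

# The ECS agent injects this env var into every container (v3 metadata endpoint).
metadata_uri = os.environ["ECS_CONTAINER_METADATA_URI"]

# /task/stats returns Docker-stats-style JSON for every container in the task,
# keyed by Docker container ID.
with urllib.request.urlopen(f"{metadata_uri}/task/stats") as resp:
    task_stats = json.load(resp)

for container_id, stats in task_stats.items():
    if not stats:
        continue
    mem_mib = stats.get("memory_stats", {}).get("usage", 0) / (1024 * 1024)

    # CPU percentage computed the way `docker stats` does it.
    cpu = stats.get("cpu_stats", {})
    precpu = stats.get("precpu_stats", {})
    cpu_delta = cpu.get("cpu_usage", {}).get("total_usage", 0) - precpu.get("cpu_usage", {}).get("total_usage", 0)
    sys_delta = cpu.get("system_cpu_usage", 0) - precpu.get("system_cpu_usage", 0)
    online_cpus = cpu.get("online_cpus", 1)
    cpu_pct = (cpu_delta / sys_delta) * online_cpus * 100 if sys_delta > 0 else 0.0

    print(container_id[:12], f"mem={mem_mib:.1f} MiB cpu={cpu_pct:.1f}%")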

talawahtech commented 3 years ago

A few additional issues with the current container-level metrics available from Container Insights:

1) No CPU data is reported at the task or container level unless you specify a CPU reservation/limit for the task/container. This makes things a lot less useful for those who don't want/need to set CPU reservations/limits. It would be nice if a sensible default were used for calculating the CPU usage percentage when no reservation is specified, e.g. either 1024 (which would match docker stats output) or vCPUs x 1024 (which should match total usage); see the short sketch below.

2) The display of the container-level data is inconsistent for me. It doesn't automatically show up when I view the ECS Tasks dashboard; it seems like I have to wait 30-60s for it to appear. Also, there is no indication of what period of time the average memory and CPU usage is captured over.

Edit: Ignore the second issue, it seems to be environment specific, so it is probably just a software bug.
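
To make the suggested default in point 1 concrete, a tiny sketch (assumptions: CpuUtilized is expressed in CPU units, and 1024 units correspond to one vCPU, which is the standard ECS convention):

# Two possible denominators when no CPU reservation is set: 1024 units (one vCPU,
# matching relative docker stats output) or vCPUs * 1024 (total host/task capacity).
def cpu_percent(cpu_utilized_units, cpu_reserved_units=None, vcpus=2):
    denominator = cpu_reserved_units or vcpus * 1024  # fall back to total capacity
    return 100.0 * cpu_utilized_units / denominator

print(cpu_percent(512, cpu_reserved_units=1024))  # 50.0 with an explicit reservation
print(cpu_percent(512))                           # 25.0 against 2 vCPUs * 1024 units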

spoilgo commented 2 years ago

I would like to clarify: Container Insights enables the metric dimensions listed at https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-ECS.html:

* TaskDefinitionFamily, ClusterName

* ServiceName, ClusterName

* ClusterName

However, none of those include a dimension like ContainerName. Note that a service here may run multiple containers internally, and we want to look into each individual container's performance. We currently haven't set CPU reservations/limits; my understanding is that even with CPU reservations/limits configured, we won't get per-container CPU metrics (to be displayed on a time-series graph in CloudWatch).

Please correct me if my understanding is wrong.

If my understanding is correct, then the existing CloudWatch Container Insights is not the solution for this issue. We need either a different solution provided by AWS or an enhanced Container Insights with a ContainerName dimension.

jameselderxe commented 2 years ago

We really need these container-level metrics for Fargate in order to effectively monitor our applications. There are instances where a container could be using 100% of its allocated CPU while the task-level CPU metric only shows 30% usage.

Currently we have no way of knowing/monitoring this.

The docs linked by @sharanyad suggest it's there for EC2-backed ECS, but there's no Container ID dimension on the Fargate-backed ECS metrics.

mreferre commented 2 years ago

I am joining this thread late. To overcome some of the limitations mentioned in this thread I built this custom dashboard: https://github.com/mreferre/container-insights-custom-dashboards/tree/master/fargate-right-sizing. It was meant for "right sizing", but it could obviously be useful for other use cases.

Note that it drills down to task-level granularity (to overcome the default Container Insights task-definition-level granularity). I did not go all the way to container-level granularity (honestly I don't even remember whether that's because it was not available when I built this 2+ years ago, or because I didn't deem it necessary at the time). However, container-level granularity should be possible today (and it seems it is possible to correlate containers to task IDs, contrary to what someone was alluding to above? Or am I missing something?).

HTH.

jameselderxe commented 2 years ago

@mreferre The issue is that container-level metrics are only available for EC2-backed ECS and not Fargate; it's documented here: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-ECS.html

Container Level Metrics

The other metrics in the docs, in the table above that one, don't have the Container ID dimension on them.

mreferre commented 2 years ago

Yes, but those are metrics. The link I pasted above talks about the "performance log events", which seem to include container-level numbers:

(screenshot omitted)

So you won't be able to set alarms on those or do anything else you'd do with metrics, BUT you can generate dashboards and extract meaningful information using CloudWatch Logs Insights (see the GitHub repo I linked above as an example).
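
As an illustration of that approach, a hedged boto3 sketch that runs a Logs Insights query against the Container Insights performance log group (the /aws/ecs/containerinsights/<cluster>/performance name is the documented default; grouping by TaskId and ContainerName assumes those fields appear on Type = "Container" events):

import time
import boto3

logs = boto3.client("logs")
cluster = "my-cluster"  # hypothetical cluster name

# Summarize CPU and memory per container, per task, from performance log events.
query = """
fields @timestamp
| filter Type = "Container"
| stats avg(CpuUtilized) as AvgCpu, max(MemoryUtilized) as PeakMem by TaskId, ContainerName
| sort ContainerName asc
"""

start = logs.start_query(
    logGroupName=f"/aws/ecs/containerinsights/{cluster}/performance",
    startTime=int(time.time()) - 3600,  # last hour
    endTime=int(time.time()),
    queryString=query,
)

# Poll until the query finishes, then print each result row as a dict.
results = logs.get_query_results(queryId=start["queryId"])
while results["status"] in ("Scheduled", "Running"):
    time.sleep(1)
    results = logs.get_query_results(queryId=start["queryId"])

for row in results["results"]:
    print({f["field"]: f["value"] for f in row})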

wsscc2021 commented 2 years ago

+1. I want to be able to see metrics per task ID in Container Insights.

sblack4 commented 2 years ago

You can see this stuff, although you have to use CloudWatch Logs Insights to pull the data logged by Container Insights.

Here's an example query to summarize a cluster/container's CPU and memory usage:

fields @message
| filter Type="Container"
| filter @logStream like /FargateTelemetry/
| stats  latest(ClusterName) as Cluster, max(CpuReserved) as MaxCpuReserved, avg(CpuUtilized) as AvgCpuUtilized, max(CpuUtilized) as PeakCpuUtilized, ceil(avg(MemoryUtilized)) as AvgMemUtilized, max(MemoryUtilized) as PeakMemUtilized by ContainerName
| sort ContainerName asc

jameselderxe commented 2 years ago

Unfortunately that query doesn’t appear to give the memory usage per container but instead gives the memory usage at the service/cluster level.

This is apparent if you are running multiple containers in a task and multiple tasks, as you'll see a memory utilised percentage above 100%.