DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

Missing values in datadog-agent on ECS Fargate platform_version 1.4.0 due to missing task metadata values #5403

Open J-Mx opened 4 years ago

J-Mx commented 4 years ago

Output of the info page (if this is a bug) N/A

Describe what happened: While monitoring my task with the Datadog Agent, everything works fine with platform_version 1.3.0, but after upgrading to 1.4.0, some metrics derived from the task metadata endpoint disappear. Digging into the Datadog Agent code, it seems that the task metadata endpoint no longer returns these values:

system_cpu_usage
online_cpus

You can see a sample comparing task metadata from 1.4.0 (left) and 1.3.0 (right) here

So ecs.fargate.cpu.percent is no longer computed with platform_version 1.4.0 of Fargate.

Describe what you expected: Get the ecs.fargate.cpu.percent value as expected. I opened a ticket on the ECS Roadmap Project here

Steps to reproduce the issue: Launch a task with platform_version 1.4.0

Additional environment details (Operating System, Cloud provider, etc): AWS Fargate with platform_version 1.4.0 and 1.3.0; Datadog Agent version 7.18.1

jdreaver commented 4 years ago

I'm seeing the same thing. The only thing I changed for my ECS Fargate service was the platform version, to 1.4.0, and I no longer see ecs.fargate.cpu.percent.

jdreaver commented 4 years ago

Does this PR https://github.com/DataDog/datadog-agent/pull/5411 fix this issue?

jdreaver commented 4 years ago

FYI I just upgraded to version 7.19.0 https://github.com/DataDog/datadog-agent/releases/tag/7.19.0 and I'm still seeing this issue.

fernst commented 4 years ago

+1. I just upgraded to Fargate version 1.4.0 (from 1.3.0) and I can confirm that the ecs.fargate.cpu.percent metric is missing. All other values seem to be working as expected. I'm using the datadog/agent:latest image.

xornivore commented 4 years ago

@DataDog/container-integrations team here. FYI, we are still investigating this issue but can confirm the findings that exhibit the 1.4.0 vs 1.3.0 difference:

Same container image reports these stats:

Fargate 1.4.0:

'cpu_stats': {'cpu_usage': {'total_usage': 399244944, 'percpu_usage': [367286436, 31958508, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'usage_in_kernelmode': 30000000, 'usage_in_usermode': 360000000}, 'throttling_data': {'periods': 0, 'throttled_periods': 0, 'throttled_time': 0}}

Fargate 1.3.0:

 'cpu_stats': {'cpu_usage': {'total_usage': 431866773, 'percpu_usage': [222238847, 209627926, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'usage_in_kernelmode': 30000000, 'usage_in_usermode': 340000000}, 'system_cpu_usage': 522320000000, 'online_cpus': 2, 'throttling_data': {'periods': 0, 'throttled_periods': 0, 'throttled_time': 0}}

We will update once we know more and if there's a workaround we can offer.

To clarify re: PR #5411 - it fixes a different issue, where JSON format from /v2/stats/{container_id} API changes across these versions.
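The difference between the two cpu_stats samples above can be checked mechanically. The following is a minimal Python sketch, not code from datadog-agent: the helper name missing_cpu_fields and the constant REQUIRED_CPU_FIELDS are hypothetical, and the dicts are abbreviated copies of the samples in this comment (percpu_usage is omitted for brevity).

```python
# Fields present in the Fargate 1.3.0 sample but absent from the 1.4.0 sample.
# (Hypothetical constant for illustration; not taken from datadog-agent.)
REQUIRED_CPU_FIELDS = ("system_cpu_usage", "online_cpus")


def missing_cpu_fields(cpu_stats):
    """Return the required keys that are absent from a cpu_stats dict."""
    return [name for name in REQUIRED_CPU_FIELDS if name not in cpu_stats]


# Abbreviated copy of the Fargate 1.4.0 sample.
fargate_140 = {
    "cpu_usage": {
        "total_usage": 399244944,
        "usage_in_kernelmode": 30000000,
        "usage_in_usermode": 360000000,
    },
    "throttling_data": {"periods": 0, "throttled_periods": 0, "throttled_time": 0},
}

# Abbreviated copy of the Fargate 1.3.0 sample.
fargate_130 = {
    "cpu_usage": {
        "total_usage": 431866773,
        "usage_in_kernelmode": 30000000,
        "usage_in_usermode": 340000000,
    },
    "system_cpu_usage": 522320000000,
    "online_cpus": 2,
    "throttling_data": {"periods": 0, "throttled_periods": 0, "throttled_time": 0},
}

print(missing_cpu_fields(fargate_140))  # ['system_cpu_usage', 'online_cpus']
print(missing_cpu_fields(fargate_130))  # []
```

A check like this could be run against the stats payload of any task to tell whether it is affected, assuming the payload has the same shape as the samples shown here.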

ken5scal commented 4 years ago

I'm experiencing the same issue

fpetrini commented 4 years ago

Any updates on this issue? This is starting to become critical for us as we need to move to ECS Fargate 1.4.0.

ikedam commented 4 years ago

This is caused by a regression in Fargate 1.4.0, as noted in the description: cpu_stats and precpu_stats are incomplete. The ticket opened in the description (aws/containers-roadmap#855) was closed as fixed, but the fix was only partial: cpu_stats was fixed, while precpu_stats is still incomplete. cpu_stats reports accumulated values, so datadog-agent does not work until precpu_stats is fixed too.

The ticket for precpu_stats of Fargate 1.4.0 is here: aws/containers-roadmap#1062
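To illustrate why accumulated counters need a previous sample, here is a minimal Python sketch of the Docker-style CPU percentage calculation (usage delta divided by system delta, scaled by the number of online CPUs). It is an assumption that datadog-agent computes ecs.fargate.cpu.percent along these lines; the function name cpu_percent and the numbers in the example are hypothetical and not taken from this thread.

```python
def cpu_percent(stats):
    """Docker-style CPU percent from one stats payload.

    Returns None when cpu_stats or precpu_stats lacks a field needed
    for the calculation, which is the situation described above.
    """
    cpu = stats.get("cpu_stats", {})
    precpu = stats.get("precpu_stats", {})
    try:
        # Counters are cumulative, so only the difference between the
        # current and the previous sample is meaningful.
        cpu_delta = cpu["cpu_usage"]["total_usage"] - precpu["cpu_usage"]["total_usage"]
        system_delta = cpu["system_cpu_usage"] - precpu["system_cpu_usage"]
        online_cpus = cpu["online_cpus"]
    except KeyError:
        return None
    if system_delta <= 0 or cpu_delta < 0:
        return None
    return cpu_delta / system_delta * online_cpus * 100.0


# Hypothetical complete payload (values chosen for easy arithmetic).
complete = {
    "precpu_stats": {
        "cpu_usage": {"total_usage": 400_000_000},
        "system_cpu_usage": 520_000_000_000,
    },
    "cpu_stats": {
        "cpu_usage": {"total_usage": 900_000_000},
        "system_cpu_usage": 522_000_000_000,
        "online_cpus": 2,
    },
}

# Same payload, but precpu_stats lacks system_cpu_usage.
incomplete = {
    "precpu_stats": {"cpu_usage": {"total_usage": 400_000_000}},
    "cpu_stats": complete["cpu_stats"],
}

print(cpu_percent(complete))    # 50.0
print(cpu_percent(incomplete))  # None
```

With a complete payload the two deltas give a percentage; with an incomplete precpu_stats there is no baseline for the system counter, so no value can be derived from that payload alone.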

bchew commented 3 years ago

Received an update from Datadog support 2 weeks ago on this - we upgraded back to Fargate PV 1.4.0 with Datadog agent version 7.23.1 and ecs.fargate.cpu.percent has been present since then.

Gowiem commented 1 year ago

I seem to be running into something similar to this in 2023... It's confusing to see this issue still open and no updates since 2020.

I imagine my issue looking similar is likely a mistake (or at least I hope so). Can someone from Datadog weigh in on this issue and close it out if so?