DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

Adjusting tag cardinality can cause duplicate metrics #5904

Open spacez320 opened 4 years ago

spacez320 commented 4 years ago

Describe what happened:

We've seen that when we adjust Tag cardinality (by adding a new Tag, for example), two metric series that logically represent the same data can coexist for a short period (probably the time it takes to collect new data) and cause counting issues in Monitors or Dashboards.

For example, say I have a metric from the Kubernetes integration, kubernetes.node.whatever, and I want to add a new nodeLabelAsTag mapping foo=bar. I push that configuration change to the Agent, and the following happens:

Before push, metrics look like this:

kubernetes.node.whatever{fizz=buzz} 1

After push, they look like this for a few minutes:

kubernetes.node.whatever{fizz=buzz} 1
kubernetes.node.whatever{fizz=buzz, foo=bar} 1

Then eventually they become:

kubernetes.node.whatever{fizz=buzz, foo=bar} 1

The problem with this is that anything that queries sum:kubernetes.node.whatever{fizz=buzz} will temporarily see "2" when it should really be "1".
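
For concreteness, the kind of configuration change described above would look roughly like this (a hedged sketch: assuming the mapping is applied via `kubernetes_node_labels_as_tags` in `datadog.yaml`, with `datadog.nodeLabelsAsTags` as the Helm-chart equivalent; the label and tag names are only illustrative):

```yaml
# datadog.yaml -- illustrative only; label/tag names are made up.
# Maps the node label "foo" to a metric tag named "foo".
kubernetes_node_labels_as_tags:
  foo: foo
```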

Describe what you expected:

I'm not sure what's possible, but it would be nice if Monitors or Datadog metrics could be taught to deduplicate series that differ only in accessory Tags. Currently, changes like this can cause our Monitors to go haywire, and we would like to be able to edit Tags without that happening.

Steps to reproduce the issue:

  1. Create a metric.
  2. Create a Monitor that sums the metric on an existing Tag.
  3. Add a Tag on that metric.
  4. Observe that the Monitor's tracked value jumps because it briefly sees both the old and the re-tagged series (see the sketch below).
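
The steps above can be approximated without the Kubernetes integration at all. A minimal DogStatsD sketch (assuming the datadogpy client, a local Agent, and made-up metric/tag names; the corresponding monitor query would be sum:demo.node.whatever{fizz:buzz}):

```python
from datadog import initialize, statsd

initialize(statsd_host="localhost", statsd_port=8125)

# Steps 1-2: an existing series, monitored with sum:demo.node.whatever{fizz:buzz}
statsd.gauge("demo.node.whatever", 1, tags=["fizz:buzz"])

# Step 3: the same logical series after the tag change. For a few minutes
# Datadog still has recent points for the old {fizz:buzz} series, so the
# monitor's sum briefly reports 2 instead of 1 (step 4).
statsd.gauge("demo.node.whatever", 1, tags=["fizz:buzz", "foo:bar"])
```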

Additional environment details (Operating System, Cloud provider, etc):

Seems like a pretty general problem.

programmer04 commented 2 years ago

Hello @spacez320, I faced a similar issue: I wanted to discard some tags to avoid duplicating metrics that represent the same logical value (specifically, I wanted to get rid of the host tag, because the metric was service-level but was reported by every replica). The Distribution metric type helped me, because it lets you select which tags to retain. You can read more at https://docs.datadoghq.com/metrics/distributions/
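
For reference, a rough sketch of that workaround (assuming DogStatsD and the datadogpy client; the metric and tag names are made up, and the choice of which tags to retain is made afterwards in the distribution's tag configuration in the Datadog app):

```python
from datadog import initialize, statsd

initialize(statsd_host="localhost", statsd_port=8125)

# Submit the value as a distribution instead of a gauge. Every replica can
# report it, and the host tag can then be excluded from the distribution's
# tag configuration in the Datadog app so it aggregates into one series.
statsd.distribution("checkout.queue.depth", 17, tags=["service:checkout"])
```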

But it's still not an ideal solution for everything; for instance, how do you have a Gauge metric that ignores some tags?

BEvgeniyS commented 6 months ago

We have the same problem: kubernetes.memory.request is doubled for each new value of container_id (for example, when a container is restarted). It takes far too long for the agent to drop the old container_id tag, and during that period the summed metric is doubled.

What we need is the ability to remove a tag from consideration completely.
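
Not a full fix, but one knob worth mentioning (a hedged sketch: this only helps if container_id is attached because the check-level tag cardinality is raised, which may not be the case for this metric):

```yaml
# datadog.yaml -- check-level tag cardinality; valid values are
# low | orchestrator | high. The same setting is exposed as the
# DD_CHECKS_TAG_CARDINALITY environment variable.
checks_tag_cardinality: low
```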