DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

Please unify cloud provider tags #7396

Open tonglil opened 3 years ago

tonglil commented 3 years ago

Describe what happened:

Currently, the tags Datadog collects and applies to cloud resources are 1) inconsistent and 2) varied.

1. For example, there is no convention or standardization between - and _

Please unify them on one or the other.

For GCP, these are the tags collected:

zone
instance-type
instance-id
automatic-restart
on-host-maintenance
numeric_project_id

For AWS:

# EC2
autoscaling_group
availability-zone
instance-id
instance-type
security_group_name

In this case, there is

  1. a difference between zone (GCP) and availability-zone (AWS), causing duplication & multiplication of tags (and therefore custom metrics) if you want to unify them across clouds (i.e. one app deployed in both)
  2. a difference between _ and - within the same category
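
As a rough illustration of what "unify on one or the other" could mean, here is a minimal sketch that rewrites the separators in a tag key while leaving the value untouched. `normalize_tag` is a hypothetical helper written for this example, not something the Datadog Agent provides, and it assumes tags arrive as plain `key:value` strings.

```python
def normalize_tag(tag: str, sep: str = "_") -> str:
    """Rewrite '-' and '_' in the tag key to a single separator.

    Hypothetical helper for illustration only. The value (everything after
    the first ':') is left alone, because values such as 'us-east-1a'
    legitimately contain hyphens.
    """
    key, colon, value = tag.partition(":")
    key = key.lower().replace("-", sep).replace("_", sep)
    return f"{key}{colon}{value}"


# Tag keys taken from the AWS list above; the values are made up.
tags = ["availability-zone:us-east-1a", "autoscaling_group:web", "instance-id:i-0abc"]
normalized = [normalize_tag(t) for t in tags]
print(normalized)
```

This only settles the `-` versus `_` question. It does not reconcile keys that differ in wording, such as zone and availability-zone.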

2. Duplicated tags from integrations

Along with the above, different integration levels collect the same tags with different keys, resulting in examples like this (GCP & K8s):

cluster_name
cluster-name

I have no idea which one did what because

  1. the gcp docs don't say which tags are collected for what integration (like the aws docs)
  2. the k8s docs don't say anywhere what tags are tacked on automatically (i assume it's cluster_name based on DD_CLUSTER_NAME)

3. Another naming oddity

# EBS
volumeid, volume-name, volume-type
# EC2
instance-id, name, instance-type
# ECS
instance_id, clustername, servicename
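
One way to audit an account for this kind of drift is to group tag keys that collapse to the same string once case and separators are removed. The sketch below assumes you already have a flat list of tag keys (for example, exported by hand); `find_key_variants` is a hypothetical name, not an existing Datadog tool.

```python
from collections import defaultdict


def find_key_variants(keys):
    """Group tag keys that differ only by case, '-' or '_'.

    Hypothetical audit helper. Returns a dict mapping the squashed key to
    the sorted list of spellings seen, keeping only keys with 2+ spellings.
    """
    groups = defaultdict(set)
    for key in keys:
        squashed = key.lower().replace("-", "").replace("_", "")
        groups[squashed].add(key)
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


# Keys taken from the examples in this issue.
keys = ["instance-id", "instance_id", "cluster_name", "cluster-name",
        "clustername", "volumeid", "volume-name"]
variants = find_key_variants(keys)
for squashed, spellings in variants.items():
    print(squashed, "->", spellings)
```

Keys with a single spelling in the input, like volumeid here, are not reported, so the output lists only the conflicts.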

4. cloud_provider tag is inconsistently automatically applied

For agents running in GCP, the cloud_provider:gcp tag is automatically added to all things. However, based on a chat with a support agent, the cloud_provider:aws tag is not automatically added for AWS:

image

This is inconsistent behavior

5. aws_account tag is inconsistently applied to AWS metrics

Some AWS metrics collected by the AWS integration are automatically tagged with the account number, while others lack this tag. For example, ELB metrics have this tag, but EC2 metrics do not.

This makes filtering on this tag value difficult when building a multi-account dashboard.

This is not a problem with GCP as all metrics are tagged with project_id.

Describe what you expected:

I expect:

This could be easier if I could actually configure how tags are formatted or configured.

It doesn't seem like the code allows for that right now since it is hardcoded: https://github.com/DataDog/datadog-agent/blob/eb4fa9ce5edc05fd2ac61df19ce5b98b9f727b35/pkg/util/gce/gce_tags.go#L73-L92

It would also be helpful if I can pick and choose which tags are collected, but that's not possible either.

I ask for this because building multi/cross cloud queries & dashboards is not trivial.

This seems like a sensible thing to do (review tag names and make them conform to some "datadog-internal" standard) so users don't have a poor experience when trying to correlate data from multiple sources.

tonglil commented 3 years ago

Here's more anomalies:

AZs

The concept of zones uses different tags for each cloud provider:

| cloud | name |
| --- | --- |
| azure | availability_zone |
| aws | availability-zone |
| gcp | zone |

Regions

The region tag is collected from Azure and AWS, however not from all GCP integrations. Some have it, but then some don't, like GAE, cloud nats, etc...

For select integrations it's named location instead, like cloud run, spanner, memcache, cloud tasks, dns, etc...

edwardaux commented 2 years ago

Tap, tap, tap... is this thing on? Just wondering if there are any plans to address this at all?

It makes it /really/ hard to build dashboards if the tagging isn't consistent.

ian28223 commented 2 years ago

@edwardaux Thanks for the feedback. Have you raised this issue or opened a ticket through the support channels? If not, I would advise that you do, because most of the tagging issues you mentioned are not done by the agent (except maybe for the clustername in a k8s env); there are different teams involved with Crawler/Web/Cloud-based integrations that have no dependencies on the Datadog Agent (this repository). That said, a support ticket might be a better way for this issue to get more traction and have it routed to the teams responsible.

tonglil commented 2 years ago

@ian28223 I'm sorry but Datadog is a complete product, and as customers of this product it is surprising to me that this kind of interweaving and crosscutting issue is not of interest to be addressed.

Furthermore, asking customers to open tickets when employees can do so much more easily is just shocking to me. Tickets opened often end up closed as a "thanks we'll file this as a feature request" with no accountability. Sharing here provides visibility to other customers that they're not the only ones having issues with that process and the product itself.

Lastly, it's unfortunate to see the responsibility and involvement of the Datadog agent being redirected. As I clearly linked to in the original comment, there are multiple places where tags are set or used by the agent.

The "different teams involved with Crawler/Web/Cloud-based integrations that have no dependencies to the Datadog Agent" together make up the user experience for this product, and the issues outlined result in a poor user experience.

I recommend Datadog (or the Datadog agent team) reevaluate the "somebody else's problem" approach it takes to this kind of tech debt.

btkostner commented 2 years ago

I'd like to add that this is very annoying because there are interfaces in Datadog that assume one type of tag. For instance, loading up the infrastructure map with gcp and gke looks like this by default:

image

Some of the group drop down options also don't work because they are named differently in gcp:

image

I think an acceptable stop gap would be to allow aliasing tags, so if you were running in gcp and had a tag like cluster-location:us-east1 you could alias it to create region:us-east1.
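
To make the aliasing idea concrete, here is a minimal sketch of what such a stop gap might do, assuming tags are plain `key:value` strings. `ALIASES` and `apply_aliases` are hypothetical names for illustration; nothing like this is known to exist in the agent, and the mapping holds only the one example from this comment.

```python
# Hypothetical alias map: source tag key -> key to create alongside it.
ALIASES = {"cluster-location": "region"}


def apply_aliases(tags, aliases=ALIASES):
    """Return the original tags plus an aliased copy of each matching tag.

    The original tag is kept, so existing dashboards keep working while
    new ones can query the aliased key.
    """
    out = list(tags)
    for tag in tags:
        key, colon, value = tag.partition(":")
        if colon and key in aliases:
            alias = f"{aliases[key]}:{value}"
            if alias not in out:  # do not emit the same tag twice
                out.append(alias)
    return out


result = apply_aliases(["cluster-location:us-east1", "env:prod"])
print(result)
```

Keeping the source tag rather than renaming it means each alias adds a tag, which matters if tag cardinality affects custom metric counts.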

ian28223 commented 2 years ago

I understand and I agree. I have raised this to the relevant team's PM. Proper tracking would still be via official support channels.

kallangerard commented 5 months ago

> Here's more anomalies:
>
> AZs
>
> The concept of zones uses different tags for each cloud provider:
>
> | cloud | name |
> | --- | --- |
> | azure | availability_zone |
> | aws | availability-zone |
> | gcp | zone |
>
> Regions
>
> The region tag is collected from Azure and AWS, however not from all GCP integrations. Some have it, but then some don't, like GAE, cloud nats, etc...
>
> For select integrations it's named location instead, like cloud run, spanner, memcache, cloud tasks, dns, etc...

Note that for GCP, location is not the same as region. They're two separate attributes that often have the same value, but they're not the same.

See Cloud Storage Locations for example https://cloud.google.com/storage/docs/locations

tonglil commented 5 months ago

For compute, it's still zones and regions.

https://cloud.google.com/compute/docs/regions-zones

https://cloud.google.com/docs/geography-and-regions

> location is not the same as region

Correct, they are supplementary to each other. While GCS doesn't allow you to choose a specific zone, they still have regions (which is conceptually shared across other products like compute) in addition to multi-region codes (aka locations).

For cloud run, you can only specify region (no zone or multi region).

gcloud storage buckets create gs://BUCKET_NAME --location=US --placement=US-CENTRAL1,US-EAST1

DD should collect tags for GCP's zone, region, and location as applicable - and unify them (or allow us to rename them so) to deal with multicloud setups.
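
For compute resources, a unified zone and region can be sketched on the client side, assuming standard zone naming (GCP zones look like `us-east1-b`, AWS availability zones like `us-east-1a`). The function names are hypothetical, the zone keys are the ones listed earlier in this thread, and special cases such as AWS Local Zones or GCS multi-region locations are deliberately not handled.

```python
import re

# Zone tag key per provider, as reported earlier in this thread.
ZONE_KEYS = {"aws": "availability-zone", "azure": "availability_zone", "gcp": "zone"}


def region_from_zone(provider, zone):
    """Derive a region from a standard zone name, or return None.

    Assumes standard naming only: GCP '<region>-<letter>',
    AWS '<region><letter>'. Other providers are not handled.
    """
    if provider == "gcp":
        match = re.fullmatch(r"(.+)-[a-z]", zone)       # us-east1-b -> us-east1
    elif provider == "aws":
        match = re.fullmatch(r"(.+-\d+)[a-z]", zone)    # us-east-1a -> us-east-1
    else:
        return None
    return match.group(1) if match else None


def unified_location_tags(provider, tags):
    """Build provider-independent 'zone:' and 'region:' tags from raw tags."""
    found = dict(t.partition(":")[::2] for t in tags if ":" in t)
    out = []
    zone = found.get(ZONE_KEYS.get(provider))
    if zone:
        out.append(f"zone:{zone}")
        # Prefer an explicit region tag; fall back to deriving it from the zone.
        region = found.get("region") or region_from_zone(provider, zone)
        if region:
            out.append(f"region:{region}")
    return out


print(unified_location_tags("gcp", ["zone:us-east1-b"]))
print(unified_location_tags("aws", ["availability-zone:us-east-1a"]))
```

Location is left out on purpose: as noted above, it is a separate attribute and cannot be derived from a zone.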