adel121 commented 3 days ago

What does this PR do?

This PR adds the resource's API group to the configuration parameter for generic metadata collection.

In other words, instead of DD_CLUSTER_AGENT_KUBE_METADATA_COLLECTION_RESOURCES = [deployments statefulsets nodes], we will now have DD_CLUSTER_AGENT_KUBE_METADATA_COLLECTION_RESOURCES = [apps/deployments apps/statefulsets /nodes].

Motivation

Avoid collisions in cases where the same resource name exists under different API groups. GKE is an example of this:

On GKE, we have the nodes resource under two different API Groups:

metrics.k8s.io
"" (empty API group, corresponding to the default core group in Kubernetes)

In this case, if the user asks to collect metadata of nodes, it is not possible to know which of the following resources is meant:

nodes.metrics.k8s.io
nodes

This results in a conflict.
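
To make the ambiguity concrete, here is a minimal sketch. The `apiResource` type and `groupsServing` helper are hypothetical names used only for illustration (they are not the agent's actual types), and the discovery data is hard-coded to mimic the GKE case described above.

```go
package main

import "fmt"

// apiResource is a hypothetical, simplified stand-in for one discovery entry:
// a resource name together with the API group that serves it.
type apiResource struct {
	Group string
	Name  string
}

// groupsServing returns every API group that exposes a resource with the given name.
func groupsServing(discovered []apiResource, name string) []string {
	var groups []string
	for _, r := range discovered {
		if r.Name == name {
			groups = append(groups, r.Group)
		}
	}
	return groups
}

func main() {
	// Illustrative discovery result for a GKE-like cluster (hard-coded assumption).
	discovered := []apiResource{
		{Group: "", Name: "nodes"},
		{Group: "metrics.k8s.io", Name: "nodes"},
		{Group: "apps", Name: "deployments"},
	}

	// "nodes" alone matches two groups, so the name is ambiguous.
	fmt.Printf("nodes: %q\n", groupsServing(discovered, "nodes"))
	// "deployments" matches a single group here.
	fmt.Printf("deployments: %q\n", groupsServing(discovered, "deployments"))
}
```

A bare resource name resolves to more than one group for nodes, which is why the group has to be part of the configured value.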

Additional Notes

With this change, the user can also specify the version by using the format {group}/{version}/{resource}, for example apps/v1/deployments. When this format is used, the discovery client is not used to fill in the version, and the indicated version is used as is.
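
The two accepted formats could be parsed roughly as follows. This is only a sketch: `resourceRef` and `parseResourceRef` are hypothetical names and do not reflect the code in this PR. Rejecting a bare resource name such as `deployments` is an assumption of the sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// resourceRef is a hypothetical type for this sketch, not the agent's actual type.
type resourceRef struct {
	Group    string // empty means the core API group, as in "/nodes"
	Version  string // empty means "resolve the version via discovery"
	Resource string
}

// parseResourceRef accepts "{group}/{resource}" or "{group}/{version}/{resource}".
func parseResourceRef(s string) (resourceRef, error) {
	parts := strings.Split(s, "/")
	var ref resourceRef
	switch len(parts) {
	case 2:
		ref = resourceRef{Group: parts[0], Resource: parts[1]}
	case 3:
		if parts[1] == "" {
			return resourceRef{}, fmt.Errorf("invalid resource %q: empty version", s)
		}
		ref = resourceRef{Group: parts[0], Version: parts[1], Resource: parts[2]}
	default:
		return resourceRef{}, fmt.Errorf("invalid resource %q: expected {group}/{resource} or {group}/{version}/{resource}", s)
	}
	if ref.Resource == "" {
		return resourceRef{}, fmt.Errorf("invalid resource %q: empty resource name", s)
	}
	return ref, nil
}

func main() {
	// The value is space-separated, as in the environment variable above.
	value := "apps/deployments apps/v1/statefulsets /nodes deployments"
	for _, s := range strings.Fields(value) {
		ref, err := parseResourceRef(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-22s group=%q version=%q resource=%q\n", s, ref.Group, ref.Version, ref.Resource)
	}
}
```

In this sketch an empty `Version` signals that the version still has to be resolved through discovery, while a non-empty one is taken as given.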

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes

❗ For better validation, do this QA on GKE, because the issue was initially discovered there due to the same resource name existing under different API groups (see the Motivation section for more information) ❗

Deploy the cluster agent with the following Helm values file:

datadog:
  apiKeyExistingSecret: datadog-secret
  appKeyExistingSecret: datadog-secret
  kubelet:
    tlsVerify: false

clusterAgent:
  enabled: true
  replicas: 1
  env:
    - name: DD_CLUSTER_AGENT_KUBE_METADATA_COLLECTION_ENABLED
      value: "true"
    - name: DD_CLUSTER_AGENT_KUBE_METADATA_COLLECTION_RESOURCES
      value: "apps/deployments apps/daemonsets /nodes"

Ensure that metadata is collected successfully for deployments, daemonsets, and nodes.

kubectl exec <cluster-agent-pod> -- agent workload-list -v

=== Entity kubernetes_metadata sources(merged):[kubeapiserver] id: deployments/kube-system/kube-dns-autoscaler ===
----------- Entity ID -----------
Kind: kubernetes_metadata ID: deployments/kube-system/kube-dns-autoscaler

----------- Entity Meta -----------
Name: kube-dns-autoscaler
Namespace: kube-system
Annotations: deployment.kubernetes.io/revision:1 
Labels: addonmanager.kubernetes.io/mode:Reconcile k8s-app:kube-dns-autoscaler kubernetes.io/cluster-service:true 
----------- Resource -----------
apps/v1, Resource=deployments
===

=== Entity kubernetes_metadata sources(merged):[kubeapiserver] id: nodes//gke-adelhajhassan-default-pool-14a7bd1d-jnf2 ===
----------- Entity ID -----------
Kind: kubernetes_metadata ID: nodes//gke-adelhajhassan-default-pool-14a7bd1d-jnf2

----------- Entity Meta -----------
Name: gke-adelhajhassan-default-pool-14a7bd1d-jnf2
Namespace: 
Annotations: node.gke.io/last-applied-node-taints: volumes.kubernetes.io/controller-managed-attach-detach:true container.googleapis.com/instance_id:3216393220270216000 csi.volume.kubernetes.io/nodeid:{"pd.csi.storage.gke.io":"projects/datadog-sandbox/zones/us-central1-c/instances/gke-adelhajhassan-default-pool-14a7bd1d-jnf2"} node.alpha.kubernetes.io/ttl:0 node.gke.io/last-applied-node-labels:cloud.google.com/gke-boot-disk=pd-balanced,cloud.google.com/gke-container-runtime=containerd,cloud.google.com/gke-cpu-scaling-level=2,cloud.google.com/gke-logging-variant=DEFAULT,cloud.google.com/gke-max-pods-per-node=110,cloud.google.com/gke-nodepool=default-pool,cloud.google.com/gke-os-distribution=cos,cloud.google.com/gke-provisioning=standard,cloud.google.com/gke-stack-type=IPV4,cloud.google.com/machine-family=e2,cloud.google.com/private-node=false 
Labels: beta.kubernetes.io/arch:amd64 cloud.google.com/gke-boot-disk:pd-balanced cloud.google.com/gke-cpu-scaling-level:2 kubernetes.io/arch:amd64 topology.gke.io/zone:us-central1-c cloud.google.com/gke-max-pods-per-node:110 cloud.google.com/gke-nodepool:default-pool cloud.google.com/gke-provisioning:standard failure-domain.beta.kubernetes.io/region:us-central1 topology.kubernetes.io/zone:us-central1-c cloud.google.com/gke-container-runtime:containerd cloud.google.com/gke-logging-variant:DEFAULT cloud.google.com/gke-os-distribution:cos failure-domain.beta.kubernetes.io/zone:us-central1-c kubernetes.io/os:linux topology.kubernetes.io/region:us-central1 node.kubernetes.io/instance-type:e2-medium beta.kubernetes.io/instance-type:e2-medium beta.kubernetes.io/os:linux cloud.google.com/gke-stack-type:IPV4 cloud.google.com/machine-family:e2 cloud.google.com/private-node:false kubernetes.io/hostname:gke-adelhajhassan-default-pool-14a7bd1d-jnf2 
----------- Resource -----------
/v1, Resource=nodes
===

=== Entity kubernetes_metadata sources(merged):[kubeapiserver] id: daemonsets/gmp-system/collector ===
----------- Entity ID -----------
Kind: kubernetes_metadata ID: daemonsets/gmp-system/collector

----------- Entity Meta -----------
Name: collector
Namespace: gmp-system
Annotations: components.gke.io/layer:addon 
Labels: addonmanager.kubernetes.io/mode:Reconcile 
----------- Resource -----------
apps/v1, Resource=daemonsets
===

pr-commenter[bot] commented 3 days ago

Test changes on VM

Use this command from test-infra-definitions to manually test this PR's changes on a VM:

inv create-vm --pipeline-id=38263141 --os-family=ubuntu

Note: This applies to commit bfe0aefb

pr-commenter[bot] commented 3 days ago

Regression Detector

Regression Detector Results

Run ID: e517c00c-f6f1-4afd-9c58-aedd17962133 Metrics dashboard Target profiles

Baseline: f350ef14a5ecfaf059e313b7d87460aa24460a81 Comparison: bfe0aefbf1d1406d960e58140502d2d705d74d82

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00% Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Fine details of change detection per experiment

| perf | experiment | goal | Δ mean % | Δ mean % CI | links |
|------|----------------------------|--------------------|----------|------------------|-------|
| ➖ | basic_py_check | % cpu utilization | +0.10 | [-2.55, +2.76] | [Logs](https://app.datadoghq.com/logs?query=experiment%3Abasic_py_check%20run_id%3Ae517c00c-f6f1-4afd-9c58-aedd17962133&agg_m=count&agg_m_source=base&agg_q=%40span.url&agg_q_source=base&agg_t=count&fromUser=true&index=single-machine-performance-target-logs&messageDisplay=inline&refresh_mode=paused&storage=hot&stream_sort=time%2Cdesc&top_n=100&top_o=top&viz=stream&x_missing=true&from_ts=1720012747000&to_ts=1720024147000&live=false) |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.01, +0.01] | [Logs](https://app.datadoghq.com/logs?query=experiment%3Atcp_dd_logs_filter_exclude%20run_id%3Ae517c00c-f6f1-4afd-9c58-aedd17962133&agg_m=count&agg_m_source=base&agg_q=%40span.url&agg_q_source=base&agg_t=count&fromUser=true&index=single-machine-performance-target-logs&messageDisplay=inline&refresh_mode=paused&storage=hot&stream_sort=time%2Cdesc&top_n=100&top_o=top&viz=stream&x_missing=true&from_ts=1720012747000&to_ts=1720024147000&live=false) |
| ➖ | uds_dogstatsd_to_api | ingress throughput | -0.00 | [-0.00, +0.00] | [Logs](https://app.datadoghq.com/logs?query=experiment%3Auds_dogstatsd_to_api%20run_id%3Ae517c00c-f6f1-4afd-9c58-aedd17962133&agg_m=count&agg_m_source=base&agg_q=%40span.url&agg_q_source=base&agg_t=count&fromUser=true&index=single-machine-performance-target-logs&messageDisplay=inline&refresh_mode=paused&storage=hot&stream_sort=time%2Cdesc&top_n=100&top_o=top&viz=stream&x_missing=true&from_ts=1720012747000&to_ts=1720024147000&live=false) |
| ➖ | idle | memory utilization | -0.08 | [-0.11, -0.05] | [Logs](https://app.datadoghq.com/logs?query=experiment%3Aidle%20run_id%3Ae517c00c-f6f1-4afd-9c58-aedd17962133&agg_m=count&agg_m_source=base&agg_q=%40span.url&agg_q_source=base&agg_t=count&fromUser=true&index=single-machine-performance-target-logs&messageDisplay=inline&refresh_mode=paused&storage=hot&stream_sort=time%2Cdesc&top_n=100&top_o=top&viz=stream&x_missing=true&from_ts=1720012747000&to_ts=1720024147000&live=false) |
| ➖ | file_tree | memory utilization | -0.12 | [-0.20, -0.04] | [Logs](https://app.datadoghq.com/logs?query=experiment%3Afile_tree%20run_id%3Ae517c00c-f6f1-4afd-9c58-aedd17962133&agg_m=count&agg_m_source=base&agg_q=%40span.url&agg_q_source=base&agg_t=count&fromUser=true&index=single-machine-performance-target-logs&messageDisplay=inline&refresh_mode=paused&storage=hot&stream_sort=time%2Cdesc&top_n=100&top_o=top&viz=stream&x_missing=true&from_ts=1720012747000&to_ts=1720024147000&live=false) |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -0.60 | [-13.42, +12.23] | [Logs](https://app.datadoghq.com/logs?query=experiment%3Atcp_syslog_to_blackhole%20run_id%3Ae517c00c-f6f1-4afd-9c58-aedd17962133&agg_m=count&agg_m_source=base&agg_q=%40span.url&agg_q_source=base&agg_t=count&fromUser=true&index=single-machine-performance-target-logs&messageDisplay=inline&refresh_mode=paused&storage=hot&stream_sort=time%2Cdesc&top_n=100&top_o=top&viz=stream&x_missing=true&from_ts=1720012747000&to_ts=1720024147000&live=false) |
| ➖ | pycheck_1000_100byte_tags | % cpu utilization | -0.62 | [-5.32, +4.08] | [Logs](https://app.datadoghq.com/logs?query=experiment%3Apycheck_1000_100byte_tags%20run_id%3Ae517c00c-f6f1-4afd-9c58-aedd17962133&agg_m=count&agg_m_source=base&agg_q=%40span.url&agg_q_source=base&agg_t=count&fromUser=true&index=single-machine-performance-target-logs&messageDisplay=inline&refresh_mode=paused&storage=hot&stream_sort=time%2Cdesc&top_n=100&top_o=top&viz=stream&x_missing=true&from_ts=1720012747000&to_ts=1720024147000&live=false) |
| ➖ | otel_to_otel_logs | ingress throughput | -1.07 | [-1.88, -0.26] | [Logs](https://app.datadoghq.com/logs?query=experiment%3Aotel_to_otel_logs%20run_id%3Ae517c00c-f6f1-4afd-9c58-aedd17962133&agg_m=count&agg_m_source=base&agg_q=%40span.url&agg_q_source=base&agg_t=count&fromUser=true&index=single-machine-performance-target-logs&messageDisplay=inline&refresh_mode=paused&storage=hot&stream_sort=time%2Cdesc&top_n=100&top_o=top&viz=stream&x_missing=true&from_ts=1720012747000&to_ts=1720024147000&live=false) |
| ➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | -1.23 | [-2.11, -0.35] | [Logs](https://app.datadoghq.com/logs?query=experiment%3Auds_dogstatsd_to_api_cpu%20run_id%3Ae517c00c-f6f1-4afd-9c58-aedd17962133&agg_m=count&agg_m_source=base&agg_q=%40span.url&agg_q_source=base&agg_t=count&fromUser=true&index=single-machine-performance-target-logs&messageDisplay=inline&refresh_mode=paused&storage=hot&stream_sort=time%2Cdesc&top_n=100&top_o=top&viz=stream&x_missing=true&from_ts=1720012747000&to_ts=1720024147000&live=false) |

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that *if our statistical model is accurate*, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
3. Its configuration does not mark it "erratic".
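
The three criteria can be expressed as a small predicate. This is only a sketch of the decision rule as described above, not the Regression Detector's implementation. The `experiment` type and `isRegression` function are hypothetical, and the `Erratic` flag is assumed to be false for the sample rows because the table does not report it.

```go
package main

import (
	"fmt"
	"math"
)

// experiment is a hypothetical record mirroring one row of the results table.
type experiment struct {
	Name         string
	DeltaMeanPct float64 // estimated Δ mean %
	CILow        float64 // lower bound of the 90% CI
	CIHigh       float64 // upper bound of the 90% CI
	Erratic      bool    // assumed false below; not shown in the table
}

// effectSizeTolerance is the |Δ mean %| threshold quoted in the report.
const effectSizeTolerance = 5.0

// isRegression applies the three criteria: large enough effect,
// confidence interval excluding zero, and not marked erratic.
func isRegression(e experiment) bool {
	bigEnough := math.Abs(e.DeltaMeanPct) >= effectSizeTolerance
	ciExcludesZero := e.CILow > 0 || e.CIHigh < 0
	return bigEnough && ciExcludesZero && !e.Erratic
}

func main() {
	rows := []experiment{
		// Two rows taken from the table above.
		{Name: "otel_to_otel_logs", DeltaMeanPct: -1.07, CILow: -1.88, CIHigh: -0.26},
		{Name: "tcp_syslog_to_blackhole", DeltaMeanPct: -0.60, CILow: -13.42, CIHigh: 12.23},
		// Made-up row, included only to show a case that satisfies all criteria.
		{Name: "hypothetical_example", DeltaMeanPct: -6.00, CILow: -8.00, CIHigh: -4.00},
	}
	for _, e := range rows {
		fmt.Printf("%-26s regression=%t\n", e.Name, isRegression(e))
	}
}
```

For instance, otel_to_otel_logs has a confidence interval that excludes zero, but its |Δ mean %| of 1.07 is below the 5.00% tolerance, so it is not flagged.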

GustavoCaso commented 2 days ago

Is this change backward compatible? Meaning, if I had DD_CLUSTER_AGENT_KUBE_METADATA_COLLECTION_RESOURCES = [deployments statefulsets nodes], would that still work as expected with the new code?

adel121 commented 2 days ago

Is this change backward compatible? Meaning, if I had DD_CLUSTER_AGENT_KUBE_METADATA_COLLECTION_RESOURCES = [deployments statefulsets nodes], would that still work as expected with the new code?

No, it is not backward compatible. However, this config option is not publicly documented and is used neither in the Helm chart nor in the operator, so nothing should break.

adel121 commented 1 day ago

/merge

dd-devflow[bot] commented 1 day ago

:steam_locomotive: MergeQueue: pull request added to the queue

The median merge time in main is 25m.

Use /merge -c to cancel this operation!

DataDog / datadog-agent

[CONTP-283] Should require also API Group (in addition to resource name) for generic metadata collection #27225