szegedi commented 3 months ago

What does this PR do?

Allows Continuous Profiler to be enabled (or disabled) by Helm Charts config on the controller pod. The cluster-agent, running on the controller pod, will read an environment variable and use this to mutate the configuration of other pods to set the environment variables that will activate profiling within the tracer/profiler client libraries.

This PR follows the same approach as #23618 did for activation of ASM products.

Motivation

To make it easier for k8s clients to activate Continuous Profiling. Simplified installation is a common request.

Additional Notes

The PR is designed to establish the fundamentals that will make these other PRs work:

https://github.com/DataDog/helm-charts/pull/1443
https://github.com/DataDog/datadog-operator/pull/1271

It is necessary to first have the functionality here in the agent before we can make the Helm and Operator changes available.

The changes in those Datadog Operator and Helm Charts PRs will set the env var DD_ADMISSION_CONTROLLER_AUTO_INSTRUMENTATION_PROFILING_ENABLED on the cluster-agent. This will result in DD_PROFILING_ENABLED being propagated to all pods (or to those matching the filters).
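As a rough illustration of the propagation described above, here is a minimal sketch in Go. It is not the code in this PR: `Pod`, `Container`, `EnvVar`, `mutatePod`, and `hasEnv` are hypothetical, simplified stand-ins for the Kubernetes types and the admission-controller logic. The name of the controller-side variable is passed in as a parameter, and the sketch assumes that the value read on the controller is copied unchanged and that a container which already sets DD_PROFILING_ENABLED is left untouched.

```go
package main

// EnvVar, Container and Pod are simplified, hypothetical stand-ins for the
// Kubernetes API types; they are not the agent's real data structures.
type EnvVar struct {
	Name  string
	Value string
}

type Container struct {
	Name string
	Env  []EnvVar
}

type Pod struct {
	Namespace  string
	Containers []Container
}

// profilingEnvVar is the variable the PR description says is propagated.
const profilingEnvVar = "DD_PROFILING_ENABLED"

// mutatePod adds DD_PROFILING_ENABLED to every container of pod when lookup
// reports a non-empty value for controllerVar and the pod passes filter.
// Containers that already define the variable are skipped (an assumption of
// this sketch, so that explicit user settings win).
// It returns true if at least one container was changed.
func mutatePod(pod *Pod, controllerVar string,
	lookup func(string) (string, bool), filter func(*Pod) bool) bool {

	value, ok := lookup(controllerVar)
	if !ok || value == "" || !filter(pod) {
		return false
	}
	changed := false
	for i := range pod.Containers {
		c := &pod.Containers[i]
		if hasEnv(c.Env, profilingEnvVar) {
			continue
		}
		c.Env = append(c.Env, EnvVar{Name: profilingEnvVar, Value: value})
		changed = true
	}
	return changed
}

// hasEnv reports whether env already contains a variable called name.
func hasEnv(env []EnvVar, name string) bool {
	for _, e := range env {
		if e.Name == name {
			return true
		}
	}
	return false
}
```

In a real process, `os.LookupEnv` could be passed as `lookup`; taking it as a parameter keeps the sketch testable without touching the process environment.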

Possible Drawbacks / Trade-offs

More complexity in our config handling, to make it easier for customers.

Describe how to test/QA your changes

Unit tests have been added and pass when run with invoke test --targets=./pkg/clusteragent

bits-bot commented 3 months ago

All committers have signed the CLA.

pr-commenter[bot] commented 3 months ago

Regression Detector

Regression Detector Results

Run ID: 7584aa4d-b07d-41ef-80f6-853130a1ba4f Metrics dashboard Target profiles

Baseline: c7e91281df20bf43b47de56e1d16cc889da1bcb4
Comparison: 8855e1b2daf00f016e3e2a834bf7ec04c879738d

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Fine details of change detection per experiment

| perf | experiment | goal | Δ mean % | Δ mean % CI | links |
|------|----------------------------|--------------------|----------|------------------|-------|
| ➖ | tcp_syslog_to_blackhole | ingress throughput | +1.73 | [-10.99, +14.45] | [Logs](https://app.datadoghq.com/logs?query=experiment%3Atcp_syslog_to_blackhole%20run_id%3A7584aa4d-b07d-41ef-80f6-853130a1ba4f&agg_m=count&agg_m_source=base&agg_q=%40span.url&agg_q_source=base&agg_t=count&fromUser=true&index=single-machine-performance-target-logs&messageDisplay=inline&refresh_mode=paused&storage=hot&stream_sort=time%2Cdesc&top_n=100&top_o=top&viz=stream&x_missing=true&from_ts=1720531692000&to_ts=1720543092000&live=false) |
| ➖ | file_tree | memory utilization | +0.72 | [+0.63, +0.80] | [Logs](https://app.datadoghq.com/logs?query=experiment%3Afile_tree%20run_id%3A7584aa4d-b07d-41ef-80f6-853130a1ba4f&agg_m=count&agg_m_source=base&agg_q=%40span.url&agg_q_source=base&agg_t=count&fromUser=true&index=single-machine-performance-target-logs&messageDisplay=inline&refresh_mode=paused&storage=hot&stream_sort=time%2Cdesc&top_n=100&top_o=top&viz=stream&x_missing=true&from_ts=1720531692000&to_ts=1720543092000&live=false) |
| ➖ | otel_to_otel_logs | ingress throughput | +0.54 | [-0.27, +1.36] | [Logs](https://app.datadoghq.com/logs?query=experiment%3Aotel_to_otel_logs%20run_id%3A7584aa4d-b07d-41ef-80f6-853130a1ba4f&agg_m=count&agg_m_source=base&agg_q=%40span.url&agg_q_source=base&agg_t=count&fromUser=true&index=single-machine-performance-target-logs&messageDisplay=inline&refresh_mode=paused&storage=hot&stream_sort=time%2Cdesc&top_n=100&top_o=top&viz=stream&x_missing=true&from_ts=1720531692000&to_ts=1720543092000&live=false) |
| ➖ | pycheck_1000_100byte_tags | % cpu utilization | +0.41 | [-4.50, +5.31] | [Logs](https://app.datadoghq.com/logs?query=experiment%3Apycheck_1000_100byte_tags%20run_id%3A7584aa4d-b07d-41ef-80f6-853130a1ba4f&agg_m=count&agg_m_source=base&agg_q=%40span.url&agg_q_source=base&agg_t=count&fromUser=true&index=single-machine-performance-target-logs&messageDisplay=inline&refresh_mode=paused&storage=hot&stream_sort=time%2Cdesc&top_n=100&top_o=top&viz=stream&x_missing=true&from_ts=1720531692000&to_ts=1720543092000&live=false) |
| ➖ | idle | memory utilization | +0.07 | [+0.02, +0.12] | [Logs](https://app.datadoghq.com/logs?query=experiment%3Aidle%20run_id%3A7584aa4d-b07d-41ef-80f6-853130a1ba4f&agg_m=count&agg_m_source=base&agg_q=%40span.url&agg_q_source=base&agg_t=count&fromUser=true&index=single-machine-performance-target-logs&messageDisplay=inline&refresh_mode=paused&storage=hot&stream_sort=time%2Cdesc&top_n=100&top_o=top&viz=stream&x_missing=true&from_ts=1720531692000&to_ts=1720543092000&live=false) |
| ➖ | basic_py_check | % cpu utilization | +0.06 | [-2.55, +2.67] | [Logs](https://app.datadoghq.com/logs?query=experiment%3Abasic_py_check%20run_id%3A7584aa4d-b07d-41ef-80f6-853130a1ba4f&agg_m=count&agg_m_source=base&agg_q=%40span.url&agg_q_source=base&agg_t=count&fromUser=true&index=single-machine-performance-target-logs&messageDisplay=inline&refresh_mode=paused&storage=hot&stream_sort=time%2Cdesc&top_n=100&top_o=top&viz=stream&x_missing=true&from_ts=1720531692000&to_ts=1720543092000&live=false) |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.01, +0.01] | [Logs](https://app.datadoghq.com/logs?query=experiment%3Atcp_dd_logs_filter_exclude%20run_id%3A7584aa4d-b07d-41ef-80f6-853130a1ba4f&agg_m=count&agg_m_source=base&agg_q=%40span.url&agg_q_source=base&agg_t=count&fromUser=true&index=single-machine-performance-target-logs&messageDisplay=inline&refresh_mode=paused&storage=hot&stream_sort=time%2Cdesc&top_n=100&top_o=top&viz=stream&x_missing=true&from_ts=1720531692000&to_ts=1720543092000&live=false) |
| ➖ | uds_dogstatsd_to_api | ingress throughput | -0.00 | [-0.00, +0.00] | [Logs](https://app.datadoghq.com/logs?query=experiment%3Auds_dogstatsd_to_api%20run_id%3A7584aa4d-b07d-41ef-80f6-853130a1ba4f&agg_m=count&agg_m_source=base&agg_q=%40span.url&agg_q_source=base&agg_t=count&fromUser=true&index=single-machine-performance-target-logs&messageDisplay=inline&refresh_mode=paused&storage=hot&stream_sort=time%2Cdesc&top_n=100&top_o=top&viz=stream&x_missing=true&from_ts=1720531692000&to_ts=1720543092000&live=false) |
| ➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | -0.51 | [-1.42, +0.39] | [Logs](https://app.datadoghq.com/logs?query=experiment%3Auds_dogstatsd_to_api_cpu%20run_id%3A7584aa4d-b07d-41ef-80f6-853130a1ba4f&agg_m=count&agg_m_source=base&agg_q=%40span.url&agg_q_source=base&agg_t=count&fromUser=true&index=single-machine-performance-target-logs&messageDisplay=inline&refresh_mode=paused&storage=hot&stream_sort=time%2Cdesc&top_n=100&top_o=top&viz=stream&x_missing=true&from_ts=1720531692000&to_ts=1720543092000&live=false) |

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that *if our statistical model is accurate*, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
3. Its configuration does not mark it "erratic".
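The three criteria can be written as a single predicate. The Go sketch below is only an illustration of the rule as stated in this comment, not the Regression Detector's actual implementation; the `Experiment` type, its field names, and `isRegression` are hypothetical.

```go
package main

import "math"

// Experiment holds the per-experiment numbers shown in the results table.
// The type and its field names are hypothetical, chosen for this sketch.
type Experiment struct {
	Name         string
	DeltaMeanPct float64 // estimated Δ mean %
	CILow        float64 // lower bound of the Δ mean % CI
	CIHigh       float64 // upper bound of the Δ mean % CI
	Erratic      bool    // true if the configuration marks it "erratic"
}

// isRegression applies the three stated criteria: the effect is at least
// tolerancePct in magnitude, the confidence interval does not contain zero,
// and the experiment is not marked erratic.
func isRegression(e Experiment, tolerancePct float64) bool {
	bigEnough := math.Abs(e.DeltaMeanPct) >= tolerancePct
	excludesZero := e.CILow > 0 || e.CIHigh < 0
	return bigEnough && excludesZero && !e.Erratic
}
```

With the 5.00% tolerance used in this run, the file_tree row (+0.72, CI [+0.63, +0.80]) is not flagged: its interval excludes zero, but the effect is smaller than the tolerance.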

pr-commenter[bot] commented 3 months ago

Test changes on VM

Use this command from test-infra-definitions to manually test this PR's changes on a VM:

inv create-vm --pipeline-id=38725070 --os-family=ubuntu

Note: This applies to commit 8855e1b2

szegedi commented 2 months ago

Thanks for the suggestion @adel121! There's currently a draft PR for the e2e onboarding system tests that exercises this functionality (with the drawback that it can only run tests against a released agent). I'm glad to see that there's also an AWS/Pulumi framework locally in the agent. I've since learned from @robertomonteromiguel that the folks who created the agent's e2e framework also helped him create the onboarding system tests infrastructure. I'll figure out how to validate the feature in this e2e framework and follow up with a test.

szegedi commented 2 months ago

/merge

dd-devflow[bot] commented 2 months ago

MergeQueue: pull request added to the queue

The median merge time in main is 32m.

Use /merge -c to cancel this operation!

DataDog / datadog-agent

PROF-10073: Read and propagate helm config for profiling #27185