How do we emit internal telemetry that works with existing Datadog Agent operational tooling?

tobz commented 1 month ago

At a high level, both the Datadog Agent and Agent Data Plane (ADP)/Saluki emit internal telemetry used for debugging performance issues and understanding their operational state. However, the metric names differ substantially between the two, even for metrics that are functionally identical. This makes it challenging to use ADP, as it currently exists, as a drop-in replacement for DogStatsD (DSD) support in the core Agent.

The metric prefix we use when emitting internal metrics is configurable at the tippity top when initializing the metrics subsystem via saluki_app::metrics::initialize_metrics, so that's fine... but how do we line up individual metrics with their spiritual equivalent in the Datadog Agent?

This is a problem we need to solve if we hope to have ADP replace DSD in the core Agent.

tobz commented 1 month ago

One idea: metric remapping.

Conceptually, specific components in Saluki map to specific components in the core Agent. For example, the DogStatsD source in ADP is the dogstatsd component in the Datadog Agent, and the Datadog Metrics destination in ADP is the defaultforwarder component in the Datadog Agent. If we included the component type in internal metrics (e.g., metrics from the Datadog Metrics destination have a component_type tag with a value of datadog_metrics), we could conceivably use that to remap metrics to their Datadog Agent equivalent.

For example, datadog.agent.transactions.errors in the Datadog Agent is used to track "transaction errors", which occur when the default forwarder fails to send a request to the Datadog intake. The error_type tag indicates the specific type of error. Similarly, on the Saluki side, the Datadog Metrics destination emits a component_errors_total metric, with an error_type tag that has a value of http_send, when we fail to send a request.

Since we should expect to have only one Datadog Metrics destination running in ADP, we could conceivably map all instances of component_errors_total where component_type is equal to datadog_metrics to agent.transactions.errors, and potentially map the error_type tag as well.

We could likely do this pretty simply with a dedicated transform that remaps metric names, perhaps even one designed solely for remapping to Datadog Agent-equivalent metric names. The biggest downside, I think, is that we would have to maintain this mapping ourselves rather than having the names line up by default.
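To make the remapping idea concrete, here is a minimal sketch of what such a rule table could look like. Everything in it is hypothetical: `Metric`, `RemapRule`, `RULES`, and `remap` are stand-ins invented for illustration, not Saluki's actual types or transform API. Only the metric and tag names come from the example above.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified internal metric: a name plus string tags.
/// (Not Saluki's real metric type.)
#[derive(Clone, Debug, PartialEq)]
struct Metric {
    name: String,
    tags: BTreeMap<String, String>,
}

/// One remapping rule: match on metric name and `component_type` tag,
/// then rename to the Datadog Agent-equivalent name.
struct RemapRule {
    from_name: &'static str,
    component_type: &'static str,
    to_name: &'static str,
}

/// Illustrative rule table; the single entry mirrors the example above.
const RULES: &[RemapRule] = &[RemapRule {
    from_name: "component_errors_total",
    component_type: "datadog_metrics",
    to_name: "agent.transactions.errors",
}];

/// Renames the metric if a rule matches; otherwise returns it unchanged.
/// Tags are passed through untouched in this sketch.
fn remap(mut metric: Metric) -> Metric {
    let component_type = metric.tags.get("component_type").map(String::as_str);
    let matched = RULES.iter().find(|rule| {
        metric.name == rule.from_name && component_type == Some(rule.component_type)
    });
    if let Some(rule) = matched {
        metric.name = rule.to_name.to_string();
    }
    metric
}
```

A metric named component_errors_total with component_type set to datadog_metrics would come out as agent.transactions.errors, while the same metric from any other component would pass through unchanged. Tag remapping (e.g., translating error_type values) would need an extra field on the rule and is left out here.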

tobz commented 1 month ago

Another idea: change all points where we register metrics to also register Datadog Agent-specific versions.

Essentially, we would emit duplicate metrics -- a generically-named one for "pure" Saluki usage, and a Datadog Agent-specific one -- and that way anything using Saluki that wasn't ADP could have the more generic/flexible metric names, and ADP could still emit the Datadog Agent-specific metric names to meet our goal of being drop-in compatible.
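As a rough sketch of the dual-registration idea, assuming a simple in-memory recorder: `Recorder` and `DualCounter` below are hypothetical names made up for illustration and do not reflect how Saluki actually registers metrics.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for a metrics recorder.
#[derive(Default)]
struct Recorder {
    counters: HashMap<String, u64>,
}

/// A counter registered under a generic Saluki name and, optionally,
/// a Datadog Agent-specific alias.
struct DualCounter {
    generic_name: &'static str,
    agent_name: Option<&'static str>,
}

impl DualCounter {
    /// Increments every name this counter is registered under.
    fn increment(&self, recorder: &mut Recorder, value: u64) {
        // The generic name is always emitted; the alias only if present.
        let names = std::iter::once(self.generic_name).chain(self.agent_name);
        for name in names {
            *recorder.counters.entry(name.to_string()).or_insert(0) += value;
        }
    }
}
```

With agent_name set, each increment lands on both names, which is exactly the duplication described above. With agent_name left as None, only the generic name is emitted, which is what a non-ADP user of Saluki would presumably want.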

This, obviously, means emitting more telemetry than absolutely necessary. If we really didn't want to do that, we could also have a transform that filters out the generically-named metrics, leaving only the Datadog Agent-specific ones. We could also, perhaps, add a toggle for emitting either the Saluki or the Datadog Agent version... but threading that state all through Saluki would be very ugly.
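The filtering variant could be sketched roughly as follows, assuming we know the set of generic names that have an Agent-specific duplicate. `drop_generic` and the `(name, value)` tuple representation are hypothetical simplifications, not an existing Saluki transform.

```rust
use std::collections::HashSet;

/// Drops every metric whose name is in `generic_names`, keeping the rest.
/// Metrics are modeled as plain `(name, value)` pairs for illustration.
fn drop_generic(
    metrics: Vec<(String, u64)>,
    generic_names: &HashSet<&str>,
) -> Vec<(String, u64)> {
    metrics
        .into_iter()
        .filter(|(name, _)| !generic_names.contains(name.as_str()))
        .collect()
}
```

Note that this still requires maintaining a list of which generic names have Agent-specific duplicates, so it shares some of the maintenance burden of the remapping approach.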

DataDog / saluki #118