lahsivjar opened this issue 1 month ago
APM aggregation implements overflow handling based on a set of factors, each with its own threshold. These factors are configured in APM Server based on the available resources, or hard-coded if the use case is known. By default, APM Server derives all limits from the available memory using linear scaling, i.e. `mem_available * constant_factor`. For example, for services the constant factor is `1000`, so an 8GB APM Server will cap the maximum number of services at `8000` (ref).
The number of unique services for the purpose of aggregated metrics is defined as the cardinality of the following set of labels (identified as the service aggregation key):

- `timestamp` truncated based on the aggregation interval
- `service.name`
- `service.environment`
- `service.language.name`
- `agent.name`
When the max service limit for an aggregation interval is reached, a new `service_summary` metric is produced with the service name set to `_other` and the timestamp set to the processing timestamp (service summary metrics in APM Server do not record any numeric value). In addition, another metric named `service_summary.aggregation.overflow_count` is produced, recording the number of service aggregation keys that overflowed in the given interval.
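As an illustrative sketch only (this is not the exact document format written by APM Server), the overflow output for a given interval would look roughly like this:

```yaml
# Illustrative sketch, not the actual APM Server document format
- metric: service_summary
  service.name: _other
  timestamp: <processing timestamp> # no numeric value is recorded
- metric: service_summary.aggregation.overflow_count
  service.name: _other
  value: <number of service aggregation keys that overflowed in the interval>
```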
This overflow service bucket acts as a catch-all for all overflowed aggregated metrics, i.e. in addition to the service count, the overflow service also records service transaction, transaction, and span metrics in their corresponding data types (histograms, `aggregate_metric_double`, or counters) with the following overflow identifiers (sketched below):

- `transaction.type` set to `_other` with `service.name` set to `_other`
- `transaction.name` set to `_other` with `service.name` set to `_other`
- `service.target.name` set to `_other` with `service.name` set to `_other`
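For illustration, and assuming the per-signal breakdown described in the following sections, the overflow datapoints carry attribute values along these lines:

```yaml
# Illustrative mapping of overflow identifiers to the metric families they apply to
service_transaction_metrics:
  service.name: _other
  transaction.type: _other
transaction_metrics:
  service.name: _other
  transaction.name: _other
span_metrics:
  service.name: _other
  service.target.name: _other
```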
Service transactions are the number of unique transaction types within a service (this is based on the current key definition and can change in the future). The number of unique service transactions for the purpose of aggregated metrics is defined as the cardinality of the following set of labels (identified as the service transaction aggregation key):

- `transaction.type`

When the maximum service transaction limit within a service is reached, a new metric is created for the service that reached the limit, with `transaction.type` set to `_other`. This new service transaction now captures ALL the NEW service transactions for the current interval as overflow, within the metric types required for a normal service transaction metric.
Transactions are the number of unique transaction keys within a service, where a transaction key is defined with the following fields:

- `trace.root` (different than transaction root)
- `container.id`
- `kubernetes.pod.name`
- `service.version`
- `service.node.name`
- `service.runtime.name`
- `service.runtime.version`
- `service.language.version`
- `host.hostname`
- `host.name`
- `host.os.platform`
- `event.outcome`
- `transaction.name`
- `transaction.type`
- `transaction.result`
- `faas.coldstart`
- `faas.id`
- `faas.name`
- `faas.version`
- `faas.trigger.type`
- `cloud.provider`
- `cloud.region`
- `cloud.availability_zone`
- `cloud.service.name`
- `cloud.account.id`
- `cloud.account.name`
- `cloud.machine_type`
- `cloud.project.id`
- `cloud.project.name`
When the maximum transaction limit within a service is reached, a new metric is created for the service that reached the limit, with `transaction.name` set to `_other`. This new transaction now captures ALL the NEW transactions for the current interval as overflow, within the metric types required for a normal transaction metric.
Spans are the number of unique span keys within a service, where the span key is defined with the following fields:

- `span.name`
- `event.outcome`
- `service.target.type`
- `service.target.name`
- `service.destination_service.resource`
When the maximum span limit within a service is reached, a new metric is created for the service that reached the limit, with `service.target.name` set to `_other`. This new span now captures ALL the NEW spans for the current interval as overflow, within the metric types required for a normal span metric.
The signal to metrics connector and the LSM interval processor together provide the OTel-native way to do the aggregations required by APM. The signal to metrics connector extracts the aggregated metrics from the incoming signals, and the LSM interval processor simply aggregates those metrics over the specified aggregation intervals. A sample configuration for both components to produce APM aggregated metrics can be perused here.
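For context, a rough sketch of how the two components could be wired together in a collector pipeline is shown below. This is an assumption for illustration only: the component type names (`signaltometrics`, `lsminterval`) and their internal options are not spelled out in this issue, so the per-component configuration is elided; the sample configuration linked above remains the authoritative reference.

```yaml
# Illustrative wiring only; component-specific options are intentionally elided
receivers:
  otlp:
    protocols:
      grpc:

connectors:
  signaltometrics:
    # metric definitions that extract aggregated metrics from incoming signals
    # (see the sample configuration referenced above)

processors:
  lsminterval:
    # aggregation intervals and, per this proposal, the cardinality `limits`
    # section would live here

exporters:
  debug:

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [signaltometrics]
    metrics:
      receivers: [signaltometrics]
      processors: [lsminterval]
      exporters: [debug]
```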
The OTel implementation differs from the APM implementation in that the aggregated metrics cannot be first-class entities. Instead, they have to be defined like any other metrics and processed by the components. To formalize overflow handling with the OTel data model, we have to identify the problem that the overflow handling in APM solves.
The main purpose of aggregating metrics is to reduce storage and query costs by doing the aggregations during ingestion. Aggregation shines when the cardinality of the attributes being aggregated is bounded. High-cardinality attributes blow up the memory requirements, increasing the resources needed for aggregation during ingestion. Unbounded cardinality makes things worse, and aggregations over bigger intervals become almost impractical. Unbounded/high cardinality is usually due to bugs in instrumentation or simply bad instrumentation.
Solving the problem of unbounded/high cardinality attributes requires identifying that there is an issue and reporting it to the owners of the instrumentation. To this end, our OTel pipeline needs a way to identify the cardinality issues during aggregation.
Keeping the above assumptions in mind, the proposal acts like a cardinality limiter over a defined set of attributes, with overflow buckets for when the cardinality exceeds the configured limits.

(The configuration names might be a bit confusing; they will be updated with better names as this evolves.)
```yaml
limits:
  - action: oneOf{"drop", "overflow"} # if drop is configured, the metrics exceeding the limit are simply dropped
    resource_attributes: # A list of resource attributes over which to apply the cardinality limits
      - key: <string> # If the attribute is not present in the input then an empty value will be used for the cardinality calculation
    scope_attributes: [] # A list of scope attributes over which to apply the cardinality limit, empty means no limits
    datapoint_attributes: [] # A list of datapoint attributes over which to apply the cardinality limit
    max_size: <int> # The max cardinality for the above set of attributes
    # The below configuration is only used if action is `overflow`
    overflow: # Defines how overflow buckets/resource metrics will be constructed
      resource_attributes: # A list of static resource attributes to add to the overflow buckets
        - key: <string>
          value: <any>
      scope_attributes: []
      datapoint_attributes: []
```
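As a concrete (hypothetical) instantiation of the schema above, mirroring the APM per-service limit described earlier, one might write:

```yaml
limits:
  - action: overflow
    resource_attributes:
      - key: service.name
      - key: service.environment
      - key: service.language.name
      - key: agent.name
    max_size: 8000 # e.g. the limit an 8GB APM Server would derive for services
    overflow:
      resource_attributes:
        - key: service.name
          value: _other
```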
A few points to note:

- On overflow, the configured `resource_attributes`, `scope_attributes`, or `datapoint_attributes` will be stripped from the incoming data. The resulting metric will then be enriched with the attributes defined in the `overflow` section and added to the existing or new datapoint.

Thanks for the writeup @lahsivjar. Seems reasonable overall. It feels a bit awkward that the metric definitions and the overflow logic live in two different places, but I'm not sure if there's a better option.
What would help me to understand this a bit better is an example configuration of both the `signaltometricsconnector` and the `lsmintervalprocessor`.
One aspect of the APM aggregations that I don't see mentioned, and I'm not sure is possible in the proposal, is the protection against a single service consuming most of the cardinality limit.
Aside from that, I'm wondering if it could make the configuration a bit more concise to combine the `overflow` section with the top-level `*_attributes` sections. For example:

```yaml
resource_attributes:
  - key: <string>
    overflow_value: <any> # optional
```
Aggregations could become unbounded, either due to genuinely huge cardinality or due to buggy instrumentation. To protect against this, the aggregated metrics should have limits and overflow into specific buckets once those limits are breached. This would be similar to what is implemented in https://github.com/elastic/apm-aggregation.