elastic / opentelemetry-collector-components

Apache License 2.0

[processor/lsminterval] Handle overflow for metrics aggregations #141

Open lahsivjar opened 1 month ago

lahsivjar commented 1 month ago

Aggregations can become unbounded with huge cardinality or with buggy instrumentation. To protect against this, the aggregated metrics should have limits and overflow into dedicated buckets once those limits are breached. This would be similar to what is implemented in https://github.com/elastic/apm-aggregation

lahsivjar commented 6 days ago

Current overflow handling in APM Server using apm-aggregation

APM aggregation implements overflow handling based on a set of factors, each with its own threshold. These factors are configured in APM Server based on the available resources, or hard-coded if the use-case is known. By default, APM Server derives all the limits from the available memory using linear scaling with the formula mem_available * constant_factor. For example, the constant factor for services is 1000, so an APM Server with 8GB of memory caps the maximum number of services at 8000 (ref).

Maximum number of services

The number of unique services for the purpose of aggregated metrics is defined as the cardinality of the following set of labels (identified as the service aggregation key):

When the max service limit for an aggregation interval is reached, a new service_summary metric is produced with the service name set to _other and the timestamp set to the processing timestamp (service summary metrics in APM Server don't record any numeric value). In addition, another metric named service_summary.aggregation.overflow_count is produced that records the number of service aggregation keys that overflowed in the given interval.
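As a rough sketch of what this looks like, the per-interval overflow output might be shaped as follows (the _other convention, the metric names, and the timestamp behaviour come from the description above; the exact field layout is an assumption for illustration):

```yaml
# Hypothetical shape of the overflow output for one aggregation interval;
# field layout is illustrative, not the actual APM Server document format.
- metricset.name: service_summary
  service.name: _other                  # catch-all bucket for overflowed services
  "@timestamp": <processing timestamp>  # no numeric value is recorded
- metricset.name: service_summary
  service_summary.aggregation.overflow_count: 42  # illustrative count of overflowed service keys
  "@timestamp": <processing timestamp>
```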

This overflow service bucket acts as a catch-all for all overflowed aggregated metrics, i.e. in addition to the service count, the overflow service also records service transaction, transaction, and span metrics in their corresponding data types (histograms, aggregate_metric_double, or counters) with the following overflow identifiers

Maximum number of service transactions

Service transactions are the number of unique transaction types within a service (based on the current key definition, which can change in the future). The number of unique service transactions for the purpose of aggregated metrics is defined as the cardinality of the following set of labels (identified as the service transaction aggregation key):

When the maximum service transaction limit within a service is reached, a new metric is created for the service that reached the limit, with transaction.type set to _other. This service transaction then captures ALL NEW service transactions for the current interval as overflow, using the metric types required for a normal service transaction metric.

Maximum number of transactions

Transactions are the number of unique transaction keys within a service, where a transaction key is defined by the following fields:

When the maximum transaction limit within a service is reached, a new metric is created for the service that reached the limit, with transaction.name set to _other. This transaction then captures ALL NEW transactions for the current interval as overflow, using the metric types required for a normal transaction metric.

Maximum number of spans

Spans are the number of unique span keys within a service, where the span key is defined by the following fields:

When the maximum span limit within a service is reached, a new metric is created for the service that reached the limit, with service.target.name set to _other. This span then captures ALL NEW spans for the current interval as overflow, using the metric types required for a normal span metric.
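To summarize the per-service overflow buckets described above, each aggregated metric type marks its overflow by setting one identifying field to _other:

```yaml
# Overflow identifiers per aggregated metric type (from the sections above):
service_transaction:
  transaction.type: _other      # max service transactions per service reached
transaction:
  transaction.name: _other      # max transactions per service reached
span:
  service.target.name: _other   # max spans per service reached
```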

lahsivjar commented 6 days ago

Proposal for overflow handling in LSM interval processor

The signal-to-metrics connector and the LSM interval processor together provide the OTel-native way to perform the aggregations required by APM. The signal-to-metrics connector extracts the aggregated metrics from the incoming signals, and the LSM interval processor aggregates those metrics over the specified aggregation intervals. A sample configuration for both components to produce APM aggregated metrics can be perused here.

The OTel implementation differs from the APM implementation because the aggregated metrics cannot be first-class entities. Instead, the aggregated metrics have to be defined like any other metrics and processed by the components. To formalize overflow handling within the OTel data model, we first have to identify the problem that overflow handling in APM solves.

Problem

The main purpose of aggregating metrics is to reduce storage and query costs by performing aggregations during ingestion. Aggregation shines when the cardinality of the attributes being aggregated is bounded. High-cardinality attributes blow up memory usage, increasing the resource requirements for aggregation during ingestion. Unbounded cardinality makes things worse and renders aggregations over larger intervals all but impractical. Unbounded or high cardinality is usually caused by bugs in instrumentation, or simply bad instrumentation.

Assumptions

  1. Aggregations are performed with a specific set of attributes or the attributes are known and defined.
  2. The cardinality of the attributes over which aggregations are performed is expected to be bounded within the defined aggregation intervals.
  3. The component (in our case lsminterval processor) is given enough memory to handle aggregations with the expected cardinality.

Proposal

Solving the problem of unbounded/high cardinality attributes requires identifying that there is an issue and reporting it to the owners of the instrumentation. To this end, our OTel pipeline needs a way to identify the cardinality issues during aggregation.

Keeping the above assumptions in mind, the proposal acts as a cardinality limiter over a defined set of attributes, with overflow buckets for when the cardinality exceeds the defined limits.

(The configuration names might be a bit confusing; we will update them with better names as this evolves.)

limits:
  - action: oneOf{"drop", "overflow"} # if drop is configured, the metrics exceeding the limit are simply dropped
    resource_attributes: # A list of resource attributes over which to apply the cardinality limits
      - key: <string> # If the resource attribute is not present in the input, an empty value is used for the cardinality calculation
    scope_attributes: [] # A list of scope attributes over which to apply the cardinality limit, empty means no limits
    datapoint_attributes: [] # A list of datapoint attributes over which to apply the cardinality limit
    max_size: <int> # the max cardinality for the above set of attributes
    # The configuration below is only used if action is `overflow`
    overflow: # Defines how overflow buckets/resource metrics will be constructed
      resource_attributes: # A list of static resource attributes to add to the overflow buckets
        - key: <string>
          value: <any>
      scope_attributes: []
      datapoint_attributes: []
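As a concrete (purely illustrative) instantiation of the schema above, a limit capping the number of unique services at 1000 and merging the excess into an _other bucket could look like:

```yaml
limits:
  - action: overflow
    resource_attributes:
      - key: service.name       # cardinality is computed over this attribute
    max_size: 1000              # at most 1000 unique services per interval
    overflow:
      resource_attributes:
        - key: service.name
          value: _other         # all overflowing services merge into this bucket
```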

A few points to note:

  1. All the defined limits will be applied in the order they are defined -- this means that overflows can be produced from overflows.
  2. On overflow, all the attributes defined in resource_attributes, scope_attributes, or datapoint_attributes are stripped from the incoming data. The resulting metric is then enriched with the attributes defined in the overflow section and added to an existing or new datapoint.
  3. Point 2 means that we could still see high cardinality in the overflow bucket, since the data may carry more attributes than those defined in limits. However, assuming we know which attributes are being aggregated, we can always cap the overflow by producing further overflows over the other attributes.
  4. In addition to the above configuration, we could also introduce a global limiter on the maximum number of resource metrics, scope metrics, or datapoints as a catch-all protection -- I am not yet sure how this would interact with the limits defined above.

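As an illustration of points 1 and 3 above, two chained limits could first cap the number of services and then cap transaction names, so that the overflow output of the first limit is itself subject to the second (attribute names and sizes are illustrative):

```yaml
limits:
  # Applied first: cap unique services; excess merges into service.name=_other.
  - action: overflow
    resource_attributes:
      - key: service.name
    max_size: 1000
    overflow:
      resource_attributes:
        - key: service.name
          value: _other
  # Applied second, including to the output of the first limit: cap unique
  # transaction names, so overflow can be produced from overflow.
  - action: overflow
    datapoint_attributes:
      - key: transaction.name
    max_size: 5000
    overflow:
      datapoint_attributes:
        - key: transaction.name
          value: _other
```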
axw commented 5 days ago

Thanks for the writeup @lahsivjar. Seems reasonable overall. It feels a bit awkward that the metric definitions and the overflow logic live in two different places, but I'm not sure there's a better option.

felixbarny commented 5 days ago

What would help me understand this a bit better is an example configuration of both the signaltometricsconnector and the lsmintervalprocessor.

One aspect of the APM aggregations that I don't see mentioned, and that I'm not sure is possible in this proposal, is the protection against a single service consuming most of the cardinality limit.

Aside from that, I'm wondering if combining the overflow section with the top-level *_attributes sections could make the configuration a bit more concise. For example:

    resource_attributes:
      - key: <string>
        overflow_value: <any> # optional