elastic / apm-server


Investigate feasibility of calculating breakdown metrics #5936

Open axw opened 3 years ago

axw commented 3 years ago

Currently, if someone uses OTel or Jaeger, the "time spent by span type" chart will be empty: https://github.com/elastic/apm/issues/471. This is because breakdown metrics are computed by the agents, and the OpenTelemetry and Jaeger clients do not produce them.

As an alternative, we could hypothetically have a feature where APM Server computes breakdown metrics. Similar to tail-based sampling, it would work by buffering events. When a transaction is received, the server would assemble all related spans (i.e. those with a matching transaction.id), and compute breakdown metrics.
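As a very rough sketch of what that computation could look like once a transaction and its buffered spans have been assembled (the spanEvent type and the grouping purely by span type are assumptions for illustration, not code from apm-server):

package example

import "time"

// spanEvent is a hypothetical, minimal representation of a buffered span.
type spanEvent struct {
        Type     string // e.g. "db", "external"
        Duration time.Duration
}

// breakdown sums span duration per type and attributes the uncovered
// remainder of the transaction duration to "app". This simplification
// assumes non-overlapping, direct child spans; a real implementation
// would need to handle nesting and concurrency.
func breakdown(txDuration time.Duration, spans []spanEvent) map[string]time.Duration {
        byType := make(map[string]time.Duration)
        var covered time.Duration
        for _, s := range spans {
                byType[s.Type] += s.Duration
                covered += s.Duration
        }
        if app := txDuration - covered; app > 0 {
                byType["app"] = app
        }
        return byType
}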

This would only work (and only be supported) when all events for a given transaction (not necessarily for the whole trace) are sent to the same APM Server. That would be straightforward when running APM Server co-located with the instrumented service.

felixbarny commented 3 years ago

This would only work (and only be supported) when all events are sent to the same APM Server

That was one of the main reasons why we implemented breakdown metrics in agents. With APM Server running on the edge, that certainly becomes viable again. We might even want to do that when receiving data from our agents. Our agents could disable breakdown metrics based on the APM Server version they are connected to.

axw commented 2 years ago

There's a POC in https://github.com/axw/apm-server/tree/breakdown

This branch introduces a new model processor which aggregates transactions and their reachable spans, calculating span breakdown metrics.

One significant issue I came across is how to deal with async spans that start before the transaction ends, but end after it. In this case the transaction is received by apm-server before the span, so we would need to use a timer to wait for these spans. This is the biggest downside of performing the calculation outside of agents. It's much easier within agents because we can take action when a span starts, and not just when it ends.

@felixbarny and I had a short discussion about whether it would be reasonable to revise how we calculate breakdown metrics, such that spans that end after a transaction are completely disregarded. Felix said:

I think that's fair enough. If a span outlives its transaction, one can reasonably assume that it did not impact the response time of the transaction. Following that line of thinking, it's safe to not consider spans that end after their transaction in breakdown metrics.

Definitely seems worthwhile to me to drop the timers, which (I think) lowers complexity and increases efficiency.

With that change in place, we could trigger aggregation on transaction receipt: buffer span events and process reachable spans with the same trace ID when a transaction is received. Spans may be buffered in Badger, like in the POC, to avoid unbounded memory growth.
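A minimal sketch of that flow (in-memory, with hypothetical bufferedSpan/txEvent types; the POC persists spans in Badger rather than a map), including the decision above to discard spans that end after their transaction:

package example

import (
        "sync"
        "time"
)

// Hypothetical minimal event shapes used only for this sketch.
type bufferedSpan struct {
        Type string
        End  time.Time
}

type txEvent struct {
        TraceID string
        End     time.Time
}

// spanBuffer holds spans per trace ID until a transaction is received.
// The POC buffers these in Badger to avoid unbounded memory growth.
type spanBuffer struct {
        mu    sync.Mutex
        spans map[string][]bufferedSpan
}

func (b *spanBuffer) add(traceID string, s bufferedSpan) {
        b.mu.Lock()
        defer b.mu.Unlock()
        if b.spans == nil {
                b.spans = make(map[string][]bufferedSpan)
        }
        b.spans[traceID] = append(b.spans[traceID], s)
}

// onTransaction drains the spans buffered for the transaction's trace,
// discarding any span that ended after the transaction, and returns the
// rest for breakdown aggregation.
func (b *spanBuffer) onTransaction(tx txEvent) []bufferedSpan {
        b.mu.Lock()
        defer b.mu.Unlock()
        buffered := b.spans[tx.TraceID]
        delete(b.spans, tx.TraceID)
        kept := buffered[:0]
        for _, s := range buffered {
                if !s.End.After(tx.End) {
                        kept = append(kept, s)
                }
        }
        return kept
}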

I think the only remaining hurdle is to avoid duplicating these metrics for agents that already calculate breakdowns. A couple of options:

  1. Just enable this for OpenTelemetry and Jaeger for now.
  2. Discard breakdown metrics sent by our agents; deprecate and remove the feature from agents later.

Option 2 is complicated a little by RUM, which produces breakdown metrics for page-load transactions for which there are no corresponding spans (i.e. DNS, TCP, Request, Response, Processing and Load, derived from the Navigation Timing API).

We could add an exception for RUM, and not drop those page-load breakdown metrics. Alternatively, we could explore adding navigation timing details to page-load transaction events. Then the server could aggregate those too, and the details may be useful for filtering and analysing individual page loads.
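If we went the second route, a page-load transaction could carry the navigation timing marks directly; a purely hypothetical shape (field names assumed, not an existing apm-server or RUM agent model) might be:

package example

import "time"

// pageLoadTimings is a hypothetical set of Navigation-Timing-derived
// durations that a RUM agent could attach to a page-load transaction,
// letting the server aggregate breakdowns (and filter individual page
// loads) instead of receiving pre-computed breakdown metrics.
type pageLoadTimings struct {
        DNS        time.Duration
        TCP        time.Duration
        Request    time.Duration
        Response   time.Duration
        Processing time.Duration
        Load       time.Duration
}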

axw commented 2 years ago

Example:


// Imports added for completeness; the semconv package version is assumed.
package example

import (
        "context"
        "time"

        "go.opentelemetry.io/otel"
        semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

func doWork(ctx context.Context) {
        tracer := otel.Tracer("test-tracer")
        ctx, span := tracer.Start(ctx, "example")
        defer span.End()

        // Child span representing a database operation (~50ms).
        _, span1 := tracer.Start(ctx, "db_span")
        span1.SetAttributes(semconv.DBSystemElasticsearch)
        time.Sleep(50 * time.Millisecond)
        span1.End()

        // Child span representing an outgoing gRPC call (~100ms).
        _, span2 := tracer.Start(ctx, "rpc_span")
        span2.SetAttributes(
                semconv.RPCSystemKey.String("grpc"),
                semconv.NetPeerNameKey.String("rpc_server"),
                semconv.NetPeerPortKey.Int(1234),
        )
        time.Sleep(100 * time.Millisecond)
        span2.End()

        // The remaining ~200ms is not covered by any child span.
        time.Sleep(200 * time.Millisecond)
}
davemoore- commented 2 years ago

Does this issue also affect the display of the "Instances" and "Instances latency distribution" panels of the Overview tab and Metrics tab of a service? I'm using OpenTelemetry and I see that these metrics are empty, too, in addition to the "Time spent by span type" panel (here). There is also an error pop-up on those pages that says:

Error while fetching resource

Error
search_phase_execution_exception: [script_exception] Reason: link error (500)

I observed this on Elastic 8.1.2 on Elastic Cloud.

axw commented 2 years ago

@davemoore- no, that would be unrelated to these metrics.

I think that would happen if your OTel traces do not have either a container ID or host name in resource attributes. If you have a small program that reproduces the issue, would you mind providing it in a new issue?