elastic / apm-agent-java

https://www.elastic.co/guide/en/apm/agent/java/current/index.html
Apache License 2.0
567 stars 320 forks source link

span composite count must be > 1 #3265

Open kruskall opened 1 year ago

kruskall commented 1 year ago

Describe the bug

From the APM Agents spec (https://github.com/elastic/apm/blob/main/specs/agents/handling-huge-traces/tracing-spans-compress.md?plain=1#L137-L138):

  • composite
  • count: The number of compressed spans this composite span represents. The minimum count is 2 as a composite span represents at least two spans.

It looks like the java agent is reporting composite spans with a lower count. The following error was observed in ecs logs:

validation error: span: composite: 'count': validation rule 'min(2)' violated

Steps to reproduce

I don't have those unfortunately :(

This is part of the ops KPI review

Expected behavior

Composite count must be > 1

Debug logs

validation error: span: composite: 'count': validation rule 'min(2)' violated

JonasKunz commented 1 year ago

By looking through the code, this seems to only be possible by a very specific race-condition. Given the following trace scenario:

Now the following things need to happen:

  1. DB1 is ended. It is not sent to the APM-server but buffered on parent for future arriving compressed spans
  2. DB2 and Async1 are ended in parallel at the same time a. DB2 initializes the compression-datastruction on DB1 this causes count=1 b. Async1 sees that it cannot compress with DB1 and causes DB1 to be reported immediately c. DB2 increments the count on DB1 due to compression. This doesn't have any effect because DB1 is already sent.

I think what we could do is guard the compression process with a WriterReaderPhaser: During compression DB2 would register itself as Writer on DB1 before performing any mutations. The reporting-thread would swap the phase before reporting a span, waiting for pending compressions to end. This however opens up additional things to consider: what do we do when a span attempts compression, but the phase of the WriterReaderPhaser indicates that the span he wants to compress with has already been sent? E.g. retry a fixed amount of times?

I'd love to have some input from @elastic/apm-agent-java (and maybe @felixbarny and @eyalkoren as original authors) here. Is it known (and accepted) that the span-compression is prone to lost-update race conditions?

As a bandaid fix to prevent spans from being rejected by the APM-server I'll add a guard to the serialization logic so that we don't report spans with count=1 as compressed.

jackshirazi commented 1 year ago

Span compression is purely for improved UI and I don't think it's worth a lot of effort to handle edge cases in high contention cases other than to just drop the odd compression rather than add load. So I think the proposed band aid is sufficient