Open kruskall opened 1 year ago
By looking through the code, this seems to only be possible by a very specific race-condition. Given the following trace scenario:
Now the following things need to happen:
DB1
is ended. It is not sent to the APM-server but buffered on parent
for future arriving compressed spansDB2
and Async1
are ended in parallel at the same time
a. DB2
initializes the compression-datastruction on DB1
this causes count=1
b. Async1
sees that it cannot compress with DB1
and causes DB1
to be reported immediately
c. DB2
increments the count
on DB1
due to compression. This doesn't have any effect because DB1
is already sent.I think what we could do is guard the compression process with a WriterReaderPhaser
: During compression DB2
would register itself as Writer
on DB1
before performing any mutations. The reporting-thread would swap the phase before reporting a span, waiting for pending compressions to end.
This however opens up additional things to consider: what do we do when a span attempts compression, but the phase of the WriterReaderPhaser
indicates that the span he wants to compress with has already been sent? E.g. retry a fixed amount of times?
I'd love to have some input from @elastic/apm-agent-java (and maybe @felixbarny and @eyalkoren as original authors) here. Is it known (and accepted) that the span-compression is prone to lost-update race conditions?
As a bandaid fix to prevent spans from being rejected by the APM-server I'll add a guard to the serialization logic so that we don't report spans with count=1
as compressed.
Span compression is purely for improved UI and I don't think it's worth a lot of effort to handle edge cases in high contention cases other than to just drop the odd compression rather than add load. So I think the proposed band aid is sufficient
Describe the bug
From the APM Agents spec (https://github.com/elastic/apm/blob/main/specs/agents/handling-huge-traces/tracing-spans-compress.md?plain=1#L137-L138):
It looks like the java agent is reporting composite spans with a lower count. The following error was observed in ecs logs:
validation error: span: composite: 'count': validation rule 'min(2)' violated
Steps to reproduce
I don't have those unfortunately :(
This is part of the ops KPI review
Expected behavior
Composite count must be > 1
Debug logs
validation error: span: composite: 'count': validation rule 'min(2)' violated