appsignal / opentelemetry-instrumentation-bullmq

OTel Auto-instrumentation for BullMQ
Apache License 2.0
4 stars 3 forks source link

Sampled TraceId produces recorded spans #13

Closed Steffen911 closed 1 month ago

Steffen911 commented 1 month ago

What happened

We added sampling to our application to reduce the total amount of spans ingested. We use head based sampling and make the decision on the producer side. We use the new useProducerSpanAsConsumerParent flag which means that we'd expect that the sample decision propagates down to the consumer spans. In our case, we still see consumer spans in our observability tool, even though the TraceFlag is set to 00. We validate the latter using the traceparent on the job opts.

What we expect to happen

When the TraceFlags on the producerContext are set to 00 we expect the same thing on the consumer, i.e. the trace should be excluded from recording, if the producerContext is to be used.

Solution approach

@unflxw I think the root cause might be that we still use the consumerContext as active here: https://github.com/appsignal/opentelemetry-instrumentation-bullmq/blob/main/src/instrumentation.ts#L511. Perhaps we should use the parentContext variable in this row as well?

unflxw commented 1 month ago

Hi @Steffen911, thanks for reporting this issue!

Yes, I think you're right that we should use parentContext in that line. (Could you send a PR about it? 🙏)

However, I'm not sure if that will solve the problem you're encountering. In that line, we create a context for the newly created span. But said span was created using trace.startSpan(..., parentContext). If the trace flags were being honored, shouldn't that span have been created as a NonRecordingSpan? 🤔

Do give it a try! If it solves the problem, I'm happy to merge it.

Steffen911 commented 1 month ago

Hey @unflxw,

I'll open the PR for the parentContext, because I think it would be the correct setup anyway. It doesn't seem to fix the issue though based on my local experiments. It seems that the TraceFlags aren't interpreted when creating the span in https://github.com/appsignal/opentelemetry-instrumentation-bullmq/blob/main/src/instrumentation.ts#L474 as the resulting span has a _spanContext.traceFlags = 1 and isRecording() === true.

Do you have any other ideas what it could be?

Steffen911 commented 1 month ago

I added https://github.com/appsignal/opentelemetry-instrumentation-bullmq/pull/14, but it doesn't seem to address this issue.

unflxw commented 1 month ago

@Steffen911 Do you have any other ideas what it could be?

I think the default sampling behavior is to sample all traces, which is what you're seeing, regardless of the trace flags. I think you may need to configure OpenTelemetry to use the ParentBasedSampler, so that it takes into account whether the parent span was sampled. Likely in the way that is described in their documentation for the TraceIDRatioBasedSampler, but for the ParentBasedSampler instead.

Steffen911 commented 1 month ago

@unflxw Thank you so much! That was actually the perfect hint. We used a TraceIDRatioBasedSampler in all of our services which overwrote the ParentBasedSampler (which seems to be the default). So not using the TraceIDRatioBasedSampler fixed it for us.