elastic / elastic-otel-node

https://github.com/elastic/elastic-otel-node/tree/main/packages/opentelemetry-node#readme
Apache License 2.0

Traces sampler also total transactions #438

Open ezioda004 opened 1 week ago

ezioda004 commented 1 week ago

Hello,

I have a custom OTel-based instrumentation that I was testing with ElasticNodeSDK. With a 1% sample rate, I can see that both the total span count and the total transaction count metrics are being sampled. I expected only traces to be sampled, not the transaction count.

Any way to fix this?

This is the config I used:

OTEL_TRACES_SAMPLER="traceidratio"
OTEL_TRACES_SAMPLER_ARG="0.01"

I'm using Elastic Observability Cloud for APM.

david-luna commented 1 week ago

Hi @ezioda004,

Thanks for using our Elastic Distribution of OpenTelemetry for Node.js :)

OpenTelemetry does not have the concept of a transaction and only exports Spans when instrumenting a service. What corresponds to a transaction is the Span of the incoming HTTP request. The Elastic Stack is smart enough to find which spans are transactions and shows them in your Kibana service detail view.

Samplers work on every Span, including the ones that correspond to a transaction. That's why you see transactions being sampled as well. I need to double-check, but I think the transaction count is calculated using a query to Elasticsearch.

It's worth mentioning that this configuration samples each span regardless of whether its parent span was sampled, which may not be what you want. With this config, a sampled span (a transaction one) may have some of its child spans sampled out, resulting in gaps in the traces you look at in Kibana. If you want to collect all spans from a sampled root span (aka transaction), you can set OTEL_TRACES_SAMPLER="parentbased_traceidratio" so that child spans of a sampled span are always sampled as well. With this sampler configuration you will get behaviour similar to what transactionSampleRate gives in elastic-apm-node.
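To make the difference between the two sampler settings concrete, here is a small sketch in plain JavaScript. It is a simplified model, not the SDK's implementation: `ratioDecision` and `parentBasedDecision` are hypothetical helper names, and the way the trace ID is turned into a number is an assumption chosen for illustration only.

```javascript
// Simplified model, NOT the SDK implementation: derive a number in [0, 1)
// from the last 8 hex characters of the trace ID and compare it to the ratio.
function ratioDecision(traceId, ratio) {
  const bucket = parseInt(traceId.slice(-8), 16) / 0x100000000;
  return bucket < ratio;
}

// Parent-based wrapper: a span that has a parent inherits the parent's
// decision; only root spans (no parent) consult the ratio.
function parentBasedDecision(parent, traceId, ratio) {
  if (parent) return parent.sampled;
  return ratioDecision(traceId, ratio);
}

// Two made-up trace IDs at the extremes of the bucket range.
const lowId = '0'.repeat(24) + '00000001';
const highId = '0'.repeat(24) + 'ffffffff';

console.log(ratioDecision(lowId, 0.01));                          // true
console.log(ratioDecision(highId, 0.01));                         // false
console.log(parentBasedDecision({ sampled: true }, highId, 0.01)); // true
console.log(parentBasedDecision(null, lowId, 0.01));              // true
```

In this model a child span never contradicts its parent, which is why the parent-based variant avoids gaps inside a sampled trace. Note that in both variants the root span itself is still subject to the ratio.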

ezioda004 commented 1 week ago

Thanks for the quick response @david-luna.

From the elastic docs:

Regardless of the sampling decision, all traces retain transaction and error data. This means the following data will always accurately reflect all of your application’s requests, regardless of the configured sampling rate:

  • Transaction duration and transactions per minute
  • Transaction breakdown metrics
  • Errors, error occurrence, and error rate

That does not seem to be the case here. I tried parentbased_traceidratio, but the total requests, TPM, etc. are still sampled as well.

david-luna commented 1 week ago

Hi @ezioda004

This was also dropped for APM agents sending data to APM Server versions > v8.0. You can check it in the specs: https://github.com/elastic/apm/blob/main/specs/agents/tracing-sampling.md#non-sampled-transactions

It may be possible to implement your own sampler and pass it to ElasticNodeSDK configuration but I need to investigate further.

ezioda004 commented 1 week ago

Sure @david-luna

I guess what I'm looking for is a way to sample transactions/spans at the application level but still send the complete transaction/span metrics which are mentioned here.

Currently, sending 100% sampled spans is reducing my application's throughput by 2x in terms of TPM. So I want to keep a config like 10% sampled spans and 100% endpoint metrics as a balance.

david-luna commented 5 days ago

@ezioda004

I've dug a bit into Samplers and I cannot see a clear way of getting this behavior. However, I think you may achieve what you want with a custom implementation by:

This way you will always have the root span of the incoming HTTP request and:

This is a very simple example so you get the idea.

const { context } = require('@opentelemetry/api');
const { suppressTracing } = require('@opentelemetry/core');

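// Utility function: wraps a handler so that, when the request is not sampled,
// the handler runs in a context where tracing is suppressed.
function toSampledFn (fn) {
  return function () {
    // Placeholder decision (keeps roughly 10%); replace with your sampling logic.
    const shouldSample = Math.random() < 0.1;
    const self = this;

    if (shouldSample) {
      return fn.apply(self, arguments);
    }

    // Not sampled: spans started inside fn are suppressed.
    return context.with(suppressTracing(context.active()), () => fn.apply(self, arguments));
  }
}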

// in your code
app.use(toSampledFn(function (req, res, next) {
  // Your app logic here
}));

ezioda004 commented 5 days ago

Hi @david-luna

Thanks for this, I'll try this. This will do custom sampling, which should work for sampling out spans. I'm not clear on how this will ensure that transaction related metrics will be counted and sent to elastic APM. Could you help me understand that?

david-luna commented 5 days ago

Could you help me understand that?

Usually, by the time your request handler kicks in, @opentelemetry/instrumentation-http has already started a Span for the incoming request. Whatever operations you do inside the handler will run within the context of that root Span. The line

return context.with(suppressTracing(context.active()), () => fn.apply(self, arguments));

tells the OpenTelemetry API to run the target function with a given context. In this case it is the same active context, modified so that NoopSpans are produced if any instrumentation or user code uses the API to start a new Span.

Then, at export time, the root span is sent and all child NoopSpans are dropped, so you end up with only the span that corresponds to the transaction.
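The effect described above can be illustrated with a toy model in plain JavaScript. This is only a sketch of the idea: `startSpan`, `withSuppressedTracing`, `handleRequest`, and the `exported` array are hypothetical stand-ins and do not use or reproduce the OpenTelemetry API; the span names are made up.

```javascript
// Toy model only: these helpers mimic the idea, not the OpenTelemetry API.
const exported = [];      // stands in for what the exporter would send
let suppressed = false;   // stands in for the flag stored in the active context

function startSpan(name) {
  // While suppression is active, hand back a no-op span that is never exported.
  if (suppressed) return { name, end() {} };
  return { name, end() { exported.push(name); } };
}

function withSuppressedTracing(fn) {
  const previous = suppressed;
  suppressed = true;
  try {
    return fn();
  } finally {
    suppressed = previous; // restore the outer context afterwards
  }
}

function handleRequest(shouldSample) {
  // The root span already exists before the handler runs.
  const root = startSpan('GET /users');
  const handler = () => {
    startSpan('db.query').end();
    startSpan('cache.get').end();
  };
  if (shouldSample) handler();
  else withSuppressedTracing(handler);
  root.end();
}

handleRequest(false); // only the root span is exported
handleRequest(true);  // root span plus both child spans are exported
console.log(exported);
// [ 'GET /users', 'db.query', 'cache.get', 'GET /users' ]
```

In this model every request contributes its root span, while child spans only appear for requests where the sampling decision was positive.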