Field-level metrics sampling rate calculation for looks wrong

SimonSapin commented 1 year ago

In the telemetry plugin, Conf::calculate_field_level_instrumentation_ratio looks at two configuration keys: the "global rate" tracing.trace_config.sampler and "field-level rate" apollo.field_level_instrumentation_sampler. If both are specified as non-zero numbers, they are divided:

https://github.com/apollographql/router/blob/1bbfc16327d954485268cb844564a4fcdaa98a0d/apollo-router/src/plugins/telemetry/config.rs#L614-L617

This result is meaningful as the probability to use to randomly select a request for field-level metrics, assuming it had already been randomly selected with the global rate. Combining probabilities like this effectively multiplies them, the division compensates for that.

However that’s not how it’s used. Instead the result calculate_field_level_instrumentation_ratio is used as a random probability in populate_context() which is run unconditionally:

https://github.com/apollographql/router/blob/1bbfc16327d954485268cb844564a4fcdaa98a0d/apollo-router/src/plugins/telemetry/mod.rs#L851-L853

The default global sampling rate is 100%, so division by 1 doesn’t affect the result. However if it is configured to something less, we might enable field-level metrics more frequently than desired.

Relatedly, field-level metrics contain estimated counts that use a weight to compensate for sampling. This weight is currently based on the result of the division above but probably shouldn’t.

Again this only makes a difference with non-default config. And I think the two bug might cancel each other, so that estimated numbers end up correct anyway.

https://github.com/apollographql/router/blob/1bbfc16327d954485268cb844564a4fcdaa98a0d/apollo-router/src/plugins/telemetry/mod.rs#L1253-L1258

Geal commented 1 year ago

FYI with #2908 and #2894, these sampling rates will become independent. The Apollo telemetry plugin will see all traces, then apply the field level instrumentation sampling

lrlna commented 8 months ago

@SimonSapin is this still the case? I think we've done enough work here that this issue is likely no longer true.

apollographql / router

Field-level metrics sampling rate calculation for looks wrong #2721