envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0
25.12k stars 4.82k forks source link

opentelemetrytracer: allow configuring `flush_interval_ms` and `min_flush_spans` with tracer config #36422

Open joaopgrassi opened 1 month ago

joaopgrassi commented 1 month ago

Title: allow configuring flush_interval_ms and min_flush_spans with tracer config

Description: The OpenTelemetry tracer has two runtime settings that control the behavior of when spans are exported. One is flush_interval_ms which controls the timeout for a export to happen and the other is min_flush_spans which controls the size of the batch to trigger also a export.

Both of these today need to be enabled via the runtime config. I may be lacking context but when using Istio, to be able to change these I would need to deploy a ConfigMap and add a annotation to all my pods in order for Istio to know that it needs to "merge" this config when injecting the sidecar proxies.

I'm not sure if deploying configmaps is a common thing and users are used to it, but as a OpenTelemetry contributor/user, such configurations are part of the "SDK" configuration. For ex, the batch processor has similar parameters. Similarly, the OTel collector also has the same.

So I thought Envoy could also expose these two properties as part of the "normal" OTel tracer:

provider:
  name: envoy.tracers.opentelemetry
  typed_config:
    "@type": type.googleapis.com/envoy.config.trace.v3.OpenTelemetryConfig
    service_name: envoy-HTTP-exporter
    [other config]
    flush_interval_ms: 6000 <<<<<< NEW
    min_flush_spans: 512    <<<<<< NEW

[optional Relevant Links:] https://github.com/envoyproxy/envoy/issues/35997#issuecomment-2388492731

kyessenov commented 1 month ago

CC @AlexanderEllis

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

joaopgrassi commented 3 weeks ago

Can we make this never stale?

arielvalentin commented 3 weeks ago

Am I correct in my current understanding of the OTel Tracer implementation.

Is it currently using a default of min_flush_spans to 5 spans?

https://github.com/envoyproxy/envoy/blob/52b8519fb5c9f94e00e57c129c1eb1a6eaf9f5a7/source/extensions/tracers/opentelemetry/tracer.cc#L247

Does that mean a batch will have at most 5 spans before exporting it to the collector?

https://github.com/envoyproxy/envoy/blob/52b8519fb5c9f94e00e57c129c1eb1a6eaf9f5a7/source/extensions/tracers/opentelemetry/tracer.cc#L248

joaopgrassi commented 3 weeks ago

Yes, that's also my understanding. The values are very low. Together with this change in allowing them to be configured like the others settings are, I'd also like to increase the defaults, so they align with the BatchSpanProcessor spec: https://opentelemetry.io/docs/specs/otel/trace/sdk/#batching-processor

arielvalentin commented 3 weeks ago

And there isn't any way to set them now because the values are absent in the proto correct?

joaopgrassi commented 3 weeks ago

And there isn't any way to set them now because the values are absent in the proto correct?

You can set them yes. They are set via the runtime config https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/runtime

For example, when using Istio, they can be globably set via the MeshConfig:ProxyConfig https://istio.io/latest/docs/reference/config/istio.mesh.v1alpha1/#ProxyConfig like so:

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      runtimeValues:
        tracing.opentelemetry.flush_interval_ms: "300000" <<<<< ADD HERE

Those land then as a static layered config in Envoy:

      - name: global config
        static_layer:
          envoy.deprecated_features:envoy.config.listener.v3.Listener.hidden_envoy_deprecated_use_original_dst: true
          envoy.reloadable_features.http_reject_path_with_fragment: false
          overload.global_downstream_max_connections: "2147483647"
          re2.max_program_size.error_level: "32768"
          tracing.opentelemetry.flush_interval_ms: "300000"

The "problem" is that it's not that obvious to set them like this and I'd like them to be part of the proto, together with the rest of the other OTel tracer configs.