etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0
46.83k stars 9.65k forks source link

Distributed Tracing Not Working in v3.5.5+ #16935

Closed ugur99 closed 7 months ago

ugur99 commented 7 months ago

Bug report criteria

What happened?

Hi team, we're trying to enable distributed tracing for etcd, and we've configured it with the required flags according to the documentation. However, it does not seem to work properly with the v3.5.5+ (same config works fine with v3.5.4 without any problems); not sure if its a bug/regression or misdocumentation problem. Do you have any ideas or suggestions on this? Thanks in advance!

What did you expect to happen?

Distributed tracing should work for later versions too; it seems that it does not work with the required config flags that put in documentation.

How can we reproduce it (as minimally and precisely as possible)?

You can check both etcd versions with the same configuration. To providing such a playground you can provision a cluster by kind (or kubespray) and install required apps like tempo, opentelemetry operator. After than provision a collector that uses localNetwork on controlplane nodes and observe the different behaviours between v3.5.4 and v3.5.5+

collector config -->

```yaml apiVersion: opentelemetry.io/v1alpha1 kind: OpenTelemetryCollector metadata: name: collector spec: mode: daemonset nodeSelector: node-role.kubernetes.io/controlplane: "" hostNetwork: true config: | receivers: otlp: protocols: grpc: endpoint: "0.0.0.0:4317" http: endpoint: "0.0.0.0:4318" exporters: logging: otlphttp: endpoint: http://my-tempo.tempo:4318 processors: batch: extensions: health_check: {} service: extensions: [health_check] pipelines: traces: receivers: [otlp] processors: [batch] exporters: [logging, otlphttp] ```

Anything else we need to know?

No response

Etcd version (please run commands below)

```console works with 3.5.4 but does not work with 3.5.5+ ```

Etcd configuration (command line flags or environment variables)

    - --experimental-enable-distributed-tracing=true
    - --experimental-distributed-tracing-address=0.0.0.0:4317
    - --experimental-distributed-tracing-service-name=etcd
    - --experimental-distributed-tracing-instance-id=testtrace

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

No response

Relevant log output

No response

serathius commented 7 months ago

cc @dashpole

jmhbnz commented 7 months ago

Hey @ugur99 thanks for raising this. I'm wondering if what you are seeing is related to the following which was introduced between 3.5.4 and 3.5.5:

It looks like for main / 3.6 we also have a configurable sample rate option: https://github.com/etcd-io/etcd/pull/13248/files#diff-538c79cd00ec18cb43b5dddd5f36b979d9d050cf478a241304493284739d31bfR353

However that sample rate option doesn't appear to be documented yet? Refer: https://etcd.io/docs/v3.6/op-guide/configuration/#experimental-distributed-tracing

We might need to do a backport of the ExperimentalDistributedTracingSamplingRatePerMillion configuration option to resolve this for 3.5 stable release branch.

jmhbnz commented 7 months ago

@serathius can you please add an area/observability label. We don't really have any way of grouping tracing/loging/metrics related issues currently.

ugur99 commented 7 months ago

thank you both @serathius @jmhbnz for the quick replies! It might be a good idea as well to document current status in v3.5 documentation as well.

jmhbnz commented 7 months ago

Have started working on a fix https://github.com/etcd-io/etcd/pull/16951. Need to setup a tracing test environment to validate before marking ready for review.

Edit: Pull request is now ready for review.

jmhbnz commented 7 months ago

Hey @ugur99 - A quick update, the code fix has been merged in #16951 and our documentation updated in https://github.com/etcd-io/website/pull/753. The fix will be available for users very shortly in our upcoming v3.5.11 stable release.

I'll close this issue now as the work is completed, many thanks for reporting this bug 🙏🏻