Not all traces are getting to GCP

dashpole commented 7 months ago

Original issue: https://github.com/open-telemetry/community/discussions/2082 by @sdsani

Hi all, our team has been asked to deploy an application on Google Cloud Run as a service. One of the action items is setting up tracing and monitoring for the application, and we decided to use OpenTelemetry. We have two services: a ReactJS app and a Spring Boot app. Both use the OTel Collector as a sidecar. Below are a few of the docs we followed to configure our application.

https://cloud.google.com/run/docs/tutorials/custom-metrics-opentelemetry-sidecar
https://opentelemetry.io/docs/languages/java/automatic/spring-boot/
And many other docs.

We got it working in the end. We can see traces showing up in Google Cloud Trace, metrics are arriving in Cloud Monitoring, and so on. However, during testing we found that not all traces reach Google Cloud Trace. Sometimes every second trace arrives, and sometimes only every fourth. We have looked at many different settings and experimented with them, but none of them appears to make any difference. We need a more reliable config, since missing traces would not be acceptable for production support.

Below is the image that we are using for the OTel Collector sidecar (pulled from Docker Hub): otel/opentelemetry-collector-contrib:0.99.0

This image has been customized with a config file, which is attached (collector-config.yaml.txt). The YAML file for the Cloud Run service config is also attached (cloud-run-service-config.yaml.txt). Please advise.

cloud-run-service-config.yaml.txt collector-config.yaml.txt

In our Spring Boot config, we already set the sampling rate to 1.0. We have also tried adding the following to the env for the collector sidecar, and it does not help either.

env:
  - name: OTEL_TRACES_SAMPLER
    value: traceidratio
  - name: OTEL_TRACES_SAMPLER_ARG
    value: "1.0"

dashpole commented 7 months ago

cc @ridwanmsharif @damemi

dashpole commented 7 months ago

> Have tried adding following to the env for the collector sidecar also and this is not helping either.

You would need to configure sampling in the auto-instrumentation agent, rather than in the collector, since the sampling decision is made by the auto-instrumentation agent. See https://opentelemetry.io/docs/languages/java/automatic/configuration/ for the configuration, and https://github.com/open-telemetry/opentelemetry-java/blob/main/sdk-extensions/autoconfigure/README.md#sampler for the sampler config for the agent.

Another suggestion, if the above doesn't fix it, is to remove the batch processor, as that can cause delay between the application and export.

sdsani commented 7 months ago

Thanks for the quick response. We are using the OpenTelemetry Spring Boot starter, so we followed this link: https://opentelemetry.io/docs/languages/java/automatic/spring-boot/

Below is my Gradle config.

implementation("io.opentelemetry.instrumentation:opentelemetry-spring-boot-starter")
implementation("io.opentelemetry.contrib:opentelemetry-samplers:1.35.0-alpha")
implementation("io.opentelemetry.contrib:opentelemetry-gcp-resources:1.35.0-alpha") {
    exclude group: 'com.fasterxml.jackson.core', module: 'jackson-core'
}

Below is my application properties config, where I am setting up the sampling.

management:
  tracing:
    sampling:
      probability: 1.0
  otlp:
    metrics:
      distribution:
        slo.test:
          timer: "10.0,100.0,500.0,1000.0"
        percentiles:
          test:
            timer: "0.9,0.99"
        percentiles-histogram:
          test:
            timer: true
      export:
        enabled: true
      tracing:
        export:
          timeout: 5s
          compression: gzip

And below is my OTEL config

otel:
  exporter:
    otlp:
      protocol: grpc
  instrumentation:
    spring-webflux:
      enabled: true
    spring-web:
      enabled: true
  propagators:
    - tracecontext
  resource:
    providers:
      gcp:
        enabled: true
    attributes:
      service.name: app-name-here
      development.environment: gcp

Any idea what I am missing here?

sdsani commented 7 months ago

I have already tried removing the batch processor, and that made no difference.

dashpole commented 7 months ago

I'm not particularly familiar with the config formats you've shared, but I suspect you need to set the sampler. The default from OTel is parent-based, always-on. Since Cloud Run creates a parent span and decides whether or not to sample it, you will inherit the sampling decision that Cloud Run made, even if you have the sampling rate set to 100%.
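To make the parent-based behaviour concrete, here is a minimal, self-contained sketch of the decision logic. It is a simplified model, not the OpenTelemetry SDK API: the `Sampler` interface, `ALWAYS_ON`, and `parentBased` below are hypothetical stand-ins for the SDK's `Sampler.alwaysOn()` and `Sampler.parentBased(...)`, and the parent's decision is reduced to an optional boolean.

```java
import java.util.Optional;

// Simplified model of parent-based sampling. These types are illustrative
// stand-ins, not the real OpenTelemetry SDK classes.
final class ParentBasedSamplingSketch {

    /** Decides whether a new span is sampled, given the parent's decision (if any). */
    interface Sampler {
        boolean shouldSample(Optional<Boolean> parentSampled);
    }

    /** Samples every span and ignores the parent's decision. */
    static final Sampler ALWAYS_ON = parentSampled -> true;

    /** Follows the parent's decision when there is a parent; otherwise asks the root sampler. */
    static Sampler parentBased(Sampler root) {
        return parentSampled ->
                parentSampled.orElseGet(() -> root.shouldSample(Optional.empty()));
    }

    public static void main(String[] args) {
        Sampler defaultLike = parentBased(ALWAYS_ON);

        // Parent (for example, the span created by the platform) was not sampled:
        System.out.println(defaultLike.shouldSample(Optional.of(false))); // false
        // Parent was sampled:
        System.out.println(defaultLike.shouldSample(Optional.of(true)));  // true
        // No parent at all, so the root sampler (always on) decides:
        System.out.println(defaultLike.shouldSample(Optional.empty()));   // true
        // A plain always-on sampler ignores the unsampled parent:
        System.out.println(ALWAYS_ON.shouldSample(Optional.of(false)));   // true
    }
}
```

If this model matches what is happening here, the SDK autoconfigure sampler value to try would be `always_on` rather than the default `parentbased_always_on`, since only the former ignores an unsampled parent.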

sdsani commented 7 months ago

Is there a way we can control this setting on the Cloud Run side? I hope you are getting my point. What if there is a trace that we are interested in for some reason and it does not show up in Google Cloud Trace?

dashpole commented 7 months ago

I don't believe it is configurable on Cloud Run today, although Cloud Run does mostly respect the traceparent header's sampling decision (up to a point, after which it rate limits), so if requests to the Cloud Run service are already sampled, you should get nearly 100% sampling.
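For reference, the sampling decision carried by a W3C `traceparent` header is the lowest bit of its last field (trace-flags). The sketch below shows one way to check it with plain Java; the class and method names are hypothetical, and the sample header values are the ones used as examples in the W3C Trace Context specification.

```java
// Checks the "sampled" flag of a W3C traceparent header.
// Header layout: version-traceid-parentid-traceflags, all hex, joined by "-".
final class TraceparentSketch {

    /** Returns true if bit 0 (sampled) of the trace-flags field is set. */
    static boolean isSampled(String traceparent) {
        String[] parts = traceparent.trim().split("-");
        if (parts.length < 4 || parts[3].length() != 2) {
            throw new IllegalArgumentException("not a valid traceparent: " + traceparent);
        }
        int flags = Integer.parseInt(parts[3], 16); // trace-flags is two hex digits
        return (flags & 0x01) != 0;
    }

    public static void main(String[] args) {
        // Flags "01": the caller has marked this trace as sampled.
        System.out.println(isSampled("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")); // true
        // Flags "00": the caller has not marked it as sampled.
        System.out.println(isSampled("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-00")); // false
    }
}
```

A client that sends requests with the sampled flag set (flags `01`) is what "requests are already sampled" refers to above; whether the platform honours it beyond its rate limit is outside what this check can show.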

sdsani commented 7 months ago

Our team found the following link, which shows that Cloud Run does not support configuring its trace sampling rate: https://cloud.google.com/run/docs/trace#trace_sampling_rate It also says that requests per service are sampled at 10 requests per second. In my testing, I am sure that I am not exceeding this limit.

GoogleCloudPlatform / opentelemetry-cloud-run

Not all traces are getting to GCP #18