apollographql / router

A configurable, high-performance routing runtime for Apollo Federation 🚀
https://www.apollographql.com/docs/router/
Other
807 stars 272 forks source link

OpenTelemetry trace error #2058

Open deweyjose opened 1 year ago

deweyjose commented 1 year ago

Describe the bug We're seeing the following error in our Router logs:

{"timestamp":"2022-11-07T14:34:48.904971Z","level":"ERROR","fields":{"message":"OpenTelemetry trace error occurred: error sending request for url (*********/traces): connection closed before message completed"},"target":"apollo_router::plugins::telemetry"}
o0Ignition0o commented 1 year ago

Thanks for opening this issue! This could be tied to #2046

kmcrawford commented 1 year ago

I'm seeing a similar issue, but may not be related:

{"timestamp":"2022-11-08T19:24:19.723074Z","level":"ERROR","fields":{"message":"OpenTelemetry trace error occurred: oneshot canceled"},"target":"apollo_router::plugins::telemetry"}

Running v1.2.1

BrynCooke commented 1 year ago

Not sure about the oneshot, but the other issue is likely to be caused by the BatchSpanProcessor defaults not being good for production.

In the short term the following env variables are available to tweak: https://opentelemetry.io/docs/reference/specification/sdk-environment-variables/#batch-span-processor

Longer term we are looking to add this configuration to the router.yaml

BrynCooke commented 1 year ago

Going to close this for now in favor of #1789 as I think this is resolved by tweaking the batch span processor settings. If it still persists then feel free to reopen.

jtribble commented 9 months ago

This issue is till occurring in v1.37.0 (#1789 appears to be a different issue; notice the error message is different).

Here are the relevant error messages:

OpenTelemetry trace error occurred: error sending request for url (http://localhost:8126/v0.5/traces): connection closed before message completed
OpenTelemetry trace error occurred: error sending request for url (http://localhost:8126/v0.5/traces): connection error: Connection reset by peer (os error 104)

Here's my telemetry config (notice the errors are related to the datadog tracer):

telemetry:
  exporters:
    metrics:
      common:
        service_name: ${env.DD_SERVICE:-graphql-router}
      prometheus:
        enabled: true
        listen: 0.0.0.0:9090
        path: /metrics
    tracing:
      common:
        service_name: ${env.DD_SERVICE:-graphql-router}
      datadog:
        enabled: true
        endpoint: ${env.DD_ENDPOINT}
        enable_span_mapping: true

Could be related to https://github.com/open-telemetry/opentelemetry-rust-contrib/issues/7

Geal commented 7 months ago

reopening as it is apparently still not resolved and we are getting more reports of it. Apparently the Datadog agent is correctly receiving the trace, but abruptly closing the connection. Other libraries have workarounds for it: https://github.com/will-bank/datadog-tracing/blob/30cdfba8d00caa04f6ac8e304f76403a5eb97129/src/tracer.rs#L29

BrynCooke commented 3 months ago

I had a go at this and although better it still produced occasional errors in the logs.