apollographql / router

A configurable, high-performance routing runtime for Apollo Federation 🚀
https://www.apollographql.com/docs/router/

Traces/metrics not reported after a while using OTLP exporter #3392

Closed Meemaw closed 11 months ago

Meemaw commented 1 year ago

Describe the bug: After a while, traces and some metrics stop being reported by the router when using the OTLP exporter (to Datadog). The timing varies, but it usually takes a few hours. Traces are always missing when this happens; some metrics continue to be reported while others do not.

Example of metrics that are still reported:

Example of metrics that disappear:

This happens on the latest version, but it has been happening for a long time (at least half a year). I suspect this is a bug in the router, because restarting the deployment always fixes the issue.

It's obviously hard to reproduce this locally, so this issue is mainly for tracking and for finding out whether anyone else is experiencing something similar.

garypen commented 1 year ago

@Meemaw We recently (1.24.0) added support for configuring temporality for our OTLP metrics. This is designed for use with metrics ingesters such as Datadog, which prefer delta aggregation to cumulative (the OTEL default).

It may resolve the issues you are seeing. Sample configuration fragment:

...
        metrics:
          otlp:
            temporality: delta
... 

If you could try this and let us know if things improve, that would be helpful. As you note, it is difficult to debug/track, but we have found this improves reporting in the testing we have managed to perform.
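To make the cumulative/delta distinction concrete, here is a minimal Python sketch. It is not router code: the function name `to_deltas` and the sample readings are made up for illustration. With cumulative temporality each export carries the running total; with delta temporality each export carries only the change since the previous export.

```python
from typing import List


def to_deltas(cumulative: List[int]) -> List[int]:
    """Convert running totals into per-interval changes.

    The first interval is measured from an implicit starting total of 0.
    """
    deltas = []
    previous = 0
    for total in cumulative:
        deltas.append(total - previous)
        previous = total
    return deltas


# Hypothetical request counter sampled at four export intervals.
cumulative_readings = [10, 25, 25, 40]
delta_readings = to_deltas(cumulative_readings)
print(delta_readings)  # [10, 15, 0, 15]
```

Both series describe the same traffic; they differ only in what each exported data point means, which is why the ingester and the exporter need to agree on the temporality.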

Meemaw commented 1 year ago

@garypen will report how it goes.

There is actually one thing I noticed after the upgrade: the apollo_router_session_count_total metric is now reported differently, showing negative values at times.

Screenshot 2023-07-20 at 13 24 35

garypen commented 1 year ago

I think that's a different problem, because I've noticed that the apollo_router_session_count values are odd even with cumulative temporality. I agree that it's more obvious with delta. I'll file an issue for that.
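As a purely illustrative aside, the Python sketch below shows why negative values can be more visible under delta temporality. It assumes the metric tracks a quantity that can go down as well as up (such as currently open sessions), which is an assumption and not something confirmed in this thread. The helper `interval_changes` and the sample values are hypothetical, and the sketch does not explain the odd values seen with cumulative temporality.

```python
from typing import List


def interval_changes(values: List[int]) -> List[int]:
    """Return the change between consecutive samples, starting from 0."""
    previous_values = [0] + values[:-1]
    return [current - previous for previous, current in zip(previous_values, values)]


# Hypothetical number of open sessions at four export intervals.
open_sessions = [3, 5, 2, 4]
print(interval_changes(open_sessions))  # [3, 2, -3, 2]
```

The sampled values themselves are never negative, but the change between the second and third sample is, so a graph of raw deltas dips below zero.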

garypen commented 1 year ago

see: https://github.com/apollographql/router/issues/3485

abernix commented 1 year ago

Any updates on this, @Meemaw ? 😄

Meemaw commented 1 year ago

@abernix we still see metrics/traces disappearing after a while on v1.26.0.

garypen commented 1 year ago

@Meemaw That's disappointing. We have been using 1.26.0 with delta temporality successfully with Datadog over the last couple of weeks.

Meemaw commented 1 year ago

@garypen That's only relevant for metrics, right? We are also not seeing traces, which shouldn't be affected by that change.

This is our config (in case you see anything wrong):

telemetry:
  metrics:
    common:
      service_name: "${env.DD_SERVICE:-graphql-federation}"
    otlp:
      endpoint: "http://${env.DD_AGENT_HOST:-datadog}:4317"
      temporality: delta
  tracing:
    trace_config:
      service_name: "${env.DD_SERVICE:-graphql-federation}"
      service_namespace: "${env.DD_ENV:-development}"
      sampler: "${env.DD_TRACE_SAMPLE_RATE:-1}"
      parent_based_sampler: true
      attributes:
        version: "${env.DD_VERSION:-development}"
    otlp:
      endpoint: "http://${env.DD_AGENT_HOST:-datadog}:4317"

We have some other services that use the same OTLP gRPC endpoint, and they work without issues.

garypen commented 1 year ago

@Meemaw It is only relevant for metrics, but you wrote: "we still see metrics/traces disappearing after a while on v1.26.0." so I was commenting on the metrics part of that. I should probably have made that clear.

I can't see anything wrong with your config.

Just out of interest, are any of your other functional services written in Rust and using the opentelemetry-rust crate?

Meemaw commented 1 year ago

> @Meemaw It is only relevant for metrics, but you wrote: "we still see metrics/traces disappearing after a while on v1.26.0." so I was commenting on the metrics part of that. I should probably have made that clear.
>
> I can't see anything wrong with your config.
>
> Just out of interest, are any of your other functional services written in Rust and using the opentelemetry-rust crate?

No, others are in Go.

Meemaw commented 1 year ago

@garypen Another observation: metrics emitted by us (in a custom Rust plugin) do not disappear.

abernix commented 1 year ago

This is blocked until #3601 is done, so track that one first if you're curious about progress. ;)

garypen commented 1 year ago

One other observation. I have noted that if there is no activity, for whatever reason, around a particular metric for a "while", then our Datadog widget just stops reporting data. It's as though it is waiting for more data to arrive before it resumes graphing. Could this be part of the problem you are seeing, @Meemaw? That is, rather than previously seen metrics disappearing, what you are seeing is that metrics suddenly stop being updated and then, maybe, are updated again later.

Meemaw commented 1 year ago

> One other observation. I have noted that if there is no activity, for whatever reason, around a particular metric for a "while", then our datadog widget just stops reporting data. It's as though it is waiting for more data to arrive before it resumes graphing. Could this be part of the problem you are seeing @Meemaw ? i.e.: rather than metrics that you've previously seen disappearing, what you are seeing is that metrics suddenly stop being updated and then, maybe, later they are updated.

By "no activity", do you mean the router having no traffic and therefore not emitting metrics? We have constant high-RPS traffic, so that would not be the case here.

Meemaw commented 11 months ago

@abernix @garypen This seems to be fixed on newer versions of the router 🎉