Closed Meemaw closed 11 months ago
@Meemaw We recently (1.24.0) added support for temporality for our OTLP metrics. This is designed for use with metrics ingesters such as Datadog, which prefer delta aggregation to cumulative (which is the OTEL default).
It may resolve the issues you are seeing. Sample configuration fragment:
...
metrics:
  otlp:
    temporality: delta
...
If you could try this and let us know if things improve, that would be helpful. As you note, it is difficult to debug/track, but we have found this improves reporting in the testing we have managed to perform.
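For anyone following along, here is a minimal Python sketch of how the two temporalities relate. It is only an illustration, not router or Datadog code, and the function names and sample readings are made up: cumulative reports the running total since start, while delta reports only the change since the previous export.

```python
from typing import List


def cumulative_to_delta(cumulative: List[int]) -> List[int]:
    """Turn running totals into per-export changes (delta temporality)."""
    deltas = []
    previous = 0  # assumes the counter started at zero
    for value in cumulative:
        deltas.append(value - previous)
        previous = value
    return deltas


def delta_to_cumulative(deltas: List[int]) -> List[int]:
    """Rebuild running totals from per-export changes (cumulative temporality)."""
    totals = []
    running = 0
    for change in deltas:
        running += change
        totals.append(running)
    return totals


# Hypothetical request-counter readings at four consecutive exports.
cumulative_readings = [5, 12, 12, 20]
print(cumulative_to_delta(cumulative_readings))
print(delta_to_cumulative(cumulative_to_delta(cumulative_readings)))
```

The round trip gives back the original readings, so both forms carry the same information as long as no export is lost. The difference is in which side has to keep the running state.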
@garypen will report how it goes.
There is actually 1 thing I noticed after the upgrade. The apollo_router_session_count_total metric is now reporting differently, showing negative values at times.
I think that's a different problem, because I've noticed that the apollo_router_session_count values are odd even with cumulative temporality. I agree that it's more obvious with delta. I'll file an issue for that.
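One possible reason negative values stand out more with delta is sketched below in Python. It assumes the session count behaves like a value that can go down as well as up, which is an assumption and not something confirmed here, and the readings are invented. Any drop between two exports then shows up directly as a negative data point.

```python
def interval_changes(readings):
    """Return the change between each pair of consecutive readings."""
    return [later - earlier for earlier, later in zip(readings, readings[1:])]


# Hypothetical active-session readings at consecutive exports.
active_sessions = [3, 5, 4, 1, 2]

changes = interval_changes(active_sessions)
print(changes)

# Exports where the value dropped and the reported change is negative.
negative_points = [c for c in changes if c < 0]
print(negative_points)
```

With cumulative temporality the same drops are still there, but they appear as a dip in a positive series instead of as points below zero.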
Any updates on this, @Meemaw ? 😄
@abernix we still see metrics/traces disappearing after a while on v1.26.0.
@Meemaw That's disappointing. We have been using 1.26.0 with delta temporality successfully with datadog over the last couple of weeks.
@garypen That's only relevant for metrics, right? Also not seeing traces which shouldn't be affected by that change.
This is our config (in case you see anything wrong):
telemetry:
  metrics:
    common:
      service_name: "${env.DD_SERVICE:-graphql-federation}"
    otlp:
      endpoint: "http://${env.DD_AGENT_HOST:-datadog}:4317"
      temporality: delta
  tracing:
    trace_config:
      service_name: "${env.DD_SERVICE:-graphql-federation}"
      service_namespace: "${env.DD_ENV:-development}"
      sampler: "${env.DD_TRACE_SAMPLE_RATE:-1}"
      parent_based_sampler: true
      attributes:
        version: "${env.DD_VERSION:-development}"
    otlp:
      endpoint: "http://${env.DD_AGENT_HOST:-datadog}:4317"
We have some other services which are using the otlp grpc endpoint and they work without issues.
@Meemaw It is only relevant for metrics, but you wrote: "we still see metrics/traces disappearing after a while on v1.26.0." so I was commenting on the metrics part of that. I should probably have made that clear.
I can't see anything wrong with your config.
Just out of interest, are any of your other functional services written in Rust and using the opentelemetry-rust crate?
No, others are in Go.
@garypen another observation. Metrics emitted by us (in a custom rust plugin) do not disappear.
This is blocked until #3601 is done, so track that one first if you're curious about progress. ;)
One other observation. I have noted that if there is no activity, for whatever reason, around a particular metric for a "while", then our datadog widget just stops reporting data. It's as though it is waiting for more data to arrive before it resumes graphing. Could this be part of the problem you are seeing @Meemaw ? i.e.: rather than metrics that you've previously seen disappearing, what you are seeing is that metrics suddenly stop being updated and then, maybe, later they are updated.
By no activity you mean router having no traffic and metrics not being emitted? We have constant high rps traffic, so this would not be the case.
@abernix @garypen this seems to be fixed on newer versions of router 🎉
Describe the bug
After a while, traces and some metrics stop being reported by the router using the OTLP exporter ~ Datadog. The timing varies, but it is usually a few hours. Traces are always missing when this happens; some metrics are still reported while others are not.
Example of metrics that are still reported:
Example of metrics that disappear:
This happens on the latest version, but has been happening for a long time (half a year at least). I suspect this is a bug in the router, because restarting the deployment always fixes the issue.
It's obviously hard to reproduce this locally, so this is more for tracking and for finding out whether anyone else is experiencing similar issues.