Open kira-u opened 1 year ago
I reran my test scenario on a fresh server to check whether the tag was simply not applied or the span was lost completely. After another server restart, for the same scenario of 2 requests to subgraph A and 2 requests to subgraph B, I got the following (left: subgraph tag, right: metric value):
suite 1: N/A = 4, A = 2, B = 1 // must be 2; 1 means the metric is lost
suite 2 (another server restart): N/A = 4, A = 2, B = 2
From what I can see, this issue is intermittent.
Describe the bug
Metric: apollo_router_http_requests_total
Scenario: performed 10 queries to subgraph A.
Observed result: out of 10 requests, only 9 were tagged with subgraph. The tag was not applied to the first request; subsequent requests were tagged correctly. Without restarting the router, the same behavior repeated on the first call to subgraph B, so the issue occurs per subgraph.
To Reproduce
Router version: 1.29.1
Router telemetry config:
Steps to reproduce the behavior:
sum:apollo_router_http_requests_total{$env,$region,$service} by {subgraph}.as_count()
would show: 1 request to subgraph A, 1 request to subgraph B, and 4 requests with subgraph tag value = N/A.
Expected behavior
The first request to a subgraph should be properly tagged with the subgraph name.
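For anyone who wants to check this without a dashboard, here is a minimal sketch that sums the metric per subgraph tag from Prometheus-style text output and reports which subgraphs are short of the expected count. It assumes the router's metrics are available in the Prometheus text exposition format; the function names, the `status` label, and the sample text are hypothetical and only mirror the counts reported above.

```python
import re
from collections import Counter

METRIC = "apollo_router_http_requests_total"

# One sample line in the text exposition format: name{labels} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

# Hypothetical scrape output, shaped after the numbers in this report.
SAMPLE = """\
# TYPE apollo_router_http_requests_total counter
apollo_router_http_requests_total{status="200"} 4
apollo_router_http_requests_total{status="200",subgraph="A"} 1
apollo_router_http_requests_total{status="200",subgraph="B"} 1
"""


def count_by_subgraph(exposition_text, metric=METRIC):
    """Sum the metric per `subgraph` label; samples without it go to "N/A"."""
    totals = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        totals[labels.get("subgraph", "N/A")] += float(match.group("value"))
    return dict(totals)


def missing_requests(observed, expected):
    """Return how many tagged requests are missing for each subgraph."""
    return {
        name: want - observed.get(name, 0)
        for name, want in expected.items()
        if observed.get(name, 0) < want
    }
```

With the sample above and an expectation of 2 requests per subgraph, `missing_requests(count_by_subgraph(SAMPLE), {"A": 2, "B": 2})` reports one missing request each for A and B.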
Additional context
Some could say the problem is not really important since it affects only the first request to a subgraph. But some subgraphs are invoked rarely, and multiplied by the number of router instances and the frequency of restarts, the data loss could be noticeable (well, we noticed it, so it is noticeable :) ). I did not check any other metrics, only the one mentioned at the beginning.
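As a rough illustration of that multiplication, the sketch below estimates how many samples could be lost, assuming exactly one lost sample per subgraph per router instance per restart and that every subgraph is called at least once after each restart. The function name and the numbers in the usage note are hypothetical, not measurements.

```python
def estimated_lost_samples(subgraphs, router_instances, restarts_per_instance):
    """Upper-bound estimate: one lost sample per subgraph, per instance, per restart."""
    if min(subgraphs, router_instances, restarts_per_instance) < 0:
        raise ValueError("all inputs must be non-negative")
    return subgraphs * router_instances * restarts_per_instance
```

For example, with made-up values of 20 subgraphs, 10 router instances, and 3 restarts per instance, `estimated_lost_samples(20, 10, 3)` gives 600 untagged requests. Since the issue looks intermittent, the real number would likely be lower.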