argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/
Apache License 2.0

Have AnalysisRun with Datadog provider use 2nd to last datapoint or adjust the query time range #3665

Open andrii-korotkov-verkada opened 2 months ago

andrii-korotkov-verkada commented 2 months ago

Discussed in https://github.com/argoproj/argo-rollouts/discussions/3658

Originally posted by **andrii-korotkov-verkada** June 20, 2024

For a datapoint at time T with rollup interval I, the datapoint's value is an aggregation over the interval [T - I/2, T + I/2], and datapoints are aligned to absolute points in time, not relative to now. This means T can be so close to now that part of the interval is not yet available, which likely leads to interpolation or incomplete values from Datadog.

The issue I saw is with the apdex metric `trace.flask.request.apdex` with avg integration and `fill(last)`: it shows values >1 and sometimes lower values than it should, compared with what the UI shows for the same metric later. This causes some AnalysisRuns to fail mistakenly.

To avoid this, the query time interval may need to be adjusted from `[now - interval, now]` to `[now - interval - I/2, now - I/2]`, or the 2nd to last data point could be used if available. Is this possible to achieve? Or should I look into disabling or changing interpolation? Thanks.
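
The window shift proposed above can be written out as a small sketch. This is only an illustration of the arithmetic, not code from argo-rollouts; the function names `shifted_window` and `bucket_is_complete` and the use of plain seconds are assumptions made for the example.

```python
from typing import Tuple


def shifted_window(now_s: float, interval_s: float, rollup_s: float) -> Tuple[float, float]:
    """Return (start, end) for [now - interval - I/2, now - I/2].

    Hypothetical helper: all values are seconds, rollup_s is the rollup interval I.
    """
    half = rollup_s / 2
    return (now_s - interval_s - half, now_s - half)


def bucket_is_complete(t_s: float, rollup_s: float, now_s: float) -> bool:
    """A datapoint at T aggregates [T - I/2, T + I/2]; it is complete once T + I/2 <= now."""
    return t_s + rollup_s / 2 <= now_s


# Example: a 5-minute query with a 60-second rollup.
start, end = shifted_window(now_s=1000, interval_s=300, rollup_s=60)
# Every datapoint with T <= end satisfies T + I/2 <= now, so its interval has fully elapsed.
```

Note that this only accounts for the rollup interval; it does not cover any ingestion delay on the Datadog side.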

andrii-korotkov-verkada commented 2 months ago

Probably the easiest way is to add another manually configurable parameter for the value of I.

todaywasawesome commented 2 months ago

@meeech to discuss offline modifying query parameters.

andrii-korotkov-verkada commented 2 months ago

I'm experimenting with moving_rollup of 60 seconds as was suggested.

andrii-korotkov-verkada commented 2 months ago

`moving_rollup` seems to help, but there are still sometimes data points above 1 for the apdex metric. Bad data points briefly show up in the UI too, so this is related to how Datadog processes data points on the edge.

andrii-korotkov-verkada commented 1 month ago

I still observed a case where the fetched data points were wrong even with `moving_rollup`, so the time window probably needs to be adjusted.

andrii-korotkov-verkada commented 1 month ago

Adjusting the time window won't be that easy, since the metric can be delayed, i.e. there can be a 30-60s delay between now and the last data point. We'd need to query until now minus 1 or 2 minutes to mitigate that, which is quite a lot of delay. Maybe I'll just make the delay configurable. But then analysis runs have to be tuned as well, since the first few data points may be non-existent.
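
A minimal sketch of the configurable-delay idea, assuming a hypothetical `delay_s` setting (no such option is confirmed to exist in the Datadog provider). It also shows why the analysis has to tolerate missing values: early in a rollout the delayed window may contain no data yet, so the helper returns `None` rather than a number.

```python
from typing import Optional, Sequence, Tuple


def delayed_window(now_s: float, interval_s: float, delay_s: float) -> Tuple[float, float]:
    """Return (start, end) for [now - delay - interval, now - delay], in seconds.

    delay_s is a hypothetical user-configured value, e.g. 60-120 for a metric
    that lags by 30-60 seconds.
    """
    if delay_s < 0:
        raise ValueError("delay_s must be non-negative")
    end = now_s - delay_s
    return (end - interval_s, end)


def latest_value(values: Sequence[Optional[float]]) -> Optional[float]:
    """Return the most recent non-null value, or None if the window has no data."""
    for value in reversed(values):
        if value is not None:
            return value
    return None
```

How a `None` result should be treated (skip the measurement, retry, or mark it inconclusive) is a separate tuning decision and is left open here.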

andrii-korotkov-verkada commented 1 month ago

For API v1, I can try to conditionally use the 2nd to last data point, since it returns a point list, but for API v2 I don't think I can do this. I'll open a ticket with Datadog to clarify the options, at least for the apdex metric.
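
A rough sketch of what "conditionally use the 2nd to last data point" could look like for a v1-style point list, where each entry is a `[timestamp, value]` pair and the value may be null. The function name `pick_value` and the fallback to the only available point are assumptions for illustration, not the provider's actual behavior.

```python
from typing import List, Optional, Sequence


def pick_value(pointlist: Sequence[Sequence[Optional[float]]]) -> Optional[float]:
    """Pick a value from [[timestamp, value], ...] while avoiding the newest point.

    The newest point may cover an interval that has not fully elapsed, so the
    second to last usable point is preferred when there are at least two.
    """
    usable: List[Sequence[Optional[float]]] = [
        p for p in pointlist if len(p) == 2 and p[1] is not None
    ]
    if not usable:
        return None
    if len(usable) >= 2:
        return usable[-2][1]
    # Assumption: with a single usable point, fall back to it rather than fail.
    return usable[-1][1]


# Example: the newest apdex point is above 1, so the previous one is used.
value = pick_value([[1000, 0.95], [2000, 0.97], [3000, 1.4]])
```

A stricter variant could return `None` when only one point is available, trading fewer bad readings for more missing measurements.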

andrii-korotkov-verkada commented 1 month ago

It may actually be specific to how apdex is computed. Either way, the ticket with Datadog support has been filed, and I hope they'll have some kind of resolution.

shivshav commented 3 weeks ago

Hey @andrii-korotkov-verkada, curious if you ever got a resolution from Datadog? We're experiencing something similar with our Datadog metrics, possibly because they are not processed fast enough, leading to incomplete data points during analysis.

Did you ever confirm if your issue was specific to apdex or more general?

andrii-korotkov-verkada commented 3 weeks ago

I've pinged them recently, but they are still working on it :(

andrii-korotkov-verkada commented 3 weeks ago

My bet is apdex is particularly bad, given the issue reproduces even in Datadog UI when refreshing multiple times.