VictoriaMetrics / VictoriaMetrics

VictoriaMetrics: fast, cost-effective monitoring solution and time series database
https://victoriametrics.com/
Apache License 2.0
12.08k stars 1.2k forks source link

series fetched may report different values for the same query depending on whether cache was hit #7170

Open hartfordfive opened 1 week ago

hartfordfive commented 1 week ago

Is your question request related to a specific component?

vmselect

Describe the question in detail

I've been running a custom comparison script to compare the execution times of queries against Victoriametrics cluster and Prometheus-compatible backends such as Prometheus and Thanos. The script, which fetches a list of top slowest queries from a centralized logging source, extracts the query, then runs it 5 times against both Victoriametrics (cluster) and Thanos.

To my surprise, I've noticed that some executions of the same query return a different number of series fetched even though it's the same query, same start/end time, and the time between the executions is only a few seconds some executions show less series being returned.

Query:

max_over_time(node_network_speed_bytes{device!~\".*(bridge|podman).*\",hostname!=\"\",job!~\"kubernetes-service-endpoints.*\"}[3d]) 
> 
node_network_speed_bytes{device!~\".*(bridge|podman).*\",hostname!=\"\",job!~\"kubernetes-service-endpoints.*\"}

Execution 1:

{
  "status": "success",
  "isPartial": false,
  "data": {
    "resultType": "vector",
    "result": []
  },
  "stats": {
    "seriesFetched": "375535",
    "executionTimeMsec": 1269
  },
  "trace": {
  ...

Execution 2:

{
  "status": "success",
  "isPartial": false,
  "data": {
    "resultType": "vector",
    "result": []
  },
  "stats": {
    "seriesFetched": "179290",
    "executionTimeMsec": 1135
  },
  "trace": {
  ...

I don't understand why this is occurring and considering the trace of the query is quite large, I'm not certain what to look for? I've attached the sample traces as a gist here. I would very much appreciate some help trying to understand and resolve this issue.

Troubleshooting docs

hagen1778 commented 1 week ago

Hey @hartfordfive! Is this only seriesFetched different? Do you observe difference in resulting value? Can you check the final number of returned series in the end of the trace - is it the same on repeated exeutions?

hartfordfive commented 1 week ago

Yes, in all the cases I observed, the resulting number of series returned was the same from one execution to the next even though the number of series fetched was different.

hagen1778 commented 1 week ago

Yes, in all the cases I observed, the resulting number of series returned was the same from one execution to the next even though the number of series fetched was different.

Then, I think, it is just a bug in reporting of this stat. In this PR https://github.com/VictoriaMetrics/VictoriaMetrics/pull/7181 I made an attempt to fix one of such spots. But there is another spot that would have a "cost of fixing"(compute-wise), and I'm hesitant to do this. Mostly because seriesFetched has no impact on anything. Its main purpose was to highlight whether query matches at least something - see how we use it vmalert https://docs.victoriametrics.com/vmalert/#never-firing-alerts

hartfordfive commented 1 week ago

While I can understand and appreciate your concern regarding the technical cost of fixing the bug, having this type of result can and does raise concerns from users who are evaluating the solution as a potential replacement candidate. If there is an expectation of consistency across queries performed the same way on the identical time range, one wouldn't expect a difference in the number of series fetched, especially not a drastic one like you've seen in my sample traces. This concern becomes especially important when upper level technical management (such as a CTO) start asking questions as to why these discrepancies are being observed.

If there is technically no harm or observable down-side to this issue, then it might remain to be discussed if it should in fact be addressed or not, but at the very minimum there should be some clear official documentation that states this issue and that explains it well while providing assurances that it doesn't have a real impact on anything except the reported value of the seriesFetched field.

The only area where I could see this potentially having an impact might be in any integration tests performed for individual alerting conditions prior to them being released to vmalert instances to ensure the queries are actually matching series which exist. Could there be any possible edge cases where the seriesFetched value might be 0 for some executions and be greater than 0 in others for the same query?

hagen1778 commented 4 days ago

but at the very minimum there should be some clear official documentation that states this issue and that explains it well while providing assurances that it doesn't have a real impact on anything except the reported value of the seriesFetched field.

The field seriesFetched in the VM response body is undocumented. It is not mentioned anywhere in our recommendations, so users shouldn't rely on it in their tests or use it to count the exact series touched by a query until the documentation suggests or states otherwise.

Could there be any possible edge cases where the seriesFetched value might be 0 for some executions and be greater than 0 in others for the same query?

It shouldn't be, no.

hartfordfive commented 4 days ago

The field seriesFetched in the VM response body is undocumented. It is not mentioned anywhere in our recommendations, so users shouldn't rely on it in their tests or use it to count the exact series touched by a query until the documentation suggests or states otherwise.

I haven't seen it in your documentation although it has been published in this blog post which does tend to have a fair amount of traction in my opinion (considering it was an addition of a great feature!). With that being the case, it should be expected that others will start using this field either directly in their CICD pipelines (to confirm queries are working as expected) or to trace queries for performance/debugging purposes (such as I've done).

Could there be any possible edge cases where the seriesFetched value might be 0 for some executions and be greater than 0 in others for the same query?

Given there shouldn't be any cases where seriesFetched could be < 0 in some cases and > 0 in others, then it may be reasonable for now to at least document this somewhere saying there is a known issue and that it has no impact and that anyone relying on this for existence of metrics should simply verify the value is >= 1. Adding a link to the active Github issue would also be good as it allows users to better understand the bug and at least gives some type of transparency. This could also help avoiding other related issues such as this one from being submitted.