VictoriaMetrics / VictoriaMetrics

vmalert: add rules backfilling API #6328

Open Haleygo opened 4 months ago

Haleygo commented 4 months ago

Is your feature request related to a problem? Please describe

vmalert supports backfilling (aka replay) of alerting and recording rules as a CLI tool, which exits immediately after the work is done. It provides an interface like this to display the execution progress:

[image: vmalert replay progress output]

We normally recommend using jobs, such as Kubernetes Jobs, to perform replay operations. But sometimes the user who manages the rules doesn't have permission to create k8s resources, or doesn't want extra code to manage those jobs.

Describe the solution you'd like

  1. Add a /replay API to vmalert that accepts rule files plus replay.timeFrom and replay.timeTo timestamps and performs backfilling the same way replay mode does.
    • Pros: avoids using external jobs;
    • Cons:
      • A replay operation can be pretty expensive when a rule expression returns tons of time series; in that case vmalert requires extra resources on top of its normal usage. Those resources can be hard to estimate and are wasted once the replay is over. And when a replay runs alongside normal rule evaluations, it could delay them or crash vmalert, for example with an OOM.
      • A replay can take a long time, and the user can't tell when it's going to finish (we can expose a metric such as replay_rule_queue_number).
  2. Add replay options to rule groups and rules; vmalert will then try replaying recording rules when the group/rule starts. The options are only valid for recording rules. We first check whether the rule has been replayed before by querying the datasource for the replay success metric vmalert_replay_successed{group="", id=<rule parameter hash>, }.

    ## The name of the group. Must be unique within a file.
    name: 
    
    ## Replay the group when it starts for the first time.
    replay: true
    replayFor: 30d ## replay the last 30 days
    # replayFrom: 2024-05-11T07:21:43Z ## or replay a specific time range
    # replayTo: 2024-05-11T07:21:43Z
    replayInterval: 1d

    • Pros: easy to use;
    • Cons: easy to overuse, and has the same resource problem as adding the /replay API.

    Q:

    1. What if the replay job fails? Since replay is a background job, we will keep retrying until it succeeds. The user should be able to find useful metrics and logs about the failed retries, and fix the cause if it's a datasource issue.
    2. When can we remove the replay param from the group? It's better to remove the replay: true param once the group replay is done; this can be checked via vmalert logs or the metric vmalert_replay_successed{group="", id=<rule parameter hash>, } value=FinishedTimestamp. But it's also OK not to remove the parameter immediately after the replay is over. By default, we check vmalert_replay_successed{group="", id=<rule parameter hash>, } value=FinishedTimestamp over the last 30 days and skip the rule replay if it already succeeded within those 30 days. The param is checked when the group starts (which happens when vmalert starts or when the group is created/updated); we run an extra query absent_over_time(recording_rule_name[30d]) against the datasource to determine whether the rule needs to be replayed this time.
    3. How does this work with HA vmalert? In an HA setup, due to query and remote-write latency, multiple vmalert instances could perform the rule replay simultaneously, or only some of them could (when the extra check query already returns data for the others). That's OK if the VictoriaMetrics server is configured with deduplication.
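
To make the group-level options above more concrete, here is a minimal Python sketch (not vmalert code) of two pieces of the proposal: splitting the replay window into replayInterval-sized chunks, and deciding whether a rule can be skipped because it already succeeded within the last 30 days. The function names and the way the success timestamp is obtained are assumptions for illustration only; in the proposal that timestamp would come from querying the datasource for vmalert_replay_successed.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, Optional, Tuple


def replay_chunks(
    start: datetime, end: datetime, interval: timedelta
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive [chunk_start, chunk_end) ranges covering [start, end)."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    cur = start
    while cur < end:
        nxt = min(cur + interval, end)  # the last chunk may be shorter
        yield cur, nxt
        cur = nxt


def needs_replay(
    last_success: Optional[datetime],
    now: datetime,
    lookback: timedelta = timedelta(days=30),
) -> bool:
    """Return True when no success marker exists inside the lookback window.

    last_success stands in for the FinishedTimestamp value of the proposed
    success metric; None means the (hypothetical) query returned nothing.
    """
    if last_success is None:
        return True
    return now - last_success > lookback


# replayFor: 30d with replayInterval: 1d
now = datetime(2024, 5, 11, 7, 21, 43, tzinfo=timezone.utc)
chunks = list(replay_chunks(now - timedelta(days=30), now, timedelta(days=1)))
```

With these inputs the window is split into 30 one-day chunks, and a rule whose last successful replay finished 10 days ago would be skipped on the next group start.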

As for the extra resources required by the proposals above, we should have some default limits, including:

  1. maxConcurrentReplayRules: limits the number of rules that perform replay concurrently;
  2. maxResponseSeries: limit the number of time series a single replay query can return; ...
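
As a rough illustration of how the two limits above could interact, the following Python sketch (again, not vmalert code) runs rule replays through a bounded worker pool and rejects any replay whose query returns too many series. Only the names maxConcurrentReplayRules and maxResponseSeries come from the proposal; the functions, the error type and the idea that each rule is a callable returning its series count are assumptions made to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


class TooManySeriesError(Exception):
    """Raised when a replay query returns more series than allowed."""


def check_series_limit(series_count: int, max_response_series: int) -> None:
    """Enforce the maxResponseSeries limit for a single replay query."""
    if series_count > max_response_series:
        raise TooManySeriesError(
            f"query returned {series_count} series, limit is {max_response_series}"
        )


def replay_rules(
    rules: Dict[str, Callable[[], int]],
    max_concurrent_replay_rules: int,
    max_response_series: int,
) -> Dict[str, str]:
    """Replay rules with at most max_concurrent_replay_rules running at once.

    Each callable is a stand-in for one replay query and returns the number
    of series that query produced.
    """

    def run(fn: Callable[[], int]) -> str:
        try:
            check_series_limit(fn(), max_response_series)
            return "ok"
        except TooManySeriesError as err:
            return f"rejected: {err}"

    # The pool size is what bounds the number of concurrent replays.
    with ThreadPoolExecutor(max_workers=max_concurrent_replay_rules) as pool:
        futures = {name: pool.submit(run, fn) for name, fn in rules.items()}
        return {name: fut.result() for name, fut in futures.items()}


results = replay_rules(
    {"small_rule": lambda: 10, "huge_rule": lambda: 5000},
    max_concurrent_replay_rules=2,
    max_response_series=1000,
)
```

Here small_rule is replayed normally, while huge_rule is rejected because its stand-in query exceeds the series limit.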

mblls commented 4 months ago

The second option listed would be an amazing addition. I could even foresee something like:

##The name of the group. Must be unique within a file.
name: 

##replay it when start this group for the first time
replay: true
replayFrom: 30d

This would eliminate the need to specify and manage specific dates, especially if the rule group is updated semi-frequently. Thoughts?

Haleygo commented 4 months ago

Of course, if we decide to implement the second option.