cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.
https://cortexmetrics.io/

Rulers HA #4435

Open alanprot opened 3 years ago

alanprot commented 3 years ago

Is your feature request related to a problem? Please describe.
Currently the Ruler's ReplicationFactor is hardcoded to 1.

Running each rule group in a single ruler causes some problems, as described below:

Describe the solution you'd like

In order to achieve HA we could make the ReplicationFactor configurable and run each rule group in multiple rulers. Running the same rule group in multiple rulers should not be a problem, as all replicas must use the same slotted intervals to evaluate the rules - in other words, all replicas should use the same timestamps to evaluate the rules and create the metrics.
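
To make the "slotted intervals" idea concrete, here is a minimal sketch (the helper name is made up; this is not the actual Prometheus/Cortex code) of deriving the evaluation timestamp purely from the group's interval and a stable per-group offset, so independent replicas land on the same timestamps:

package main

import (
    "fmt"
    "hash/fnv"
    "time"
)

// evalTimestamp returns the next evaluation time for a rule group, aligned to
// a slot grid that depends only on the group's interval and a stable per-group
// offset. Every ruler replica computes the same slots for the same group, so
// all replicas emit samples with identical timestamps. Hypothetical helper.
func evalTimestamp(groupKey string, interval time.Duration, now time.Time) time.Time {
    h := fnv.New64a()
    h.Write([]byte(groupKey))
    offset := time.Duration(h.Sum64() % uint64(interval)) // stable offset within one interval

    slot := now.Truncate(interval).Add(offset)
    if slot.Before(now) {
        slot = slot.Add(interval)
    }
    return slot
}

func main() {
    now := time.Now()
    // Three independent "replicas" computing the slot agree on the timestamp.
    for i := 0; i < 3; i++ {
        fmt.Println(evalTimestamp("tenant-1/test", 3*time.Minute, now))
    }
}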

Below is an example from a POC where 3 rulers are evaluating the same rule group, and we can see that the evaluation interval is respected:

groups:
- name: test
  interval:  3m

  rules:
  - record: alantestTime
    expr: time()
[Two screenshots from the POC (2021-08-20) showing the query results for the alantestTime rule across the 3 rulers]

The problem now is that we have multiple rulers generating the same metrics, so we get "duplicate sample" errors. One possible solution would be to ignore the duplicate-sample error in the ruler's Pusher, but those samples would still be counted in the ingester's discarded-samples metric, and detecting that the error returned by the ingesters was a duplicate-sample error could be challenging on the ruler's Pusher side - probably a string comparison. I think a better solution would be to make the ingester not return the duplicate-sample error at all when the samples it receives were sent by a ruler - fortunately we already have this information on the ingesters:

https://github.com/cortexproject/cortex/blob/b4daa22055ffec14311d8b5d2d9429f1bd575dad/pkg/ingester/ingester_v2.go#L937-L944
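
A rough sketch of that idea, assuming the ingester can tell that a write request came from a ruler (for example via the request's source); the helper below is hypothetical and is not the code at the link above:

package main

import (
    "errors"
    "fmt"

    "github.com/prometheus/prometheus/storage"
)

// handleAppendError is a hypothetical version of the ingester's per-sample
// error handling: when the push originated from a ruler, a duplicate-sample
// error is expected (every replica of the rule group pushes a sample for the
// same series at the same timestamp), so it is swallowed instead of being
// returned to the client and counted as a discarded sample.
func handleAppendError(err error, fromRuler bool) error {
    if fromRuler && errors.Is(err, storage.ErrDuplicateSampleForTimestamp) {
        return nil
    }
    return err
}

func main() {
    fmt.Println(handleAppendError(storage.ErrDuplicateSampleForTimestamp, true))  // <nil>
    fmt.Println(handleAppendError(storage.ErrDuplicateSampleForTimestamp, false)) // duplicate sample for timestamp
}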

Describe alternatives you've considered

One option would be to use the HA tracker and let the distributor dedup the duplicated samples. In this case we could use the pod name as __replica__, but we cannot use a single value for the cluster label (it would cause problems if the shard size > replication factor). A possible solution would be to calculate the cluster value from all the rulers in the rule group's replica set (ex: for ruleGroupA we have 3 rulers - sort the 3 rulers and use the result as the cluster value).
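
As an illustration of that calculation (the ring lookup is not shown and the helper name is made up), the cluster value can be derived deterministically from the sorted replica set, while the pod name is used as __replica__:

package main

import (
    "fmt"
    "sort"
    "strings"
)

// haLabelsForGroup derives the HA-tracker labels for samples produced by a
// rule group. replicas is the set of ruler addresses owning the group
// according to the ring (lookup not shown); podName identifies this ruler.
// Every replica computes the same cluster value because the list is sorted.
func haLabelsForGroup(replicas []string, podName string) (cluster, replica string) {
    sorted := append([]string(nil), replicas...)
    sort.Strings(sorted)
    return strings.Join(sorted, ","), podName
}

func main() {
    cluster, replica := haLabelsForGroup([]string{"ruler-2", "ruler-0", "ruler-1"}, "ruler-1")
    fmt.Printf("cluster=%q __replica__=%q\n", cluster, replica)
}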

The problem with this approach is that, in the Pusher implementation, we don't know which rule group generated the metric - even though Cortex could be changed to add this info to the ctx. The other drawback of this solution is that we would add the cluster label to the metrics generated by the rules.

Another option would be to use the distributor's haTracker component (refactored so it can be used by other components) to track, in the ruler itself, who is the leader for a given rule group. This solution has the same problem as the previous one - we don't know which rule group is generating the metric in the Pusher interface - but we would not add the cluster label to the metrics generated by the rules.
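
For illustration only, a toy in-memory stand-in for that per-rule-group leader tracking; a real refactor would reuse the haTracker on top of the shared KV store (CAS, timeouts, failover), none of which is modeled here:

package main

import (
    "fmt"
    "sync"
)

// leaderTracker is a toy stand-in for a refactored haTracker: the first ruler
// to claim a rule group becomes its leader and is the only one that pushes the
// group's samples. A real implementation would use the shared KV store
// (consul/etcd/memberlist) with CAS and a timeout so the lease moves when the
// leader dies.
type leaderTracker struct {
    mu      sync.Mutex
    leaders map[string]string // rule group key -> ruler instance ID
}

func newLeaderTracker() *leaderTracker {
    return &leaderTracker{leaders: map[string]string{}}
}

// isLeader reports whether the given ruler should push samples for groupKey.
func (t *leaderTracker) isLeader(groupKey, ruler string) bool {
    t.mu.Lock()
    defer t.mu.Unlock()
    if cur, ok := t.leaders[groupKey]; ok {
        return cur == ruler
    }
    t.leaders[groupKey] = ruler // first claimer wins (no expiry modeled here)
    return true
}

func main() {
    tr := newLeaderTracker()
    fmt.Println(tr.isLeader("tenant-1/groupA", "ruler-0")) // true
    fmt.Println(tr.isLeader("tenant-1/groupA", "ruler-1")) // false
}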

Additional context
Any other solution I could come up with required further changes in Prometheus and would not bring big advantages. Ex: adding the ruleGroup info to the context.

qinxx108 commented 3 years ago

Hi, after setting the ruler's ReplicationFactor to more than 1, we found that the problem comes from the ALERTS_FOR_STATE metric: because each ruler's activeAt can be different, we get "duplicate sample for timestamp" errors.


The solution could be:

  1. Have an UpdateState mechanism, like the Alertmanager, so that we periodically sync the state between the rulers in the ring.
  2. Add the HA labels to the ALERTS_FOR_STATE metric. The only problem is that the distributor will select which sample to accept, and the result will be unstable.
  3. Have the rulers periodically sync from the ALERTS_FOR_STATE metric and update their internal activeAt attribute (see the sketch below).
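
A minimal sketch of option 3, assuming the value of ALERTS_FOR_STATE is the alert's activeAt as unix seconds (as Prometheus writes it) and that the other replicas' values come from a query that is not shown here:

package main

import (
    "fmt"
    "time"
)

// syncActiveAt converges the local activeAt with the ALERTS_FOR_STATE values
// observed from the other replicas (each value is an activeAt in unix
// seconds): every ruler adopts the earliest one, so all replicas end up
// pushing identical ALERTS_FOR_STATE samples. Hypothetical helper.
func syncActiveAt(localActiveAt time.Time, remoteValues []float64) time.Time {
    activeAt := localActiveAt
    for _, v := range remoteValues {
        if remote := time.Unix(int64(v), 0); remote.Before(activeAt) {
            activeAt = remote
        }
    }
    return activeAt
}

func main() {
    local := time.Unix(1000100, 0)
    remote := []float64{1000000, 1000050} // values read back from ALERTS_FOR_STATE
    fmt.Println(syncActiveAt(local, remote).Unix()) // 1000000
}
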
bboreham commented 2 years ago

Thanks @qinxx108, that is useful feedback. I don't think anyone started work on Alan's suggestion, but you brought up another item to address.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

alanprot commented 2 years ago

Reopening

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

friedrichg commented 2 years ago

not stale
