argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/
Apache License 2.0

Race condition when updating analysis template #3864

Open VibhuSrivastava opened 1 month ago

VibhuSrivastava commented 1 month ago

Describe the bug

We have observed a race condition when changing Argo Rollouts analysis templates: sometimes the rollout starts before the analysis template has been updated, so an analysis run can begin with the old version of the template.

To Reproduce

Since this is a race condition, it is hard to predict when it will happen.

Previous analysis template:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  labels:
    generator-chart: something
  name: something-service-success-rate
  namespace: something
spec:
  metrics:
  - failureLimit: 3
    interval: 1m
    name: avg-success-rage-for-http-requests
    provider:
      datadog:
        apiVersion: v2
        formula: moving_rollup( default_zero(a) / b , 300, 'avg')
        interval: 30s
        queries:
          a: sum:trace.servlet.request.hits.by_http_status{abc}.as_count()
          b: sum:trace.servlet.request.hits.by_http_status{xyz}.as_count()
    successCondition: default(result, 1) >= 0.99

Analysis template to be applied:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  labels:
    generator-chart: something
  name: something-service-success-rate
  namespace: something
spec:
  metrics:
  - failureLimit: 3
    interval: 30s
    name: total-success-rate-for-http-requests
    provider:
      datadog:
        apiVersion: v2
        formula: moving_rollup( (default_zero(a) / b) * 100 , 60, 'avg')
        interval: 5m
        queries:
          a: sum:trace.servlet.request.hits.by_http_status{abc2}.as_count()
          b: sum:trace.servlet.request.hits.by_http_status{xyz2}.as_count()
    successCondition: default(result, 100) >= 95

We noticed that the rollout was aborted because the metric from the old analysis template (avg-success-rage-for-http-requests) failed.

Screenshot 2024-10-01 at 22 21 14

Expected behavior

Any change to an analysis template should always be applied before an analysis run starts. In the new rollout, the analysis was expected to run total-success-rate-for-http-requests, but it actually ran avg-success-rage-for-http-requests.

The same change was rolled out to multiple environments. In every environment the resulting analysis used total as expected, except for one case where it used avg, so the behaviour is unpredictable.
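One way to spot affected environments is to compare the metric names recorded on each AnalysisRun against the names expected from the new template. The following is a minimal sketch, assuming the input is the parsed JSON from something like kubectl get analysisrun -n something -o json and that each run lists its metrics under spec.metrics; the function name runs_with_unexpected_metrics is hypothetical and not part of Argo Rollouts.

```python
def runs_with_unexpected_metrics(analysis_run_list, expected_names):
    """Map AnalysisRun name -> metric names that are not in expected_names.

    analysis_run_list: dict shaped like a Kubernetes List object ({"items": [...]}).
    expected_names: metric names defined by the updated AnalysisTemplate.
    """
    expected = set(expected_names)
    stale = {}
    for run in analysis_run_list.get("items", []):
        metrics = run.get("spec", {}).get("metrics", [])
        names = {m["name"] for m in metrics}
        unexpected = sorted(names - expected)
        if unexpected:
            stale[run["metadata"]["name"]] = unexpected
    return stale
```

An empty result would suggest that every run picked up the updated template; any entry points at a run that was created from an older version.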

Version

1.7.2

Logs

# Paste the logs from the rollout controller

# Logs for the entire controller:
kubectl logs -n argo-rollouts deployment/argo-rollouts

# Logs for a specific rollout:
kubectl logs -n argo-rollouts deployment/argo-rollouts | grep rollout=<ROLLOUTNAME>

VibhuSrivastava commented 2 weeks ago

In the end we modified our Kubernetes apply script so that it does an ordered apply (analysis templates first, everything else in a second pass). That looks like it has solved the problem. Sharing it here in case someone else runs into the same issue.
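The ordered apply described above could look roughly like the sketch below. It is not our actual script: it assumes the manifests arrive as one multi-document YAML string, that each document has a top-level kind: line, and that kubectl is on the PATH. The names TEMPLATE_KINDS, order_for_apply and apply_in_order are hypothetical.

```python
import re
import subprocess

# Kinds to apply in the first pass, before anything that may reference them.
TEMPLATE_KINDS = {"AnalysisTemplate", "ClusterAnalysisTemplate"}

# Matches only an unindented (top-level) "kind:" line.
_KIND_RE = re.compile(r"^kind:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def split_documents(multi_doc_yaml):
    """Split a multi-document YAML string on '---' separator lines."""
    docs = re.split(r"^---[ \t]*$", multi_doc_yaml, flags=re.MULTILINE)
    return [d.strip() for d in docs if d.strip()]


def order_for_apply(multi_doc_yaml):
    """Return (analysis_templates, everything_else) as lists of YAML documents."""
    first, second = [], []
    for doc in split_documents(multi_doc_yaml):
        match = _KIND_RE.search(doc)
        kind = match.group(1) if match else None
        (first if kind in TEMPLATE_KINDS else second).append(doc)
    return first, second


def apply_in_order(multi_doc_yaml):
    """Apply analysis templates first, then the remaining manifests."""
    for batch in order_for_apply(multi_doc_yaml):
        if batch:
            subprocess.run(
                ["kubectl", "apply", "-f", "-"],
                input="\n---\n".join(batch) + "\n",
                text=True,
                check=True,  # stop before the second pass if the first fails
            )
```

Splitting into two passes only guarantees that the template update reaches the API server before the rollout change does. Whether the controller has already observed the new template at that point is a separate question, so this is a workaround rather than a fix for the underlying race.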