argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/
Apache License 2.0

Datadog query syntax not clear, leading to errors #2169

Open tfrokt opened 2 years ago

tfrokt commented 2 years ago

Describe the bug

This is the query I am using: default_zero(sum:client_event_total{env:prod,cliendid:42}.as_count()). It is valid Datadog syntax (opening Metrics in Datadog and pasting it in works for me), but it differs from the example in the docs. The rollout does not succeed.

Full template:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: rollout-analysis-metrics
spec:
  metrics:
  - name: client_events
    interval: 2m
    failureLimit: 5
    successCondition: result > 0
    provider:
      datadog:
        interval: 30m
        query: |
          default_zero(sum:client_event_total{env:prod,cliendid:42}.as_count())

Version

Can I see the version in the logs somehow?

Logs

This is one of the errors:

Error Message: invalid operation: > (mismatched types <nil> and int)

It is not clear to me where the "nil" comes from; the metric has a sum of ~900 per 10s.

And this is another one:

Operation cannot be fulfilled on rollouts.argoproj.io "client_events": the object has been modified; please apply your changes to the latest version and try again

Expected behavior

a) better error messages b) argo rollout to proceed

While I am here, I had this related question as well.

Happy to provide more info if required.



tfrokt commented 2 years ago

Another one: I am getting "SetCanaryScale requires TrafficRouting" after trying to set an explicit number of replicas to scale to.

From the docs it is not clear why my rollout throws this error.

From my template:

  strategy:
    canary:
      steps:
      - setCanaryScale:
          weight: 100
      - pause:
          duration: "30s"
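
For comparison, here is a minimal sketch of a canary strategy in which setCanaryScale is paired with traffic routing, which is what the error message asks for. The Service and Ingress names are hypothetical and do not come from this thread, and nginx is only one of the supported traffic routers; treat this as an illustration under those assumptions, not a verified fix.

```yaml
strategy:
  canary:
    canaryService: my-app-canary       # hypothetical Service name
    stableService: my-app-stable       # hypothetical Service name
    trafficRouting:
      nginx:
        stableIngress: my-app-ingress  # hypothetical Ingress name
    steps:
    - setCanaryScale:
        weight: 100                    # scale the canary independently of traffic weight
    - pause:
        duration: "30s"
```
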

tfrokt commented 2 years ago

I've played around with it a bit more.

This is what I've tried this time:

strategy:
    canary:
      steps:
      - setWeight: 10  # 10% of running pods, rounded up to at least 1
      - pause: 
          duration: "5m"
      - setWeight: 100
      - pause: 
          duration: "4m"
      analysis:
        templates:
        - templateName: rollout-analysis-metrics
      maxSurge: 1
      maxUnavailable: 1
      dynamicStableScale: true

It throws this error:

Rollout <rollout> is invalid: spec.strategy.dynamicStableScale: Invalid value: true: Canary dynamicStableScale can only be used with traffic routing

But the worst part of the issue is that it removed the running, stable deployment! It would be great to have a warning in the documentation that this happens when the option is not used correctly.

tfrokt commented 2 years ago

Adding to my original post

assessed Error due to consecutiveErrors (5) > consecutiveErrorLimit (4): "Error Message: invalid operation: > (mismatched types <nil> and int)"

for analysis template:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: rollout-analysis-metrics
spec:
  metrics:
  - name: client_events
    interval: 1m
    failureLimit: 3
    successCondition: result > 0
    provider:
      datadog:
        interval: 5m
        query: |
          sum:<metric>{env:{{ .Values.environment }},version:{{ .Values.image.tag }}}.as_count()

(part of) the rollout template:

steps:
      - setWeight: 10  # 10% of running pods, rounded up to at least 1
      - pause: 
          duration: "10m"
      - setWeight: 100
      analysis:
        templates:
        - templateName: rollout-analysis-metrics

The canary stays up for less than a minute and gets shut down. Expected behaviour: it stays up for at least 3m (failureLimit * interval).
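
The error message above refers to consecutiveErrorLimit (4) rather than failureLimit, which suggests that measurements whose condition cannot be evaluated are counted as errors, not failures. A minimal sketch of a metric that sets both limits explicitly follows. The env:prod tag stands in for the templated values, and the default(result, 0) guard is the one reported later in this thread; whether these values suit a given rollout is an assumption.

```yaml
metrics:
- name: client_events
  interval: 1m
  failureLimit: 3            # measurements that evaluate to a failed condition
  consecutiveErrorLimit: 4   # measurements that error out, e.g. a nil result
  successCondition: default(result, 0) > 0   # treat a missing result as 0
  provider:
    datadog:
      interval: 5m
      query: |
        sum:<metric>{env:prod}.as_count()
```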

alexef commented 1 year ago

While evaluating argo-rollouts, I ran into almost all of the issues above.

tfrokt commented 1 year ago

I got it working in the end, but it required a few tweaks:

  1. The analysis template needed the syntax successCondition: default(result, 0) > 0, as described here at the very bottom of the page (it is in the docs). This prevents the mismatched types error, which causes immediate retries(!) that ignore the interval set in the analysis template.
  2. The rollout needed an additional pause step before the actual rollout, e.g. - pause: { duration: 2m }, together with startingStep: 2 (also in the docs), so that the analysis starts after the container is ready.
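
The second tweak can be sketched as follows. This is one possible arrangement, not the exact rollout from this thread: the step order and durations are assumptions, and the index comments assume that steps are counted from zero, as in the docs example for startingStep.

```yaml
strategy:
  canary:
    steps:
    - pause: { duration: 2m }    # index 0: give the new pods time to become ready
    - setWeight: 10              # index 1
    - pause: { duration: 10m }   # index 2: background analysis starts here
    - setWeight: 100
    analysis:
      templates:
      - templateName: rollout-analysis-metrics
      startingStep: 2            # delay the analysis run until step index 2
```
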

tfrokt commented 1 year ago

Here's one of the analysis templates:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: rollout-analysis-metrics
spec:
  metrics:
  - name: metrics_example
    interval: 1m
    failureLimit: 3
    successCondition: default(result, 0) > 0
    provider:
      datadog:
        interval: 1m
        query: |
          sum:<metric>{env:{{ .Values.environment }},version:{{ .Values.image.tag }}}.as_count()

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 60 days with no activity.