Failed job results in successful analysis run

nextrevision commented 4 years ago

I'm using the Job metric provider for pre-promotion validation in a blue-green scenario. The Job fails (as expected), but the AnalysisRun still reports Successful. I expect the AnalysisRun to fail as well, making the revision ineligible for promotion (automated or manual) unless the failure is explicitly ignored. If I set autoPromotionEnabled to true on my Rollout, the revision with the failed Job is promoted automatically.

Rollout Status

$ kubectl argo rollouts -n example get rollout myapp
Name:            myapp
Namespace:       example
Status:          ॥ Paused
Strategy:        BlueGreen
Images:          registry.company.io/myapp:1.0.0 (active, preview)
Replicas:
  Desired:       1
  Current:       2
  Updated:       1
  Ready:         2
  Available:     1

NAME                                                       KIND         STATUS        AGE    INFO
⟳ myapp                                                    Rollout      ॥ Paused      41h
├──# revision:32
│  ├──⧉ myapp-64fc844b69                                   ReplicaSet   ✔ Healthy     113s   preview
│  │  └──□ myapp-64fc844b69-ptx8v                          Pod          ✔ Running     113s   ready:1/1
│  └──α myapp-64fc844b69-32                                AnalysisRun  ✔ Successful  48s    ✖ 1
│     └──⊞ e2f2718a-e8d4-4a7c-8867-7d8e17e6b01d.smoketest.1  Job          ✖ Failed      48s
├──# revision:31
│  ├──⧉ myapp-56c5c9749d                                   ReplicaSet   ✔ Healthy     4m35s  active
│  │  └──□ myapp-56c5c9749d-5qnjr                          Pod          ✔ Running     4m35s  ready:1/1
│  ├──α myapp-56c5c9749d-31.1                              AnalysisRun  ✔ Successful  3m36s  ✔ 1
│  │  └──⊞ 739b3158-dc61-4131-a09a-2b0f09a074a2.smoketest.1  Job          ✔ Successful  3m36s

$ kubectl -n example get pods
NAME                                                   READY   STATUS    RESTARTS   AGE
e2f2718a-e8d4-4a7c-8867-7d8e17e6b01d.smoketest.1-f4n9k   0/1     Error     0          2m20s
myapp-56c5c9749d-5qnjr                                 1/1     Running   0          6m7s
myapp-64fc844b69-ptx8v                                 1/1     Running   0          3m25s

$ kubectl -n example get jobs
NAME                                             COMPLETIONS   DURATION   AGE
e2f2718a-e8d4-4a7c-8867-7d8e17e6b01d.smoketest.1   0/1           2m29s      2m29s

$ kubectl -n example describe job e2f2718a-e8d4-4a7c-8867-7d8e17e6b01d.smoketest.1
Name:           e2f2718a-e8d4-4a7c-8867-7d8e17e6b01d.smoketest.1
Namespace:      example
Selector:       controller-uid=fc389554-01a1-4fee-84e8-76777e857e14
Labels:         analysisrun.argoproj.io/uid=e2f2718a-e8d4-4a7c-8867-7d8e17e6b01d
Annotations:    analysisrun.argoproj.io/metric-name: smoketest
                analysisrun.argoproj.io/name: myapp-64fc844b69-32
Controlled By:  AnalysisRun/myapp-64fc844b69-32
Parallelism:    1
Completions:    1
Start Time:     Thu, 28 May 2020 12:05:22 -0700
Pods Statuses:  0 Running / 0 Succeeded / 1 Failed
Pod Template:
  Labels:  controller-uid=fc389554-01a1-4fee-84e8-76777e857e14
           job-name=e2f2718a-e8d4-4a7c-8867-7d8e17e6b01d.smoketest.1

Analysis Template

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: smoketest
spec:
  args:
  - name: service-url
  metrics:
  - name: smoketest
    failureLimit: 1
    provider:
      job:
        spec:
          backoffLimit: 0
          template:
            spec:
              containers:
              - name: smoketest
                image: smoketest:image
                args:
                  - "{{ args.service-url }}"

Rollout

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  annotations:
    rollout.argoproj.io/revision: "32"
  name: myapp
  namespace: example
  resourceVersion: "3948767"
spec:
  progressDeadlineSeconds: 300
  replicas: 1
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: myapp
  strategy:
    blueGreen:
      activeService: myapp
      autoPromotionEnabled: false
      prePromotionAnalysis:
        args:
        - name: service-url
          value: http://myapp-preview.example.svc.cluster.local:8080
        templates:
        - templateName: smoketest
      previewService: myapp-preview
  template:
    metadata:
      labels:
        app.kubernetes.io/name: myapp
        app.kubernetes.io/version: 1.0.0
    spec:
      containers: [...]
      restartPolicy: Always
      terminationGracePeriodSeconds: 160

khewling commented 4 years ago

We are seeing the same issue.

ruben-ojeda-mednax commented 4 years ago

Please fix this issue. The webmetric_test.go file only has scenarios where webServerStatus == 200, but we are getting 404 status code responses from the service (webhook) and the rollout is still considered "Healthy".

abatilo commented 3 years ago

@jessesuen Any idea when this will get worked on? This seems like a non-starter for people who want automatic rollback on deployments.

abatilo commented 3 years ago

If someone could give me a hint on where to start, I'd be willing to try and contribute!

abatilo commented 3 years ago

Following up here. It appears that setting failureLimit to 0 caused the rollout to roll back. Someone had posted a similar question in the Argo Rollouts Slack, and Jesse mentioned this setting there. Maybe others here should try explicitly setting that value to 0?

jessesuen commented 3 years ago

There have been a lot of issues I've been working through with blue-green and analysis; the combination was basically broken. I think this issue may be covered by the v0.9.2 work in progress, as I have been focusing a lot on blue-green + analysis issues.

jessesuen commented 3 years ago

v0.9.2 is released with many fixes to blue-green in conjunction with analysis. I believe the issue is resolved, but please reopen if it is still an issue.

ruben-ojeda-mednax commented 3 years ago

This is to confirm the issue was resolved.

ahmetavc commented 3 years ago

@jessesuen I am facing the same issue. My version is: Image: argoproj/argo-rollouts:v0.10.2

keithmattix commented 3 years ago

@jessesuen I'm seeing this happen with the release candidate v1.0.0-rc1 with helm chart version 0.5.0

ixxeL2097 commented 2 years ago

Same issue with release candidate 1.2.0-rc2. Does anyone have a hint about this problem?

ruben-ojeda-mednax commented 2 years ago

We haven't seen this issue again. A failed rollout always goes into a "Degraded" status if it fails the AnalysisRun. We are also running the same version.

ixxeL2097 commented 2 years ago

OK, I see. I am currently investigating this. I am most likely doing something wrong, but my job fails and the AnalysisRun stays healthy instead of degrading my rollout. Maybe it has something to do with my Istio sidecar injection container, I guess.

Here is my ArgoCD result; you can see the job/pod in failed status but the AnalysisRun healthy.

ruben-ojeda-mednax commented 2 years ago

No worries. If the AnalysisRun STATUS was "Error", then the rollout STATUS should be "Degraded", but if it was "Successful", I would recommend you look for something else downstream.

ixxeL2097 commented 2 years ago

I tried disabling Istio sidecar injection and still see the same behaviour.

The AnalysisRun status is Successful and I don't understand why.

ruben-ojeda-mednax commented 2 years ago

Your pictures show jobs failing. Is your AnalysisTemplate provider a job, as described at https://argoproj.github.io/argo-rollouts/analysis/job/ ? I would double-check the job command's exit code too, just in case.

ixxeL2097 commented 2 years ago

Indeed, my AnalysisTemplate provider is a job. I will double-check the exit code first.

ixxeL2097 commented 2 years ago

OK, I figured it out. It seems I used the wrong image to execute my job. I was using curlimages/curl:latest at first, and replacing that image with a different one from my personal library worked. Anyway, thanks a lot for your help, dude!

EDIT: Actually I was wrong... The problem was not the image but the options I put into my job. Adding these options:

          ttlSecondsAfterFinished: 1000
          activeDeadlineSeconds: 120

caused the strange behaviour of a failed job with a successful AnalysisRun. So maybe there is a bug here.

RE EDIT: OK, sorry for saying bullshit. I finally understood the true reason. My count was equal to 1 and my failureLimit was also equal to 1. You need count > failureLimit to make it work. Anyway... it's late, I'm tired, and I should have gone to bed instead of talking nonsense. Maybe it will help someone :) Good night
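
For anyone landing here later, the count/failureLimit interaction can be sketched in a few lines of Python. This is only an illustration, not the controller's actual code: it assumes a metric is marked Failed only when its number of failed measurements exceeds failureLimit, and Successful once count measurements have finished. The names MetricResult and assess_metric are hypothetical. Under that assumption, a single failed Job with failureLimit: 1 never exceeds the limit, which might also be what happened with the template at the top of this thread (it sets failureLimit: 1 for a single Job).

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Tally of finished measurements for one metric (hypothetical type)."""
    successful: int = 0
    failed: int = 0


def assess_metric(result: MetricResult, count: int, failure_limit: int = 0) -> str:
    """Return the metric phase under the assumed rule.

    Assumption: the metric fails only when failed measurements
    exceed failure_limit; otherwise it succeeds once `count`
    measurements have finished, and keeps running until then.
    """
    if result.failed > failure_limit:
        return "Failed"
    if result.successful + result.failed >= count:
        return "Successful"
    return "Running"


one_failed_job = MetricResult(failed=1)

# count == failureLimit == 1: one failure does not exceed the limit.
print(assess_metric(one_failed_job, count=1, failure_limit=1))  # Successful

# failureLimit 0: the first failure already exceeds the limit.
print(assess_metric(one_failed_job, count=1, failure_limit=0))  # Failed
```

If this model is right, any setup with count <= failureLimit can never report Failed, so for a one-shot smoke test the limit would need to be 0.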

adecchi-2inno commented 2 years ago

I have the same behaviour:

Name:    prometheus-metrics-success-rate
Phase:   Successful
Count:   1
Failed:  1
Measurements:
  Finished At:  2022-03-15T14:28:05Z
  Phase:        Failed
  Started At:   2022-03-15T14:28:05Z
  Value:        [NaN]

My failureLimit is 1 for each AnalysisTemplate, and my count is 3 for each AnalysisTemplate too.

argoproj / argo-rollouts, issue #521