argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/
Apache License 2.0

Terminated metric is marked as successful under analysisrun status #1329

Open mclarke47 opened 3 years ago

mclarke47 commented 3 years ago

Summary

Terminated metric is marked as successful under analysisrun.status.metricResults[].

  - count: 1
    measurements:
    - finishedAt: "2021-07-07T14:45:45Z"
      message: metric terminated
      metadata:
        job-name: 951f5097-1cdc-4308-9e18-650c0d87c5c4.my-metric.1
      phase: Successful
      startedAt: "2021-07-07T14:44:45Z"
    message: metric terminated
    name: my-metric
    phase: Successful
    successful: 1

A metric result was terminated, I assume because the rollout was legitimately failing. However, the successful field reads 1, which isn't really true, as the job/pod for the metric was never run.

Diagnostics

What version of Argo Rollouts are you running? v0.10.0

time="2021-07-07T14:45:34Z" level=info msg="terminating in-progress measurement" analysisrun=my-ar metric=my-metric namespace=my-ns
time="2021-07-07T14:45:34Z" level=info msg="job my-rollout/951f5097-1cdc-4308-9e18-650c0d87c5c4.my-metric.1 terminated" analysisrun=my-ar metric=my-metric namespace=my-ns

jessesuen commented 3 years ago

FWIW, a terminated analysis run equating to success was actually by design, for the following reasons:

  1. It was the best option to choose from the existing phases without introducing a new state in the state machine:

    • Pending
    • Running
    • Successful
    • Failed
    • Error
    • Inconclusive
  2. With background analysis (which runs indefinitely), the AnalysisRun will always be terminated by the rollout at the end of the steps. We wanted this to be interpreted as: the background analysis successfully ran for the entire update without failing or erroring.
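
To make the second point concrete, here is a rough sketch of a background analysis setup. The names `my-rollout`, `my-app` and `background-check`, the image, and the Prometheus address and query are placeholders chosen for illustration, not values from this issue. Because the metric sets an `interval` but no `count`, it keeps taking measurements until the rollout terminates the AnalysisRun at the end of the steps.

```yaml
# Hypothetical AnalysisTemplate: all names, the address and the query are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: background-check
spec:
  metrics:
  - name: success-rate
    interval: 1m            # no count set, so the metric runs until terminated
    successCondition: result[0] >= 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(rate(http_requests_total{status!~"5.."}[1m])) / sum(rate(http_requests_total[1m]))
---
# Hypothetical Rollout that runs the template above in the background.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-rollout
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: example.com/my-app:latest
  strategy:
    canary:
      analysis:             # background analysis, started alongside the steps
        templates:
        - templateName: background-check
      steps:
      - setWeight: 20
      - pause: {duration: 10m}
      - setWeight: 50
      - pause: {duration: 10m}
```

In a setup like this, the only way the AnalysisRun ever finishes without failing is by being terminated, which is why termination was mapped to Successful.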

NaveenVastare commented 3 years ago

We are also observing a similar issue. Any suggestions on this?

jessesuen commented 3 years ago

@NaveenVastare what behavior are you expecting, and what status should the metric have?

NaveenVastare commented 3 years ago

The pod status is CrashLoopBackOff, but the AnalysisRun reported its status as Successful, which is not expected. The AnalysisRun reported Successful without the job metrics having run, which leads to a rollback to a bad version.

Name:            application-app
Namespace:       app
Status:          ◌ Progressing
Message:         updated replicas are still becoming available
Strategy:        BlueGreen
Images:          62.dkr.ecr.ap-south-1.amazonaws.com/application:application-47 (stable, active)
Replicas:
  Desired:       1
  Current:       1
  Updated:       1
  Ready:         0
  Available:     0

NAME                                                          KIND         STATUS              AGE    INFO
⟳ application-app                                              Rollout      ◌ Progressing       7d     
├──# revision:4                                                                                       
│  ├──⧉ application-app-579f49bb4b                              ReplicaSet   ◌ Progressing       19h    stable,active
│  │  └──□ application-app-579f49bb4b-xqqq9                     Pod          ✖ CrashLoopBackOff  23m    ready:1/2,restarts:8
│  └──α application-app-579f49bb4b-4-pre                        AnalysisRun  ✔ Successful        19h    ✔ 1
│     └──⊞ 7d33e214-0a03-4563-8ce3-5dfcb9c21ae3.smoke-test.1    Job          ✔ Successful        19h   

Job status:

  status:
    message: run terminated
    metricResults:

anhtranquang commented 2 years ago

Hi @jessesuen, is there any workaround for this issue? The expected behavior should be to roll back the rollout when the pod status is CrashLoopBackOff, IMHO!

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 60 days with no activity.

oyesaurav commented 2 months ago

I was facing the same issue a few days ago, i.e. the analysis run was terminated, primarily due to the CrashLoopBackOff of the pods related to the rollout.

I didn't find a way to abort the rollout in such cases, but it is possible to prevent the CrashLoopBackOff from terminating the analysis run. I figured out that I was missing a readiness probe in one of my Rollout resources. With the probe in place, the Argo app status stays Progressing until the container is ready, and the rollout does not proceed to the analysis stage until then. This solved my issue for the time being. I also added an Argo notification trigger that fires when the Progressing state lasts more than 2 minutes, to alert me if the container is unable to start. I would appreciate any suggestions!
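
For anyone wanting to try the same workaround, a minimal sketch of a readiness probe on the Rollout's pod template could look like the fragment below. The container name, image, port, probe path and timing values are assumptions for illustration and need to match your own application. The selector and strategy sections are left out for brevity.

```yaml
# Fragment of a Rollout spec; all names and values are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-rollout
spec:
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: example.com/my-app:latest
        ports:
        - containerPort: 8080
        readinessProbe:          # pod is not Ready until this probe succeeds
          httpGet:
            path: /healthz       # assumed health endpoint
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
          failureThreshold: 3
```

With a probe like this, a crash-looping container never becomes Ready, so the new ReplicaSet does not become available and the rollout stays in Progressing.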