Open mclarke47 opened 3 years ago
FWIW, terminated analysis run equating to success was actually by design for the following reasons:
It was the best option to choose from the existing phases without introducing a new state in the state machine:
With background analysis (which run indefinitely), the AnlaysisRun will always be terminated by the rollout at the end of the steps. We wanted this to be interpreted as: the background analysis successfully ran for the entire update without failing or erroring.
we are also observing similar issue. any suggestion on this?
@NaveenVastare what are you expecting the behavior to be and the status of the metric?
Pod Status is in CrashLoopBackOff but AnalysisRun gave status as Successful which is not expected. Without running job metrics, AnalysisRun gave status as Successful which is leading to rollback to bad version.
Name: application-app
Namespace: app
Status: ◌ Progressing
Message: updated replicas are still becoming available
Strategy: BlueGreen
Images: 62.dkr.ecr.ap-south-1.amazonaws.com/application:application-47 (stable, active)
Replicas:
Desired: 1
Current: 1
Updated: 1
Ready: 0
Available: 0
NAME KIND STATUS AGE INFO
⟳ application-app Rollout ◌ Progressing 7d
├──# revision:4
│ ├──⧉ application-app-579f49bb4b ReplicaSet ◌ Progressing 19h stable,active
│ │ └──□ application-app-579f49bb4b-xqqq9 Pod ✖ CrashLoopBackOff 23m ready:1/2,restarts:8
│ └──α application-app-579f49bb4b-4-pre AnalysisRun ✔ Successful 19h ✔ 1
│ └──⊞ 7d33e214-0a03-4563-8ce3-5dfcb9c21ae3.smoke-test.1 Job ✔ Successful 19h
Job status :
status: message: run terminated metricResults:
hi @jessesuen , any workaround for this issue? The expected behavior should be: Rollback the rollout when the pod status is CrashLoopBackOff imho!
This issue is stale because it has been open 60 days with no activity.
I was facing the same issue, a few days ago i.e. the analysis run was terminated primarily due to the CrashLoopBackOff of the pods related to the rollout,
I didn't find a solution to abort the rollout in such cases, but can prevent the CrashLoopBackOff to terminate the analysis run. I figured I was missing a Readiness Probe in one of my Rollout resource. This will make the argo app status progressing till the container is not ready and the rollout will not proceed to the analysis stage till its ready. This solved my issue for time being, and gave an argo notif trigger for progressing state more than 2 min, to alert if the container is not able to start. Would appreciate any suggestion!
Summary
Terminated metric is marked as successful under
analysisrun.status.metricResults[]
.A metric result was terminated, I'm assuming because the rollout was failing legitimately. However the
successful
field reads1
which isn't really true as the job/po for the metric was never run.Diagnostics
What version of Argo Rollouts are you running? V0.10.0
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.