argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/
Apache License 2.0

Rollout never ran analysis run, ProgressDeadlineExceeded #2863

Closed meeech closed 1 year ago

meeech commented 1 year ago


Hit a weird bug yesterday on Argo Rollouts 1.5.1.

A rollout happened and the pod was crashlooping. The problem is that the analysis run never started. I expected the analysis run to execute, as it had every time before.

I found this:

Message: ReplicaSet "smoke-test-go-client-service-v1-7f878b67b9" has timed out progressing.
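
For anyone who wants to check for this state programmatically, here is a minimal sketch. It assumes the Rollout's `status` has already been loaded as a Python dict (for example from `kubectl get rollout ... -o json`); the condition shape is trimmed from the `Patched:` log line further down in this issue. The function name `progress_deadline_exceeded` is made up for illustration and is not part of Argo Rollouts.

```python
def progress_deadline_exceeded(status: dict) -> bool:
    """Return True if the Progressing condition reports ProgressDeadlineExceeded."""
    for cond in status.get("conditions", []):
        if (
            cond.get("type") == "Progressing"
            and cond.get("status") == "False"
            and cond.get("reason") == "ProgressDeadlineExceeded"
        ):
            return True
    return False


# Trimmed copy of the status patched at 19:39:13Z in the controller logs.
sample_status = {
    "phase": "Degraded",
    "conditions": [
        {
            "type": "Available",
            "status": "True",
            "reason": "AvailableReason",
        },
        {
            "type": "Progressing",
            "status": "False",
            "reason": "ProgressDeadlineExceeded",
            "message": 'ReplicaSet "smoke-test-go-client-service-v1-7f878b67b9" '
                       "has timed out progressing.",
        },
    ],
}

print(progress_deadline_exceeded(sample_status))
```

This only inspects a snapshot of the status; it says nothing about why the analysis run did not start.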

I am unable to reproduce it. When I redeployed, everything behaved as expected: the pod started crashlooping, the analysis run started and correctly aborted the rollout, and I got a notification.

What I'm trying to understand is:

This is a basic canary-strategy rollout with no traffic routing.

Attached are the rollout controller logs from that time period. Let me know if there is any other info I can provide.

k8s_pod_name,message
argo-rollouts-v1-679ffdbccc-tff77,"time=""2023-06-28T19:29:11Z"" level=info msg=""Created ReplicaSet smoke-test-go-client-service-v1-7f878b67b9"" namespace=default rollout=smoke-test-go-client-service-v1"
argo-rollouts-v1-679ffdbccc-tff77,"time=""2023-06-28T19:29:11Z"" level=info msg=""Enqueueing parent of default/smoke-test-go-client-service-v1-7f878b67b9: Rollout default/smoke-test-go-client-service-v1"""
argo-rollouts-v1-679ffdbccc-tff77,"time=""2023-06-28T19:29:11Z"" level=info msg=""Enqueueing parent of default/smoke-test-go-client-service-v1-7f878b67b9: Rollout default/smoke-test-go-client-service-v1"""
argo-rollouts-v1-679ffdbccc-tff77,"time=""2023-06-28T19:29:11Z"" level=info msg=""Enqueueing parent of default/smoke-test-go-client-service-v1-7f878b67b9: Rollout default/smoke-test-go-client-service-v1"""
argo-rollouts-v1-679ffdbccc-tff77,"time=""2023-06-28T19:29:11Z"" level=info msg=""Enqueueing parent of default/smoke-test-go-client-service-v1-7f878b67b9: Rollout default/smoke-test-go-client-service-v1"""
argo-rollouts-v1-679ffdbccc-tff77,"time=""2023-06-28T19:29:11Z"" level=info msg=""Enqueueing parent of default/smoke-test-go-client-service-v1-7f878b67b9: Rollout default/smoke-test-go-client-service-v1"""
argo-rollouts-v1-679ffdbccc-tff77,"time=""2023-06-28T19:29:11Z"" level=info msg=""Not finished reconciling new ReplicaSet 'smoke-test-go-client-service-v1-7f878b67b9'"" namespace=default rollout=smoke-test-go-client-service-v1"
argo-rollouts-v1-679ffdbccc-tff77,"time=""2023-06-28T19:29:11Z"" level=info msg=""Created ReplicaSet smoke-test-go-client-service-v1-7f878b67b9 (revision 18)"" event_reason=NewReplicaSetCreated namespace=default rollout=smoke-test-go-client-service-v1"
argo-rollouts-v1-679ffdbccc-tff77,"time=""2023-06-28T19:29:11Z"" level=info msg=""Updating replica set 'smoke-test-go-client-service-v1-7f878b67b9' revision from 0 to 18"" namespace=default rollout=smoke-test-go-client-service-v1"
argo-rollouts-v1-679ffdbccc-tff77,"time=""2023-06-28T19:29:11Z"" level=info msg=""Scaled up ReplicaSet smoke-test-go-client-service-v1-7f878b67b9 (revision 18) from 0 to 1"" event_reason=ScalingReplicaSet namespace=default rollout=smoke-test-go-client-service-v1"
argo-rollouts-v1-679ffdbccc-tff77,"time=""2023-06-28T19:29:11Z"" level=error msg=""Operation cannot be fulfilled on replicasets.apps \""smoke-test-go-client-service-v1-7f878b67b9\"": the object has been modified; please apply your changes to the latest version and try again\n"" error=""<nil>"""
argo-rollouts-v1-679ffdbccc-tff77,"time=""2023-06-28T19:29:11Z"" level=info msg=""Event(v1.ObjectReference{Kind:\""Rollout\"", Namespace:\""default\"", Name:\""smoke-test-go-client-service-v1\"", UID:\""34a2d73c-d25e-47ee-ac1c-99d485ba4fbd\"", APIVersion:\""argoproj.io/v1alpha1\"", ResourceVersion:\""2030092503\"", FieldPath:\""\""}): type: 'Normal' reason: 'NewReplicaSetCreated' Created ReplicaSet smoke-test-go-client-service-v1-7f878b67b9 (revision 18)"""
argo-rollouts-v1-679ffdbccc-tff77,"time=""2023-06-28T19:29:11Z"" level=error msg=""rollout syncHandler error: Operation cannot be fulfilled on replicasets.apps \""smoke-test-go-client-service-v1-7f878b67b9\"": the object has been modified; please apply your changes to the latest version and try again"" namespace=default rollout=smoke-test-go-client-service-v1"
argo-rollouts-v1-679ffdbccc-tff77,"time=""2023-06-28T19:29:11Z"" level=error msg=""roCtx.reconcile err Operation cannot be fulfilled on replicasets.apps \""smoke-test-go-client-service-v1-7f878b67b9\"": the object has been modified; please apply your changes to the latest version and try again"" generation=14 namespace=default resourceVersion=2030092497 rollout=smoke-test-go-client-service-v1"
argo-rollouts-v1-679ffdbccc-tff77,"time=""2023-06-28T19:29:11Z"" level=info msg=""Event(v1.ObjectReference{Kind:\""Rollout\"", Namespace:\""default\"", Name:\""smoke-test-go-client-service-v1\"", UID:\""34a2d73c-d25e-47ee-ac1c-99d485ba4fbd\"", APIVersion:\""argoproj.io/v1alpha1\"", ResourceVersion:\""2030092521\"", FieldPath:\""\""}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled up ReplicaSet smoke-test-go-client-service-v1-7f878b67b9 (revision 18) from 0 to 1"""
argo-rollouts-v1-679ffdbccc-tff77,"time=""2023-06-28T19:29:11Z"" level=info msg=""Set rollout condition: &RolloutCondition{Type:Progressing,Status:True,LastUpdateTime:2023-06-28 19:29:11.441009559 +0000 UTC m=+417256.015960759,LastTransitionTime:2023-06-28 19:29:11.441009639 +0000 UTC m=+417256.015960827,Reason:NewReplicaSetCreated,Message:Created new replica set \""smoke-test-go-client-service-v1-7f878b67b9\"",}"" namespace=default rollout=smoke-test-go-client-service-v1"
argo-rollouts-v1-679ffdbccc-tff77,"time=""2023-06-28T19:29:12Z"" level=info msg=""Patched: {\""status\"":{\""HPAReplicas\"":5,\""conditions\"":[{\""lastTransitionTime\"":\""2023-06-26T15:14:09Z\"",\""lastUpdateTime\"":\""2023-06-26T15:14:09Z\"",\""message\"":\""Rollout is paused\"",\""reason\"":\""RolloutPaused\"",\""status\"":\""False\"",\""type\"":\""Paused\""},{\""lastTransitionTime\"":\""2023-06-28T01:39:58Z\"",\""lastUpdateTime\"":\""2023-06-28T01:39:58Z\"",\""message\"":\""Rollout has minimum availability\"",\""reason\"":\""AvailableReason\"",\""status\"":\""True\"",\""type\"":\""Available\""},{\""lastTransitionTime\"":\""2023-06-28T19:29:11Z\"",\""lastUpdateTime\"":\""2023-06-28T19:29:11Z\"",\""message\"":\""Rollout is not healthy\"",\""reason\"":\""RolloutHealthy\"",\""status\"":\""False\"",\""type\"":\""Healthy\""},{\""lastTransitionTime\"":\""2023-06-28T19:29:11Z\"",\""lastUpdateTime\"":\""2023-06-28T19:29:11Z\"",\""message\"":\""RolloutCompleted\"",\""reason\"":\""RolloutCompleted\"",\""status\"":\""False\"",\""type\"":\""Completed\""},{\""lastTransitionTime\"":\""2023-06-26T15:14:09Z\"",\""lastUpdateTime\"":\""2023-06-28T19:29:12Z\"",\""message\"":\""ReplicaSet \\\""smoke-test-go-client-service-v1-7f878b67b9\\\"" is progressing.\"",\""reason\"":\""ReplicaSetUpdated\"",\""status\"":\""True\"",\""type\"":\""Progressing\""}],\""replicas\"":5,\""updatedReplicas\"":1}}"" generation=14 namespace=default resourceVersion=2030092549 rollout=smoke-test-go-client-service-v1"
argo-rollouts-v1-679ffdbccc-tff77,"time=""2023-06-28T19:39:13Z"" level=info msg=""Patched: {\""status\"":{\""conditions\"":[{\""lastTransitionTime\"":\""2023-06-26T15:14:09Z\"",\""lastUpdateTime\"":\""2023-06-26T15:14:09Z\"",\""message\"":\""Rollout is paused\"",\""reason\"":\""RolloutPaused\"",\""status\"":\""False\"",\""type\"":\""Paused\""},{\""lastTransitionTime\"":\""2023-06-28T01:39:58Z\"",\""lastUpdateTime\"":\""2023-06-28T01:39:58Z\"",\""message\"":\""Rollout has minimum availability\"",\""reason\"":\""AvailableReason\"",\""status\"":\""True\"",\""type\"":\""Available\""},{\""lastTransitionTime\"":\""2023-06-28T19:29:11Z\"",\""lastUpdateTime\"":\""2023-06-28T19:29:11Z\"",\""message\"":\""Rollout is not healthy\"",\""reason\"":\""RolloutHealthy\"",\""status\"":\""False\"",\""type\"":\""Healthy\""},{\""lastTransitionTime\"":\""2023-06-28T19:29:11Z\"",\""lastUpdateTime\"":\""2023-06-28T19:29:11Z\"",\""message\"":\""RolloutCompleted\"",\""reason\"":\""RolloutCompleted\"",\""status\"":\""False\"",\""type\"":\""Completed\""},{\""lastTransitionTime\"":\""2023-06-28T19:39:13Z\"",\""lastUpdateTime\"":\""2023-06-28T19:39:13Z\"",\""message\"":\""ReplicaSet \\\""smoke-test-go-client-service-v1-7f878b67b9\\\"" has timed out progressing.\"",\""reason\"":\""ProgressDeadlineExceeded\"",\""status\"":\""False\"",\""type\"":\""Progressing\""}],\""message\"":\""ProgressDeadlineExceeded: ReplicaSet \\\""smoke-test-go-client-service-v1-7f878b67b9\\\"" has timed out progressing.\"",\""phase\"":\""Degraded\""}}"" generation=14 namespace=default resourceVersion=2030092551 rollout=smoke-test-go-client-service-v1"
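
As a quick sanity check on the timing, the sketch below computes the gap between the two `Patched:` log lines above. It assumes the Rollout spec (not shown in this issue) leaves `progressDeadlineSeconds` at its documented default of 600 seconds; if the spec overrides it, the comparison does not apply.

```python
from datetime import datetime, timezone

# Timestamps copied from the two "Patched:" log lines above.
PROGRESSING_AT = "2023-06-28T19:29:12Z"
DEADLINE_EXCEEDED_AT = "2023-06-28T19:39:13Z"

# Assumption: progressDeadlineSeconds is left at its default (600s).
ASSUMED_PROGRESS_DEADLINE_SECONDS = 600


def parse_log_time(value: str) -> datetime:
    """Parse the controller's UTC log timestamp format."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds from start to end."""
    return (parse_log_time(end) - parse_log_time(start)).total_seconds()


elapsed = seconds_between(PROGRESSING_AT, DEADLINE_EXCEEDED_AT)
print(f"elapsed: {elapsed:.0f}s")  # elapsed: 601s
print(elapsed >= ASSUMED_PROGRESS_DEADLINE_SECONDS)  # True
```

The roughly ten-minute gap is consistent with the new ReplicaSet making no progress until the deadline expired.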


meeech commented 1 year ago

After looking at this further (thanks @zachaller), here is what I think happened. Hopefully it helps others.

What might have happened:

Re: it working before: I think that in some previous cases (this is across many different deployments) the pod managed to enter a ready state before it started crashing, so we successfully passed step 0 and everything worked as expected.
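
To make that reasoning concrete, here is a toy model of the behaviour described above. It is not the controller's actual logic, and the two-step layout (`setWeight` at step 0, `analysis` at step 1) is an assumption, since the Rollout spec is not included in this issue. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical step layout: step 0 scales up the canary, step 1 runs analysis.
STEPS = ["setWeight", "analysis"]


@dataclass
class CanaryState:
    current_step: int = 0
    analysis_started: bool = False


def reconcile(state: CanaryState, canary_ready: bool) -> None:
    """One simplified reconcile pass over the current step."""
    step = STEPS[state.current_step]
    if step == "setWeight":
        # The step only completes once the canary pod has been ready.
        if canary_ready:
            state.current_step += 1
    elif step == "analysis":
        state.analysis_started = True


def simulate(readiness: List[bool]) -> CanaryState:
    """Run one reconcile pass per observed readiness value."""
    state = CanaryState()
    for ready in readiness:
        reconcile(state, ready)
    return state


# Pod crashloops without ever becoming ready: stuck at step 0, no analysis.
never_ready = simulate([False] * 5)
print(never_ready.current_step, never_ready.analysis_started)  # 0 False

# Pod is briefly ready before crashing: step 0 passes, analysis starts.
briefly_ready = simulate([False, True, False])
print(briefly_ready.current_step, briefly_ready.analysis_started)  # 1 True
```

In the first case the only thing left to fire is the progress deadline, which matches the `ProgressDeadlineExceeded` condition in the logs.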

What could have been done to get alerted about this using Rollouts:

So two actions were taken in this scenario to fix the issue (both are configuration changes; this was not a Rollouts issue at all):