argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/

Rollouts doesn't drop old replicaset/pods #3836

ajax-bychenok-y opened this issue 1 month ago (status: Open)

ajax-bychenok-y commented 1 month ago


Describe the bug

Sometimes Argo Rollouts switches the release to a new version (ReplicaSet) but doesn't remove the old one, so its pods keep running. After some digging I realized that the cause is a blank value in the annotation argo-rollouts.argoproj.io/scale-down-deadline: "" where a valid deadline timestamp should be set. Because of that, the controller can never scale the old ReplicaSet down later.
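For anyone hitting the same symptom, one way to confirm it is to read the annotation on the lingering ReplicaSet directly; a stuck ReplicaSet shows an empty value instead of an RFC3339 timestamp. This is just a sketch: <namespace> and <old-replicaset> are placeholders, not values from this issue.

kubectl -n <namespace> get rs <old-replicaset> -o jsonpath='{.metadata.annotations.argo-rollouts\.argoproj\.io/scale-down-deadline}'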

To Reproduce

I have no steps to reproduce this problem because it occurs sporadically after the rollout process. Here is my investigation so far:

https://github.com/argoproj/argo-rollouts/issues/1761#issuecomment-2331739689 https://github.com/argoproj/argo-rollouts/issues/1761#issuecomment-2332187024

Expected behavior

The controller should remove the pods of the non-active ReplicaSet.

Screenshots

svc-c6dcc48cb is still alive even though the newer ReplicaSet svc-865f9fcf88 has already been removed and an even newer one, svc-74ff5588fb, is currently serving.

[screenshot]
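Since the screenshot is not reproduced here, the same state can be checked from the CLI by listing the ReplicaSets together with their scale-down deadlines; a sketch, assuming all revisions live in the same namespace (column names are arbitrary):

kubectl -n <namespace> get rs -o custom-columns='NAME:.metadata.name,DESIRED:.spec.replicas,DEADLINE:.metadata.annotations.argo-rollouts\.argoproj\.io/scale-down-deadline'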

Version

app version: v1.7.1+6a99ea9
helm version: 2.37.1

Logs

I have no logs for now.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

zachaller commented 1 month ago

I think this is fixed in 1.7.2, or at least improved. Can you try it?

ajax-bychenok-y commented 1 month ago

[screenshot]

This is the log for v1.7.1+6a99ea9 (we are going to update to the latest version soon, as advised): rollouts-fail-old-replica.json

The most interesting part is:

{"level":"info","msg":"Set 'scale-down-deadline' annotation on 'some-svc-55468fc7cb' to 2024-09-19T09:35:41Z (30s)","namespace":"staging-a","rollout":"some-svc","time":"2024-09-19T09:35:11Z"}
{"level":"info","msg":"synced ephemeral metadata nil to Pod some-svc-55468fc7cb-lkr5l","namespace":"staging-a","rollout":"some-svc","time":"2024-09-19T09:35:12Z"}
{"level":"info","msg":"synced ephemeral metadata nil to Pod some-svc-55468fc7cb-p25vp","namespace":"staging-a","rollout":"some-svc","time":"2024-09-19T09:35:12Z"}
{"level":"info","msg":"Conflict when updating replicaset some-svc-55468fc7cb, falling back to patch","namespace":"staging-a","rollout":"some-svc","time":"2024-09-19T09:35:12Z"}
{"level":"info","msg":"Patching replicaset with patch: {\"metadata\":{\"annotations\":{\"rollout.argoproj.io/desired-replicas\":\"2\",\"rollout.argoproj.io/revision\":\"235\",\"scale-down-deadline\":\"\"},\"labels\":{\"rollouts-pod-template-hash\":\"55468fc7cb\"}},\"spec\":{\"replicas\":2,\"selector\":{\"matchLabels\":{\"rollouts-pod-template-hash\":\"55468fc7cb\"}},\"template\":{\"metadata\":{\"annotations\":{\"ad.datadoghq.com/some-svc.checks\":\"{\\n  \\\"jmx\\\": {\\n    \\\"init_config\\\": {\\n      \\\"is_jmx\\\": true,\\n      \\\"collect_default_metrics\\\": true,\\n      \\\"collect_default_jvm_metrics\\\": true,\\n      \\\"new_gc_metrics\\\": true\\n    },\\n    \\\"instances\\\": [{\\n      \\\"host\\\": \\\"%%host%%\\\",\\n      \\\"port\\\": 8855\\n    }]\\n  }\\n}\\n\"},\"labels\":{\"app.kubernetes.io/instance\":\"some-svc\",\"app.kubernetes.io/managed-by\":\"Helm\",\"app.kubernetes.io/name\":\"some-svc\",\"env_name\":\"staging\",\"env_tag\":\"a\",\"helm.sh/chart\":\"some-svc-0.26.0-773.RELEASE\",\"rollouts-pod-template-hash\":\"55468fc7cb\"}}}}}","namespace":"staging-a","rollout":"some-svc","time":"2024-09-19T09:35:12Z"}
{"level":"info","msg":"synced ephemeral metadata nil to ReplicaSet some-svc-55468fc7cb","namespace":"staging-a","rollout":"some-svc","time":"2024-09-19T09:35:12Z"}
{"generation":485,"level":"info","msg":"No status changes. Skipping patch","namespace":"staging-a","resourceVersion":"185480139","rollout":"some-svc","time":"2024-09-19T09:35:12Z"}
{"generation":485,"level":"info","msg":"Reconciliation completed","namespace":"staging-a","resourceVersion":"185480139","rollout":"some-svc","time":"2024-09-19T09:35:12Z","time_ms":74.843767}

As a result, it sets argo-rollouts.argoproj.io/scale-down-deadline to '' and the old ReplicaSet never goes down.
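As a manual workaround until the upgrade, it may be possible to overwrite the blank annotation with a deadline that has already passed, so that the controller scales the old ReplicaSet down on its next reconciliation. This is only an assumption based on how the annotation is described above, not a confirmed fix; the namespace, ReplicaSet name, and timestamp below are examples taken from this thread:

kubectl -n staging-a annotate rs some-svc-55468fc7cb argo-rollouts.argoproj.io/scale-down-deadline='2024-09-19T10:00:00Z' --overwrite

Alternatively, scaling the stale ReplicaSet to zero by hand (kubectl scale rs ... --replicas=0) should remove the pods, though the empty annotation may still be left in place.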