argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/
Apache License 2.0

setHeaderRoute error and memory leak #3276

Open dtelaroli opened 11 months ago

dtelaroli commented 11 months ago

Describe the bug

Problem 1: argo-rollouts keeps adding duplicate header routes, flooding the virtual service until its content is bigger than etcd supports.

Problem 2: After Problem 1 occurs, the argo-rollouts pod leaks memory until it consumes all of the node's memory; it restarts and the cycle starts again. This happens with any problem that generates a big manifest. I saw the same behavior using an analysis run that collected metrics for 24h.

time="2023-12-27T15:31:07Z" level=warning msg="Request entity too large: limit is 3145728" event_reason=TrafficRoutingError namespace=psm-test rollout=clismo

To Reproduce

I don't know how to reproduce Problem 1. Problem 2 can be reproduced by creating a virtual service with this route duplicated many times:

    - match:
        - headers:
            x-version:
              exact: PR-132-b36d66a
      name: header-route-version
      route:
        - destination:
            host: clismo
            subset: canary
          weight: 100

It takes more than 6k lines for the error to happen. After that, make a change to the rollout so that it starts a new rollout version.
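For anyone who wants to build such a manifest without pasting the route by hand, here is a minimal sketch that repeats the route above and measures the serialized size against the limit quoted in the log. It is only an illustration: the function names, the `hosts` value, and the VirtualService name/namespace are assumptions taken from the log lines, and the compact-JSON size is only an approximation of what the API server actually receives.

```python
import json

# Limit quoted in the controller log ("limit is 3145728"), in bytes.
REQUEST_LIMIT_BYTES = 3145728


def header_route(version="PR-132-b36d66a"):
    """Return one copy of the duplicated header route from the report."""
    return {
        "match": [{"headers": {"x-version": {"exact": version}}}],
        "name": "header-route-version",
        "route": [
            {"destination": {"host": "clismo", "subset": "canary"}, "weight": 100}
        ],
    }


def build_virtual_service(copies):
    """Build a VirtualService dict whose http list repeats the same route.

    The metadata and hosts values are assumptions based on the log lines.
    """
    return {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "VirtualService",
        "metadata": {"name": "clismo", "namespace": "psm-test"},
        "spec": {
            "hosts": ["clismo"],
            "http": [header_route() for _ in range(copies)],
        },
    }


def serialized_size(manifest):
    """Size in bytes of the manifest serialized as compact JSON."""
    return len(json.dumps(manifest, separators=(",", ":")).encode("utf-8"))


if __name__ == "__main__":
    # Print how the size grows with the number of duplicated routes.
    for copies in (1, 1000, 20000):
        size = serialized_size(build_virtual_service(copies))
        print(copies, size, size > REQUEST_LIMIT_BYTES)
```

JSON is valid YAML, so the output of `json.dumps(build_virtual_service(n))` can be written to a file and applied as a manifest in a test cluster.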

Expected behavior

Version

v1.5.0

Logs

# Paste the logs from the rollout controller

# Logs for the entire controller:
kubectl logs -n argo-rollouts deployment/argo-rollouts

# Logs for a specific rollout:
kubectl logs -n argo-rollouts deployment/argo-rollouts | grep rollout=<ROLLOUTNAME>

time="2023-12-27T15:38:47Z" level=info msg="Started syncing rollout" generation=359 namespace=psm-test resourceVersion=3287885544 rollout=clismo
time="2023-12-27T15:38:48Z" level=info msg="Found 1 TrafficRouting Reconcilers" namespace=psm-test rollout=clismo
time="2023-12-27T15:38:48Z" level=info msg="Reconciling TrafficRouting with type 'Istio'" namespace=psm-test rollout=clismo
time="2023-12-27T15:38:50Z" level=warning msg="Request entity too large: limit is 3145728" event_reason=TrafficRoutingError namespace=psm-test rollout=clismo
time="2023-12-27T15:38:50Z" level=error msg="roCtx.reconcile err Request entity too large: limit is 3145728" generation=359 namespace=psm-test resourceVersion=3287885544 rollout=clismo
time="2023-12-27T15:38:50Z" level=info msg="Event(v1.ObjectReference{Kind:\"Rollout\", Namespace:\"psm-test\", Name:\"clismo\", UID:\"15051ab3-a968-4673-b1af-55ac0a8c525d\", APIVersion:\"argoproj.io/v1alpha1\", ResourceVersion:\"3287885544\", FieldPath:\"\"}): type: 'Warning' reason: 'TrafficRoutingError' Request entity too large: limit is 3145728"

zachaller commented 11 months ago

I think this is possibly fixed in 1.6, could you try 1.6.4?

https://github.com/argoproj/argo-rollouts/pull/2887

dtelaroli commented 11 months ago

Hi @zachaller, I have another issue that blocks me from upgrading argo-rollouts: https://github.com/argoproj/argo-rollouts/issues/3223

dtelaroli commented 11 months ago

Anyway, PR https://github.com/argoproj/argo-rollouts/pull/2887 fixes Problem 1, but it doesn't solve Problem 2.

andyliuliming commented 6 months ago

@dtelaroli did you have any findings on the memory footprint issue? We observed a potential memory leak in our env too (memory usage is usually 200Mi, but after 15 days it grows to 600Mi, although we only have about 5 rollouts in our cluster).

dtelaroli commented 6 months ago

@andyliuliming I've discovered that the issue happens when the application syncs a big manifest. There is a size limit, and when the manifest is over the limit, argo-rollouts emits this error on every sync cycle, which generates the memory leak: Request entity too large: limit is 3145728. After fixing the big manifest, the issue disappears.
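As a rough illustration of what "fixing the big manifest" can look like for the duplicated-route case, the sketch below drops exact duplicates from a virtual service's `http` list while keeping the first copy and the original order. The function name is hypothetical and this is not something argo-rollouts does itself; it assumes the manifest has already been loaded into plain Python dicts and lists.

```python
import json


def dedupe_http_routes(http_routes):
    """Drop routes that exactly duplicate an earlier one, keeping order."""
    seen = set()
    unique = []
    for route in http_routes:
        # Canonical JSON (sorted keys) makes structurally equal routes compare equal.
        key = json.dumps(route, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(route)
    return unique
```

Only the first copy of each route is kept, so the relative order of the distinct routes does not change.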

Another issue I had is that the rollout gets an empty step added during setHeaderRoute: - {} This also breaks the rollout and generates the memory leak.
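To check whether a rollout has picked up such an empty step, something like the following sketch could be run against a rollout loaded as a Python dict. The helper name is hypothetical, and it assumes the steps live under `spec.strategy.canary.steps`.

```python
def find_empty_steps(rollout):
    """Return the indexes of canary steps that are empty mappings ({})."""
    steps = (
        rollout.get("spec", {})
        .get("strategy", {})
        .get("canary", {})
        .get("steps")
        or []  # treat a missing or null steps list as empty
    )
    return [i for i, step in enumerate(steps) if step == {}]
```

A step such as `pause: {}` is not reported, because the step itself still has a key; only a bare `- {}` entry counts as empty.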