argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

Gracefully handle k8s resource size limit for large Application CRs #14486

Open crenshaw-dev opened 10 months ago

crenshaw-dev commented 10 months ago

Summary

Intuit had an app fail to sync when it hit ~3k resources managed in a single App. I believe the problem was that it attempted to update the sync status, which contained the status of all 3k resources, and we hit the k8s resource size limit.

We should provide more ways for the user to sacrifice certain features/conveniences to allow the resource to fit within the size limit. Ideas below.

Motivation

3000 really isn't that big a number.

Proposal

  1. Store the status.resources and status.operationState.operation.sync.resources fields (or maybe just the whole status field) as gzip/base64 when the app hits a configurable number of managed resources.
  2. Offload more resource info to Redis, like we did with health data.
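Option 1 (gzip/base64 for the status fields) could look roughly like the sketch below. This is not Argo CD code; the function names are mine, and only the compression round trip is shown:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/base64"
	"io"
)

// compressStatus gzips the marshaled status JSON and base64-encodes it so
// the result can be stored in a string field of the Application CR.
func compressStatus(statusJSON []byte) (string, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(statusJSON); err != nil {
		return "", err
	}
	if err := zw.Close(); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(buf.Bytes()), nil
}

// decompressStatus reverses compressStatus: base64-decode, then gunzip.
func decompressStatus(encoded string) ([]byte, error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return nil, err
	}
	zr, err := gzip.NewReader(bytes.NewReader(raw))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}
```

Resource status JSON is highly repetitive (same group/kind/namespace strings thousands of times), so gzip should compress it well; base64 adds back ~33%, which is why the compression has to win by a wide margin to be worth it.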
leoluz commented 10 months ago

To avoid API breaking changes another suggestion could be:

  1. Add new fields in the status section dedicated to compressed data: status.resourcesGzip and status.operationState.operation.sync.resourcesGzip.
  2. Check if the new CRD state will exceed the 1.5 MiB etcd limit and, if so, use the compressed fields instead of the existing ones.
  3. Change the logic in functions that read the status to check whether the data was persisted in the gzip field, decompress it, and add it back to the original fields.

With this approach, the great majority of users wouldn't be impacted, as the new fields would only be used when the CRD limit is exceeded.

crenshaw-dev commented 10 months ago

Yep, I like this. One question:

Check if the new CRD state will exceed the 1.5 MiB etcd limit

How would you propose to perform that check?

leoluz commented 10 months ago

How would you propose to perform that check?

You have the computed status field state that is about to be persisted, don't you? I was thinking of just checking its size to drive the persistence logic.

crenshaw-dev commented 10 months ago

You have the computed status field state that is about to be persisted, don't you?

Not necessarily. Some places, e.g. persisting operation state, calculate only a patch: https://github.com/argoproj/argo-cd/blob/15eeb307eb03191e7581d8e616072de4fd4b20e0/controller/appcontroller.go#L1250

Even if we know the full status field contents, I see a couple of potential problems: 1) you're missing the sizes of the top-level keys metadata, spec, and operation; 2) marshaling the status before every write operation could be a performance drag.
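For reference, the marshal-and-measure check under discussion would look something like this hypothetical helper (the 1.5 MiB figure is etcd's default request size limit). The extra full JSON marshal on every write is exactly the performance concern above:

```go
package main

import "encoding/json"

// maxRequestBytes approximates etcd's default request size limit
// (--max-request-bytes, 1.5 MiB), the figure cited in this thread.
const maxRequestBytes = 1536 * 1024

// wouldExceedLimit marshals the object that is about to be persisted and
// compares its serialized size against the limit. Note the cost: one full
// JSON marshal per status update, on top of the one the client does anyway.
func wouldExceedLimit(obj any) (bool, error) {
	b, err := json.Marshal(obj)
	if err != nil {
		return false, err
	}
	return len(b) > maxRequestBytes, nil
}
```

It also only measures the object you hand it, which is the first problem above: without metadata, spec, and operation included, the check undercounts the real request size.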

I'd suggest a lightweight, configurable heuristic like "if it manages > N resources, compress."
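The count-based heuristic is cheap by comparison; a sketch, with a made-up default threshold (the thread doesn't fix a value for N):

```go
package main

// defaultCompressThreshold is a hypothetical default; the actual value
// and how it is configured (flag, env var, ConfigMap) are not specified
// in this thread.
const defaultCompressThreshold = 1000

// shouldCompress is the lightweight heuristic suggested above: decide
// purely by managed-resource count, avoiding any marshal of the status.
func shouldCompress(managedResourceCount, threshold int) bool {
	if threshold <= 0 {
		threshold = defaultCompressThreshold
	}
	return managedResourceCount > threshold
}
```

The trade-off is precision: an app with few but very large resources could still blow the limit, while an app with many tiny resources compresses unnecessarily. But the check itself is O(1) per write.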

leoluz commented 10 months ago

I'd suggest a lightweight, configurable heuristic like "if it manages > N resources, compress."

Yes, I like that too.

zswanson commented 9 months ago

Related: we are looking to run Argo at a large scale of applications soon (5k+), and we're concerned about hitting GKE limits, where any single resource type in etcd must stay under 800 MB. An option to always compress statuses, regardless of the number of resources, would be nice.

Google documentation for reference; I assume other cloud vendors have similar limits: https://cloud.google.com/kubernetes-engine/docs/concepts/planning-large-clusters

cjin62 commented 2 weeks ago

Any further updates on when Argo CD will be able to implement these improvements?