argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

ApplicationSet Controller constantly reconciling with enable-progressive-syncs flag #14712

Open ciiay opened 1 year ago

ciiay commented 1 year ago

Describe the bug

After applying ApplicationSet manifests with the enable-progressive-syncs flag, the ApplicationSet controller sends git fetch requests constantly.

To Reproduce

In our case, the customer used the default openshift-gitops Argo CD instance to deploy an ApplicationSet, and the Argo CD manifest sets the enable-progressive-syncs flag:

spec:
  applicationSet:
    extraCommandArgs:
      - --enable-progressive-syncs

Expected behavior

The ApplicationSet controller should only reconcile every 3 minutes, since the default requeue time is 3 minutes. Instead, it is constantly calling GitHub.

Version

Observed this issue on both v2.6.7 and v2.7.6.

Logs

time="2023-07-24T21:47:01Z" level=info msg="git fetch origin main --tags --force --prune" dir=/tmp/https___github.com_some_openshift-iac-multicluster execID=7478a
time="2023-07-24T21:47:01Z" level=debug msg="received update event from owning an application"
time="2023-07-24T21:47:01Z" level=debug msg="requeue: false caused by application nfs-provisioner-operator\n"
time="2023-07-24T21:47:01Z" level=debug msg="received update event from owning an application"
time="2023-07-24T21:47:01Z" level=debug msg="requeue: false caused by application nfs-provisioner-config\n"
time="2023-07-24T21:47:01Z" level=debug msg="received update event from owning an application"
time="2023-07-24T21:47:01Z" level=debug msg="requeue: false caused by application lvms-config\n"
time="2023-07-24T21:47:01Z" level=debug msg="received update event from owning an application"
time="2023-07-24T21:47:01Z" level=debug msg="requeue: false caused by application acm-config\n"
time="2023-07-24T21:47:01Z" level=debug msg="received update event from owning an application"
time="2023-07-24T21:47:01Z" level=debug msg="requeue: false caused by application openshift-virtualization\n"
time="2023-07-24T21:47:01Z" level=debug msg="received update event from owning an application"
time="2023-07-24T21:47:01Z" level=debug msg="requeue: false caused by application acs-operator\n"
time="2023-07-24T21:47:01Z" level=debug msg="received update event from owning an application"
time="2023-07-24T21:47:01Z" level=debug msg="requeue: false caused by application acm-operator\n"
time="2023-07-24T21:47:01Z" level=debug msg="received update event from owning an application"
time="2023-07-24T21:47:01Z" level=debug msg="requeue: false caused by application secured-cluster-policy\n"
time="2023-07-24T21:47:01Z" level=debug msg="received update event from owning an application"
time="2023-07-24T21:47:01Z" level=debug msg="requeue: false caused by application lvms-operator\n"
time="2023-07-24T21:47:01Z" level=debug msg="received update event from owning an application"
time="2023-07-24T21:47:01Z" level=debug msg="requeue: false caused by application ansible-automation-platform-operator\n"
time="2023-07-24T21:47:01Z" level=debug msg="received update event from owning an application"
time="2023-07-24T21:47:01Z" level=debug msg="requeue: false caused by application oauth-config\n"
time="2023-07-24T21:47:01Z" level=debug msg="received update event from owning an application"
time="2023-07-24T21:47:01Z" level=debug msg="requeue: false caused by application acs-central-configuration\n"
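
For anyone trying to quantify this from their own controller logs, here is a minimal Python sketch that tallies `git fetch` calls per minute and update events per application. It assumes the logrus-style `key="value"` line format shown above; `parse_line` and `summarize` are hypothetical helper names, not part of Argo CD.

```python
import re
from collections import Counter

# Matches key="quoted value" or key=bare pairs in a logrus-style line.
_PAIR = re.compile(r'(\w+)=(?:"((?:[^"\\]|\\.)*)"|(\S+))')
# Matches the requeue message, tolerating the literal \n the controller logs.
_CAUSE = re.compile(r'requeue: (true|false) caused by application (\S+?)(?:\\n)?$')


def parse_line(line):
    """Return the key/value pairs of one log line as a dict."""
    return {
        m.group(1): m.group(2) if m.group(2) is not None else m.group(3)
        for m in _PAIR.finditer(line)
    }


def summarize(lines):
    """Count git fetches per minute and update events per application."""
    fetches_per_minute = Counter()
    events_per_app = Counter()
    for line in lines:
        fields = parse_line(line)
        msg = fields.get("msg", "")
        if msg.startswith("git fetch"):
            # Timestamps look like 2023-07-24T21:47:01Z; keep up to the minute.
            fetches_per_minute[fields.get("time", "")[:16]] += 1
        cause = _CAUSE.search(msg)
        if cause:
            events_per_app[cause.group(2)] += 1
    return fetches_per_minute, events_per_app
```

Applied to the excerpt above, it counts one fetch in minute 21:47 and one update event for each of the twelve applications. A steadily high per-minute fetch count over a longer log would be the signal to look for.
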

crenshaw-dev commented 1 year ago

Full conversation about this issue: https://cloud-native.slack.com/archives/C01TSERG0KZ/p1690321766622059?thread_ts=1664885597.178089&cid=C01TSERG0KZ

agaudreault commented 1 year ago

Another symptom, most likely with the same root cause: after enabling progressive sync on an instance (2.8.0-rc5) with only 5 ApplicationSets, our ApplicationSet controller CPU went through the roof, and I could see in the logs "Application <APP> is already synced and healthy, updating its ApplicationSet status to Healthy" 2700x per hour for every Application.
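
To put that number in perspective, here is a quick back-of-the-envelope calculation. It assumes one such log line per reconcile pass and the 3-minute default requeue time mentioned in the original report; the constant and function names are illustrative only.

```python
DEFAULT_REQUEUE_SECONDS = 3 * 60  # default requeue time cited in the report
OBSERVED_PER_HOUR = 2700          # log lines per Application per hour


def per_hour(interval_seconds):
    """Events per hour if one event fires every interval_seconds."""
    return 3600 / interval_seconds


def interval_seconds(events_per_hour):
    """Average gap in seconds between events at the given hourly rate."""
    return 3600 / events_per_hour


expected_per_hour = per_hour(DEFAULT_REQUEUE_SECONDS)   # 20 per hour
observed_gap = interval_seconds(OBSERVED_PER_HOUR)      # about 1.33 seconds
amplification = OBSERVED_PER_HOUR / expected_per_hour   # 135 times the timer rate
```

Under those assumptions each Application is being processed roughly every 1.3 seconds instead of every 180 seconds.
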

crenshaw-dev commented 1 year ago

@wmgroot any guesses? :-)

stylianosrigas commented 1 year ago

This is very similar to what we are facing in https://github.com/argoproj/argo-cd/issues/12878. @crenshaw-dev The fix https://github.com/argoproj/argo-cd/issues/12878#issuecomment-1642257603 did not help with the issue.

crenshaw-dev commented 1 year ago

Yeah, this isn't a normalization issue... My very rough guess is that some field in here is churning: https://github.com/argoproj/argo-cd/blob/c721592d210383dadcf0bf0dfcfce9c7a1794162/applicationset/controllers/applicationset_controller.go#L1408-L1417

crenshaw-dev commented 1 year ago

Maybe the app health is flapping, or maybe the ApplicationSet controller is erroneously re-triggering sync operations, bumping the status.operationState.startedAt value each time.
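
A rough illustration of why a churning field would matter, written as a standalone Python sketch rather than the controller's actual Go code. The `AppStatus` fields are simplified, hypothetical stand-ins for status.health, status.sync, and status.operationState.startedAt, and `should_requeue` is an assumed predicate, not the real one.

```python
from dataclasses import dataclass
from typing import Optional

WATCHED_FIELDS = ("health", "sync", "operation_started_at")


@dataclass(frozen=True)
class AppStatus:
    """Hypothetical, flattened view of the Application status fields."""
    health: str
    sync: str
    operation_started_at: Optional[str] = None


def changed_fields(old, new):
    """Names of watched fields whose values differ between two snapshots."""
    return [f for f in WATCHED_FIELDS if getattr(old, f) != getattr(new, f)]


def should_requeue(old, new):
    """Requeue the owning ApplicationSet if any watched field changed."""
    return bool(changed_fields(old, new))


# If something re-triggers the sync on every pass, startedAt is bumped each
# time, so every update event looks like a real change and requeues again.
before = AppStatus("Healthy", "Synced", "2023-07-24T21:47:01Z")
after = AppStatus("Healthy", "Synced", "2023-07-24T21:47:02Z")
requeue = should_requeue(before, after)  # True: only the timestamp moved
```

Logging the output of something like `changed_fields` for each update event would show which field, if any, is actually churning.
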

stafot commented 1 year ago

Also confirming all of @ciiay's observations. Even with progressiveSyncs disabled, git is hit massively by the applicationSet controller, though not as massively as with it enabled.
The logs we observe are "received update event from owning an application", the requeue, and the unknown Application.

wmgroot commented 1 year ago

Last time I looked into this, I noticed one of the status fields was flipping constantly; it seemed like two different status entries were fighting over the same key. I did look into how the status field was being set, but I didn’t see a clear issue from the progressive sync code.

crenshaw-dev commented 1 year ago

two different status entries were fighting over the same key

Can you clarify what "status entries" means here?

crenshaw-dev commented 1 year ago

Even with progressiveSyncs disabled git is hit massively by applicationSet controller

@stafot I think you're facing a different issue. Are you running a version that includes this fix?

stafot commented 1 year ago

Yes, we are observing this behaviour after updating to the latest Helm chart of Argo CD, which afaiu contains the above-mentioned fix.

crenshaw-dev commented 1 year ago

Gotcha. So separate issue, likely unrelated to progressive syncs.

stafot commented 1 year ago

@crenshaw-dev We are observing exactly the same behaviour that @ciiay describes in this ticket. Our git is getting hammered by the applicationSet controller: with progressive syncs enabled, the hammering increases linearly; with them disabled, it stabilizes, but at a much higher level than our reconciliation activity baseline before 2.6.3. Whenever we move to any version more recent than 2.6.2, we experience this hammering effect. So it may or may not be related to progressive syncs, but it is certainly related to the applicationSet controller from version 2.6.3 onward.

wanddynosios commented 1 year ago

I'm kind of lost between https://github.com/argoproj/argo-cd/issues/12878 and this issue.

We are using appsets (without progressive sync), and after upgrading from v2.7.3 to 2.7.10 we are seeing the following spike:

image

crenshaw-dev commented 1 year ago

To start narrowing down the issue(s), I think we're going to need full details, i.e. ApplicationSet specs and logs.

crenshaw-dev commented 1 year ago

It's also possible that the spike is due to an unrelated issue, since the application controller also triggers checkouts.

crenshaw-dev commented 1 year ago

@stafot #12612 seems like the most likely suspect in your case. Unless you're using multi-source apps, in which case I noticed another possibly suspicious commit.

andrleite commented 1 year ago

@stafot #12612 seems like the most likely suspect in your case. Unless you're using multi-source apps, in which case I noticed another possibly suspicious commit.

@crenshaw-dev We’re using multi-source apps

crenshaw-dev commented 1 year ago

Then #12379 might be involved.

To make progress, I think we have to treat these all as separate issues and open new, fully-described issues for each. If things turn out to be related, we can consolidate. But the symptom "lots of git requests" can have a lot of different possible causes.

stafot commented 1 year ago

OK, let's move these comments to https://github.com/argoproj/argo-cd/issues/12878 then, for the case @andrleite and I are describing.

ericblackburn commented 1 year ago

two different status entries were fighting over the same key

Can you clarify what "status entries" means here?

Matt might be talking about status.applicationStatus: https://github.com/argoproj/argo-cd/issues/15297/. Though, this is a bug where the status flips constantly because the appset isn't a progressive sync type and it ends up getting processed and unprocessed by progressive sync logic. It does put a lot of load on the Argo system.
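
One way to confirm this kind of flapping is to capture successive copies of the field (for example from repeated `kubectl get applicationset <name> -o json` calls) and count how often consecutive snapshots differ. A minimal sketch; `count_flips` is a hypothetical helper, and the entry layout in the example is only illustrative.

```python
def count_flips(snapshots):
    """Count how many times consecutive snapshots of a field differ."""
    return sum(1 for prev, cur in zip(snapshots, snapshots[1:]) if prev != cur)


# Illustrative only: the field alternates between populated and empty,
# as it would if one code path sets it and another clears it.
populated = [{"application": "app-a", "status": "Healthy"}]
observed = [populated, [], populated, [], populated]
flips = count_flips(observed)  # 4 changes across 5 snapshots
```

A stable ApplicationSet should show close to zero flips over the same window.
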

Before and after turning off Progressive Sync for an Argo instance with no appsets opted into Progressive Sync:

image

This seems to indicate that the Progressive Sync logic is putting a large processing load on all AppSets.

The PR to fix this is https://github.com/argoproj/argo-cd/pull/15299