Open tebaly opened 2 months ago
I can't reproduce this issue. Can you tell me how to reproduce it in detail?
This is my guess: configure a HorizontalPodAutoscaler so that it changes the number of replicas automatically. If Argo CD does not react to these automatic changes in the cluster, then there is no problem. A little later I will set these timeouts to 86400 in my cluster and show the logs.
argocd-application-controller-0:
ARGOCD_RECONCILIATION_TIMEOUT: 30000s
ARGOCD_APPLICATION_CONTROLLER_SELF_HEAL_TIMEOUT_SECONDS: 30000
I originally installed via Helm. Now I changed the timeouts in the values file and the controller restarted with the new settings. It still does something endlessly and consumes as much CPU time as it can.
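For reference, this is roughly the values change I made — a sketch assuming the community argoproj/argo-cd Helm chart; the exact keys (`configs.cm`, `controller.env`) depend on the chart version:

```yaml
# Sketch of the Helm values change (argoproj/argo-cd chart assumed).
# timeout.reconciliation in argocd-cm is what feeds ARGOCD_RECONCILIATION_TIMEOUT.
configs:
  cm:
    timeout.reconciliation: 30000s
controller:
  env:
    - name: ARGOCD_APPLICATION_CONTROLLER_SELF_HEAL_TIMEOUT_SECONDS
      value: "30000"
```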
level=info msg="Update successful" application=
level=info msg="Updated health status: Healthy -> Progressing"
level=info msg="Refreshing app status (controller refresh requested), level (1)"
level=info msg="Comparing app state ..."
level=info msg="GetRepoObjs stats" application=
There are many such lines, and they appear in batches separated by a couple of minutes. I think the processor can't go any faster; it's busy with something.
With the settings above I expect the controller to be idle, or at least for the CPU load to be lower, but nothing has changed.
Could it be that the controller has been accumulating tasks in a queue for months and is now trying to work through them all? That is, a queue of tens or hundreds of thousands of tasks that it will keep processing?
It continues doing what I described above, and the message log grows without stopping.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/component: server
    app.kubernetes.io/instance: argocd
    app.kubernetes.io/name: argocd-cm
    app.kubernetes.io/part-of: argocd
data:
  ...
  resource.customizations.ignoreResourceUpdates.all: |
    jsonPointers:
      - /status
      - /metadata
This didn't change anything either.
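A narrower variant I could try next — targeting only the field the HPA mutates instead of the blanket `all` rule. This is a sketch: the `group_Kind` key format follows the Argo CD reconcile-optimization convention, and `/spec/replicas` is my assumption about which field is noisy:

```yaml
# Sketch: ignore only HPA-driven replica changes on Deployments
# when deciding whether an update should trigger a refresh.
resource.customizations.ignoreResourceUpdates.apps_Deployment: |
  jsonPointers:
    - /spec/replicas
```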
How can I debug this message: "Refreshing app status (controller refresh requested)"?
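One way to narrow this down is to tally which refresh reasons dominate the controller log. A sketch, using a hypothetical saved excerpt; in a real cluster the file would come from `kubectl -n argocd logs argocd-application-controller-0 > /tmp/controller.log`:

```shell
# Hypothetical excerpt of controller log lines, for illustration only
cat > /tmp/controller.log <<'EOF'
level=info msg="Refreshing app status (controller refresh requested), level (1)"
level=info msg="Refreshing app status (controller refresh requested), level (1)"
level=info msg="Refreshing app status (comparison expired), level (2)"
EOF

# Count occurrences of each refresh reason, most frequent first
grep -o 'Refreshing app status ([^)]*)' /tmp/controller.log | sort | uniq -c | sort -rn
```

If "controller refresh requested" dominates, something in the cluster (rather than the periodic timer) is driving the refreshes.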
I believe there is a problem with the
spec.syncPolicy.automated.selfHeal
option in Argo CD. The current implementation seems to ignore the configured timeouts and triggers checks more frequently than expected, leading to unnecessary resource consumption.
Details:
I have configured the following timeouts:
ARGOCD_RECONCILIATION_TIMEOUT: 30000s
ARGOCD_APPLICATION_CONTROLLER_SELF_HEAL_TIMEOUT_SECONDS: 30000
My expectation is that the controller will not attempt to change or check the cluster state within these timeout periods. Specifically, I expect to see no synchronization attempts in the controller logs for at least five minutes. However, this is not happening. Instead, the controller constantly attempts to check and synchronize the state, as shown in the logs below:
Hypothesis:
It appears that the controller might be responding to events in the cluster and ignoring the configured timeouts. One likely trigger for these frequent checks is automatic changes to the number of replicas managed by a HorizontalPodAutoscaler; reacting to those changes seems unnecessary in this case.
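If that hypothesis holds, one workaround I would expect to help is telling Argo CD to ignore replica-count drift on Deployments, so HPA scaling does not register as drift for selfHeal to correct. A sketch only — `my-app` is a placeholder name:

```yaml
# Sketch: Application with selfHeal enabled but replica drift ignored,
# assuming HPA-driven replica changes are what keeps triggering self-heal.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app        # hypothetical application name
  namespace: argocd
spec:
  syncPolicy:
    automated:
      selfHeal: true
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas
```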
Request:
I would like the ability to disable all checks and events except for the regular timeout checks. The controller should respect the configured timeouts and not attempt to check or synchronize the cluster state during these intervals unless explicitly required.
Thank you for looking into this issue.