argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

ignoreResourceUpdates does not work #15594

Closed · freedbka closed this issue 2 weeks ago

freedbka commented 1 year ago

Checklist:

Describe the bug
Hello! I have a problem with a reconciliation loop, because some resources are constantly changing (`Requesting app refresh caused by object update`). Following the instructions at https://argo-cd.readthedocs.io/en/stable/operator-manual/reconcile/#finding-resources-to-ignore, I found out which resource triggers the constant updates: the ConfigMap kops-controller-leader in the kube-system namespace, whose metadata is constantly changing:

 control-plane.alpha.kubernetes.io/leader: >-
      {"holderIdentity":"ip-**-**-**-**_*********************","leaseDurationSeconds":15,"acquireTime":"2023-08-29T09:27:07Z","renewTime":"2023-09-20T12:56:56Z","leaderTransitions":0}

This leads to a refresh of all Argo CD applications approximately every second. I tried adding exceptions to argocd-cm as described in the documentation (https://argo-cd.readthedocs.io/en/stable/operator-manual/argocd-cm-yaml/), but it still generates millions of updates per day:

  resource.customizations.ignoreDifferences.all: |
    jqPathExpressions:
    - '.metadata.annotations."control-plane.alpha.kubernetes.io/leader"'
    - .metadata.resourceVersion
    managedFieldsManagers:
    - kube-controller-manager
    - external-secrets
    jsonPointers:
    - /spec/replicas
    - /metadata/resourceVersion
    - /metadata/annotations/control-plane.alpha.kubernetes.io~1leader
  resource.customizations.ignoreResourceUpdates._ConfigMap: |
    jqPathExpressions:
    - '.metadata.annotations."control-plane.alpha.kubernetes.io/leader"'
    - .metadata.resourceVersion
  resource.customizations.ignoreResourceUpdates.all: |
    jqPathExpressions:
    - '.metadata.annotations."control-plane.alpha.kubernetes.io/leader"'
    - .metadata.resourceVersion
    jsonPointers:
    - /status
    - /metadata/resourceVersion
    - /metadata/annotations/control-plane.alpha.kubernetes.io~1leader
  resource.ignoreResourceUpdatesEnabled: 'true'

Screenshots: (screenshot attached in the original issue)

Version

v2.8.3
duizabojul commented 1 year ago

Same here, it slows down all Argo CD operations. The weird thing is that the ConfigMap is not tracked by Argo CD; it is created by a controller, so I don't understand why Argo CD watches it. Maybe because I activated orphaned resources in projects...

kollad commented 1 year ago

Had the same problem; completely removing the orphanedResources option from the main AppProject helped (screenshot attached).
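
For illustration, a minimal sketch of the AppProject block being referred to; the project name is illustrative, and deleting the whole orphanedResources block is what disables orphaned-resource monitoring:

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: default        # illustrative project name
  namespace: argocd
spec:
  # Removing this block entirely turns off orphaned-resource monitoring,
  # so updates to untracked resources no longer trigger app refreshes.
  orphanedResources:
    warn: false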

freedbka commented 1 year ago

@kollad It worked. Thank you!

duizabojul commented 1 year ago

I don't think this should be closed, this behavior is still a bug.

freedbka commented 1 year ago

@duizabojul Ok, I will reopen.

Sathish-rafay commented 10 months ago

Had the same problem; completely removing the orphanedResources option from the main AppProject helped (screenshot attached).

@kollad I'm not getting what needs to change exactly. Can you please elaborate on what exactly needs to change?

savar commented 9 months ago

We have the same situation with an Elasticsearch operator ConfigMap used for leader election. EndpointSlices are also creating a lot of reconciliations.

We tried to ignore the updates, since on the ConfigMap the leader annotation is updated constantly, as is the resourceVersion:

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"elastic-operator-0_098489e3-3b66-4d7b-b17e-ac1555175d69","leaseDurationSeconds":15,"acquireTime":"2023-12-28T16:08:12Z","renewTime":"2024-01-05T11:53:17Z","leaderTransitions":179}'
  creationTimestamp: "2022-06-10T11:38:35Z"
  name: elastic-operator-leader
  namespace: elastic-operator
  resourceVersion: "815237253"
  uid: c876a2c1-efd0-4902-970c-02d4e2531b81

On the EndpointSlices, the renewTime annotation and the resourceVersion are constantly changing:

--- /tmp/first.yaml     2024-01-05 13:07:44.109295084 +0100
+++ /tmp/second.yaml    2024-01-05 13:07:54.485291050 +0100
@@ -19,7 +19,7 @@
     acquireTime: "2023-06-14T09:09:12.099141+00:00"
     leader: some-service-name-0
     optime: "4445962240"
-    renewTime: "2024-01-05T12:07:36.500017+00:00"
+    renewTime: "2024-01-05T12:07:46.499734+00:00"
     transitions: "3"
     ttl: "30"
   creationTimestamp: "2023-06-14T09:09:13Z"
@@ -42,7 +42,7 @@
     kind: Endpoints
     name: some-service-name
     uid: cc85b6fa-178d-4364-aa52-cb270b8ef44d
-  resourceVersion: "815254315"
+  resourceVersion: "815254479"
   uid: 56c48541-440c-4f49-9783-8e5c5338e72d
 ports:
 - name: postgresql

(Side note: the EndpointSlice should probably be ignored anyway, see https://github.com/argoproj/gitops-engine/pull/469.)

We tried to ignore these updates with this config:

# SEE:
#  documentation: https://argo-cd.readthedocs.io/en/release-2.8/operator-manual/reconcile/
#  example config: https://argo-cd.readthedocs.io/en/stable/operator-manual/argocd-cm-yaml/
resource.ignoreResourceUpdatesEnabled: "true"
resource.customizations.ignoreResourceUpdates.all: |
  jsonPointers:
    - /metadata/resourceVersion
resource.customizations.ignoreResourceUpdates.ConfigMap: |
  jqPathExpressions:
    # ElasticOperator is updating this around 2 times per second
    - '.metadata.annotations."control-plane.alpha.kubernetes.io/leader"'
resource.customizations.ignoreResourceUpdates.discovery.k8s.io_EndpointSlice: |
  jsonPointers:
    # EndpointSlices should be ignored completely, as Endpoints already are
    # (see: https://github.com/argoproj/gitops-engine/pull/469). Until that is
    # done automatically, ignoring `/metadata/resourceVersion` for all resources
    # plus ignoring this annotation should reduce the amount of updates significantly.
    - /metadata/annotations/renewTime

So either our config is not correct, or the feature is not working on these resources. They are both "orphaned" resources, so maybe the feature actually doesn't work on non-managed resources?

@Sathish-rafay I think what @kollad meant is https://argo-cd.readthedocs.io/en/stable/user-guide/orphaned-resources/, so removing the setting altogether helped him, as most updates come from these non-managed resources, which update constantly (screenshot attached).

savar commented 9 months ago

So either our config is not correct, or the feature is not working on these resources. They are both "orphaned" resources, so maybe the feature actually doesn't work on non-managed resources?

Just checked: the ConfigMap is an "orphanedResource", but the EndpointSlice supposedly is not. I guess the latter is tracked via the OwnerReference to the Endpoint, which in theory is also not managed by Argo CD, but I bet (though I don't know) that Argo CD knows the managed Service will create an Endpoint and automatically tracks that as "this is managed".

But independently of whether Argo CD thinks the EndpointSlice is managed or not, the updates aren't ignored (at least that's what we saw in the debug logs on the application controller pod).

mick1627 commented 8 months ago

Same for me: ignoreResourceUpdates does not work on orphaned resources.

  resource.customizations.ignoreResourceUpdates.autoscaling.k8s.io_VerticalPodAutoscalerCheckpoint: |
    jsonPointers:
    - /status
  resource.ignoreResourceUpdatesEnabled: 'true' 

I still see the app refresh being requested after the ConfigMap is updated:

{"api-version":"autoscaling.k8s.io/v1","application":"argocd/poc-idp","cluster-name":"anthos-test-nprd","fields.level":1,"kind":"VerticalPodAutoscalerCheckpoint","level":"debug","msg":"Requesting app refresh caused by object update","name":"poc-idp-wordpress","namespace":"poc-idp","server":"https://XXXXX.central-1.amazonaws.com","time":"2024-02-06T17:11:33Z"}
diranged commented 7 months ago

So, I'm curious about this one... I understand how to ignore individual field updates on certain types of objects, but we operate a very fast-moving Kubernetes cluster that launches between 200 and 400k pods daily. When we look at the log entries for "Requesting app refresh caused by object update", we can see that we are getting 25 new pod updates per second.

(screenshot attached)

Is there some way to make ArgoCD ignore Pod/EndpointSlice changes for the purpose of manifest comparison?

giepa commented 7 months ago

We are experiencing the same with resources managed by an operator.

savar commented 7 months ago

Is there some way to make ArgoCD ignore Pod/EndpointSlice changes for the purpose of manifest comparison?

Just out of curiosity (for EndpointSlices): did you try these two things?

  1. disable orphanedResources on your AppProjects
  2. ignore EndpointSlices, for example:

     resource.customizations.ignoreResourceUpdates.discovery.k8s.io_EndpointSlice: |
       jsonPointers:
       - /metadata/annotations/renewTime
       - /metadata/resourceVersion

I am not sure if this will help in ignoring things newly created by an HPA, but it would be interesting to see if it reduces the load somehow.

husira commented 7 months ago

We have the same issue. The Argo Events app creates the following ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"controller-manager-c8d4c76d-f6x4w_3c1ec9c1-9b77-43bb-a943-5b26247a33b6","leaseDurationSeconds":15,"acquireTime":"2024-03-03T19:44:57Z","renewTime":"2024-03-03T20:34:57Z","leaderTransitions":1}'
  creationTimestamp: "2024-03-03T19:44:38Z"
  name: argo-events-controller
  namespace: argo-events
  resourceVersion: "88954276"
  uid: 6a4dfeca-a51d-406a-a26b-486b3539313e

The renewTime inside the metadata.annotations."control-plane.alpha.kubernetes.io/leader" annotation and the resourceVersion change every few seconds.

As long as we have the following AppProject config, we get a reconciliation loop for Argo Events every few seconds:

  orphanedResources:
    warn: false

This leads to higher CPU usage of the argocd-application-controller.

I also tried "resource.ignoreResourceUpdates" inside the argocd-cm, without any success (https://github.com/argoproj/argo-cd/issues/15594#issuecomment-1878577773).

The following debug log shows that the reconciliation is triggered by the ConfigMap "argo-events-controller" in the argo-events namespace:

argocd-application-controller-0 argocd-application-controller time="2024-03-03T20:40:33Z" level=debug msg="Checking if cluster https://kubernetes.default.svc with clusterShard 0 should be processed by shard 0"
argocd-application-controller-0 argocd-application-controller time="2024-03-03T20:40:33Z" level=debug msg="Requesting app refresh caused by object update" api-version=v1 application=argocd/argo-events cluster-name= fields.level=1 kind=ConfigMap name=argo-events-controller namespace=argo-events server="https://kubernetes.default.svc"
argocd-application-controller-0 argocd-application-controller time="2024-03-03T20:40:33Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argocd/argo-events
argocd-application-controller-0 argocd-application-controller time="2024-03-03T20:40:33Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: argo-events)" application=argocd/argo-events
argocd-application-controller-0 argocd-application-controller time="2024-03-03T20:40:33Z" level=info msg="No status changes. Skipping patch" application=argocd/argo-events
argocd-application-controller-0 argocd-application-controller time="2024-03-03T20:40:33Z" level=info msg="Reconciliation completed" application=argocd/argo-events dedup_ms=0 dest-name= dest-namespace=argo-events dest-server="https://kubernetes.default.svc" diff_ms=1 fields.level=1 git_ms=20 health_ms=1 live_ms=1 patch_ms=0 setop_ms=0 settings_ms=0 sync_ms=0 time_ms=43

Removing orphanedResources inside the AppProject is a workaround, but I am surprised that orphaned resources trigger a reconciliation at all. It looks like a bug to me.

mtrin commented 6 months ago

I ended up completely excluding the VPA with resource.exclusions.
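
For illustration, a minimal sketch of what such an exclusion could look like in argocd-cm; the exact API groups and kinds listed here are assumptions for the VPA case, adjust them to whatever resources you want Argo CD to stop watching:

  resource.exclusions: |
    # Stop watching VPA objects entirely, so their frequent checkpoint/status
    # updates can no longer trigger app refreshes.
    - apiGroups:
      - autoscaling.k8s.io
      kinds:
      - VerticalPodAutoscaler
      - VerticalPodAutoscalerCheckpoint
      clusters:
      - "*"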

ptr1120 commented 4 months ago

Same problem for me with Argo CD and Istio, which maintains several ConfigMaps with control-plane.alpha.kubernetes.io/leader settings. The ignoreResourceUpdates definition from https://github.com/argoproj/argo-cd/issues/15594#issue-1905156183 worked for me only after removing all orphanedResources from my projects and restarting the application-controller.

So why is orphanedResources interfering with these ignoreResourceUpdates definitions?

diranged commented 3 months ago

So this is really interesting to me: this issue is really old and still happening. We just noticed that even though we have the /status field ignored on all of our resources, we still see, every few seconds, a HorizontalPodAutoscaler object update trigger a reconciliation:

  resource.customizations.ignoreResourceUpdates.all: |
    jsonPointers:
      - /status

time="2024-06-14T15:42:35Z" level=debug msg="Requesting app refresh caused by object update" api-version=autoscaling/v2 application=argocd-system/... cluster-name= fields.level=0 kind=HorizontalPodAutoscaler name=.... namespace=otel server="https://kubernetes.default.svc"

Using kubectl-grep, we watch the HPA object, and the diffs are all in the supposedly ignored fields:

  apiVersion: "autoscaling/v2"
  kind: "HorizontalPodAutoscaler"
  metadata:
    creationTimestamp: "2024-06-03T02:55:49Z"
    name: "otel-collector-metrics-processor-collector"
    namespace: "otel"
    ownerReferences:
      -
        apiVersion: "opentelemetry.io/v1beta1"
        blockOwnerDeletion: true
        controller: true
        kind: "OpenTelemetryCollector"
        name: "...."
        uid: "f131b749-c70a-4fc9-a4e2-21aea2023410"
-   resourceVersion: "221899671"
+   resourceVersion: "221900017"
    uid: "a5432460-837e-4a89-85dd-1177034cf993"
  spec:
...
  status:
    conditions:
      -
        lastTransitionTime: "2024-06-03T02:56:04Z"
        message: "recommended size matches current size"
        reason: "ReadyForNewScale"
        status: "True"
        type: "AbleToScale"
      -
        lastTransitionTime: "2024-06-13T02:17:45Z"
        message: "the desired replica count is less than the minimum replica count"
        reason: "TooFewReplicas"
        status: "True"
        type: "ScalingLimited"
      -
        lastTransitionTime: "2024-06-11T08:44:44Z"
-       message: "the HPA was able to successfully calculate a replica count from memory resource utilization (percentage of request)"
+       message: "the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)"
        reason: "ValidMetricFound"
        status: "True"
        type: "ScalingActive"
    currentMetrics:
      -
        resource:
          current:
-           averageUtilization: 18
+           averageUtilization: 21
-           averageValue: "376m"
+           averageValue: "425m"
          name: "cpu"
        type: "Resource"
      -
        resource:
          current:
-           averageUtilization: 12
+           averageUtilization: 11
-           averageValue: "394474837333m"
+           averageValue: "367874048"
          name: "memory"
        type: "Resource"
    currentReplicas: 3
    desiredReplicas: 3
    lastScaleTime: "2024-06-09T23:33:28Z"

The above update should not be triggering a reconciliation, because it only updates the /status and /metadata/resourceVersion fields. Our configuration explicitly ignores /status, and according to the docs the other field should be ignored too:

By default, the metadata fields generation, resourceVersion and managedFields are always ignored for all resources.

diranged commented 3 months ago

Following up on this: we see the same update behavior for all DaemonSets. Any time a new pod is started, the /status field is updated. These updates should be ignored, but they aren't, and they trigger the app to be refreshed.

ronaknnathani commented 3 months ago

Looking at the code and from some experiments, it seems that this configuration only works for objects that are directly managed by Argo CD (applied to the cluster from the manifest). It doesn't work for objects that are in the resource tree but not directly tracked by Argo CD.

phyzical commented 3 months ago

One alternate thought I've had while trying to debug why some of our Helm releases get stuck in this issue: we could add some sort of argocd.argoproj.io/skip-reconcile-time: '300' annotation.

In theory this would be a number set on each Application, and if less time than this has elapsed since the last refresh, the refresh would simply be skipped.

i.e. argocd.argoproj.io/skip-reconcile-time: '300' would result in only one refresh every 5 minutes, no matter what.

I suppose the only exception we may want is that manual refreshes would always run.

This would be better than simply marking an application as skipped entirely, as it would at least keep some sort of status/progression while not hamstringing the application server. A sketch of the idea is below.
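
A sketch of what this hypothetical annotation could look like on an Application; this is purely the proposal above, not an existing Argo CD feature, and all names here are illustrative:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app            # illustrative name
  namespace: argocd
  annotations:
    # Hypothetical, as proposed above: skip controller-triggered refreshes if the
    # last refresh happened less than 300 seconds ago; manual refreshes still run.
    argocd.argoproj.io/skip-reconcile-time: '300'
spec:
  project: default
  source:
    repoURL: https://example.com/example.git   # illustrative repo
    targetRevision: HEAD
    path: .
  destination:
    server: https://kubernetes.default.svc
    namespace: default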

santinoncs commented 2 months ago

@diranged this is happening to me also. I have a few fields ignored, but the resources still trigger syncs.

For instance

  resource.customizations.ignoreResourceUpdates.keda.sh_ScaledObject: |
    jsonPointers:
    - /metadata/resourceVersion
    - /spec/triggers
    - /spec/cooldownPeriod
    - /spec/pollingInterval
    - /status/lastActiveTime

Not sure what to do next. I am using version 2.10.3.

Red-M commented 2 weeks ago

This not applying to orphaned resources seems like an oversight. I'm on v2.12.3 and this is still happening. The current v2.13.0-rc1 isn't a solution either, because I shouldn't have to tell Argo CD about every single resource that it shouldn't be looking at.

agaudreault commented 2 weeks ago

As mentioned previously, this issue seemed to be caused by Orphan Resources being enabled, not a bug in ignoreResourceUpdates. As documented, the initial feature did not ignore updates for untracked resources, hence not for orphan resources.

With 2.13, you can configure https://argo-cd.readthedocs.io/en/latest/operator-manual/reconcile/#ignoring-updates-for-untracked-resources to specify that a resource should be evaluated against the configured ignoreResourceUpdates, even if it is untracked. Note that this evaluation does consume some CPU, although significantly less than a reconcile. But that is ultimately the cost of monitoring orphaned resources in namespaces known to have resources not managed by Argo CD.
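
For reference, a minimal sketch of how this opt-in could look on one of the high-churn untracked resources mentioned in this thread; the annotation name is the one described in the linked docs, so verify it against your Argo CD version:

apiVersion: v1
kind: ConfigMap
metadata:
  name: elastic-operator-leader        # example high-churn orphan resource from this thread
  namespace: elastic-operator
  annotations:
    # Opt this untracked resource into the configured ignoreResourceUpdates
    # evaluation (Argo CD 2.13+), so its constant leader-election updates no
    # longer refresh every Application targeting this namespace.
    argocd.argoproj.io/ignore-resource-updates: "true"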

@Red-M The annotation needs to be added as opt-in metadata, because if ignoreResourceUpdates were evaluated on every resource of every Kubernetes watch event, then Argo CD's CPU usage would be higher than the CPU saved by not reconciling. This is why the initial implementation only evaluates ignoreResourceUpdates for tracked resources.

Without this, the orphaned-resources behavior is to reconcile every Application whose target namespace equals the orphan resource's namespace, whenever the orphan resource changes. This means that if you have orphaned resources enabled, 10 Applications that synchronize into kube-system, and 1 resource not managed by any Argo CD Application that constantly changes, your 10 Applications will constantly reconcile.

The best options are to configure the annotation on orphan resources known to have high churn, to exclude that resource kind directly with resource.exclusions, or to not use orphaned-resource monitoring in namespaces that are expected to contain resources not managed by Argo.