argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

OutOfSync when deployment has topologySpreadConstraints and resource limits #16400

Closed: imeliran closed this issue 20 hours ago

imeliran commented 12 months ago


Describe the bug: We noticed that Deployments that are part of an Application become OutOfSync if they have topologySpreadConstraints in their pod spec. The diff pane then shows an unrelated diff on the resource limits (see attached screenshot).

To Reproduce: 1) Create an Argo CD Application that deploys the manifest below (a sketch of such an Application follows the manifest):

apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: devops
  name: petclinic-demo-app-1
  labels:
    app: petclinic-demo-app-1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: petclinic-demo-app-1
  template:
    metadata:
      labels:
        app: petclinic-demo-app-1
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          minDomains: 1
          topologyKey: "kubernetes.io/hostname"
          whenUnsatisfiable: "DoNotSchedule"
          labelSelector:
            matchLabels:
              app: petclinic-demo-app-1
          matchLabelKeys:
            - pod-template-hash
          nodeAffinityPolicy: Honor
          nodeTaintsPolicy: Ignore
      containers:
        - name: app
          image: jbrisbin/spring-petclinic
          env:
          - name: "_JAVA_OPTIONS"
            value: "-Xmx1G -Xms1G -XX:MaxDirectMemorySize=64m"
          resources:
            limits:
              cpu: 1000m
              memory: 1Gi
            requests:
              cpu: 1000m
              memory: 1Gi
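
For completeness, an Application pointing at that manifest could look roughly like the sketch below; the name, repoURL, and path are hypothetical placeholders and should be adjusted to wherever the Deployment above is stored:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: petclinic-demo-app-1   # hypothetical Application name
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc
    namespace: devops
  source:
    repoURL: https://example.com/your-org/your-repo.git   # hypothetical repo holding the Deployment above
    targetRevision: HEAD
    path: apps/petclinic-demo                             # hypothetical path
  syncPolicy:
    automated:
      prune: false
      selfHeal: false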

You should see the app become OutOfSync with a diff on the resource limits. 2) Remove the topologySpreadConstraints block from the Deployment:

      topologySpreadConstraints:
        - maxSkew: 1
          minDomains: 1
          topologyKey: "kubernetes.io/hostname"
          whenUnsatisfiable: "DoNotSchedule"
          labelSelector:
            matchLabels:
              app: petclinic-demo-app-1
          matchLabelKeys:
            - pod-template-hash
          nodeAffinityPolicy: Honor
          nodeTaintsPolicy: Ignore

You should see that the app is no longer OutOfSync.

Expected behavior: The app should be reported as synced with respect to resource limits, regardless of whether topologySpreadConstraints is present.


Version

v2.9.1+58b04e5

ashutosh16 commented 10 months ago

I have seen a similar issue in one environment where the resource limits show a diff; however, in another environment with identical configuration there is no difference. The spec doesn't contain any extra fields like topologySpreadConstraints:

resources:
  limits:
    cpu: '4'
    memory: 2Gi
  requests:
    cpu: 500m
    memory: 2Gi

[screenshot]

rwong2888 commented 10 months ago

I am getting the same issue.

If I remove matchLabelKeys from topologySpreadConstraints, it reverts to the regular behaviour.

matchLabelKeys:
- pod-template-hash
tooptoop4 commented 9 months ago

@imeliran @ashutosh16 @rwong2888 What Kubernetes version are you using? I wonder if this relates to https://github.com/argoproj/argo-cd/issues/15176

rwong2888 commented 9 months ago

I am on 1.27, @tooptoop4. I suspect the same. I was discussing it with @crenshaw-dev, and it may get resolved in 2.11.

tianqi-z commented 9 months ago

I was on Kubernetes 1.27; both matchLabelKeys and nodeTaintsPolicy cause OutOfSync. The diff is about empty or default fields like nodeSelector: {}, tolerations: [], hostNetwork: false, etc. Argo version 2.7.7. cc @rwong2888 @crenshaw-dev

rob-whittle commented 9 months ago

(quotes @ashutosh16's comment above)

We are also seeing this issue on one of our Kubernetes clusters, though it is working fine on other environments with identical configuration.

[screenshot]
adam-harries-hub commented 8 months ago

My team is also hitting this bug when including topologySpreadConstraints in the spec and using whole CPU values for requests/limits.

henryzhao95 commented 8 months ago

Also having this issue on v2.9.5+f943664. The sync was happy when the desired manifest had resources.limits.cpu: 1000m and the live manifest had resources.limits.cpu: '1', but something made it unhappy; not sure if it was the addition of nodeTaintsPolicy to our topology spread constraints.

emmahsax commented 7 months ago

Same issue on 2.7.9. The issue is with any field that's still in beta, so that includes nodeAffinityPolicy, nodeTaintsPolicy, and matchLabelKeys.

Hopefully they resolve this soon!

pankajkumar911 commented 4 months ago

Looks like the same issue persists on Kubernetes 1.29 and Argo CD v2.9.6. The matchLabelKeys field is still in beta according to the feature gates page.

[screenshot]

easterbrennan commented 3 months ago

Hitting a similar issue here with k8s 1.28 and argo v2.7.14.

Initially hit a problem with setting InitialDelaySeconds: 0 on probes, and now hitting similar resource diffs.

[screenshot]

RLProteus commented 3 months ago

Can confirm I'm seeing this as well when adding topologySpreadConstraints to a deployment. Seeing both the InitialDelaySeconds: 0 issue as well as resource request/limits marshaling failures on values that worked previously.

ArgoCD v2.9.6 and K8s 1.28

SavaMihai commented 2 months ago

Hello, I'm having the same issue: when using topologySpreadConstraints, all the apps go OutOfSync.

ashinsabu3 commented 2 months ago

FYI, I saw this fixed when I upgraded to 2.10.x, though I couldn't pinpoint the commit that fixed it. Maybe some library upgrade did.

benniwiit commented 1 month ago

Hi, we are facing the same issue in v2.11.4+e1284e1. We need to calculate the CPU limit based on the request, so we need an integer value. Argo CD then always shows OutOfSync: [screenshot]

jinleileiking commented 1 month ago

Workaround:

[screenshot]
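
The workaround itself is only shown in the screenshot, which isn't reproduced here. For readers looking for something along those lines, one commonly used (if blunt) option in Argo CD is an ignoreDifferences entry on the Application; the sketch below is an illustration of that general approach, not necessarily what the screenshot shows:

# Excerpt from an Application spec: ignore diffs on container resources
# for all Deployments managed by this Application.
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jqPathExpressions:
        - .spec.template.spec.containers[].resources

Note that this hides genuine resource changes from the diff as well, so it trades diff accuracy for a quiet sync status.
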
andrii-korotkov-verkada commented 20 hours ago

You can configure known types following https://argo-cd.readthedocs.io/en/stable/user-guide/diffing/#known-kubernetes-types-in-crds-resource-limits-volume-mounts-etc to deal with this. I'll close this, but feel free to re-open if this is not enough.
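
For reference, the linked page documents entries of roughly this shape in the argocd-cm ConfigMap (the Rollout example follows the docs; the group/kind key and field path would need to be adapted to the resource in question). It tells Argo CD to normalize the named field as a known Kubernetes type, so that equivalent quantities such as cpu: '1' and cpu: 1000m no longer show up as a diff:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # Treat spec.template.spec of Argo Rollouts as a core/v1 PodSpec when diffing.
  resource.customizations.knownTypeFields.argoproj.io_Rollout: |
    - field: spec.template.spec
      type: core/v1/PodSpec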