argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0
17.65k stars 5.38k forks source link

Inconsistent StatefulSet patching when container names change after disabling ServerSideApply and ServerSideDiff #19994

Open cwstrommer opened 3 weeks ago

cwstrommer commented 3 weeks ago

Checklist:

Describe the bug

Using client-side apply/diff on a StatefulSet whose single container changes name from one revision to the next, ArgoCD seems to patch the StatefulSet to contain two containers: the one with the old spec, and the one with the new spec. This causes CrashLoops on the StatefulSet's pods, as the pod specs are either broken or conflicting.

Notably, I can only reproduce this when the application was previously using both ServerSideApply and ServerSideDiff during an earlier revision.

To Reproduce

Tested with the VictoriaMetrics Agent chart https://github.com/VictoriaMetrics/helm-charts/tree/master/charts/victoria-metrics-agent.

Application spec:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  annotations:
    argocd.argoproj.io/compare-options: IncludeMutationWebhook=false,ServerSideDiff=true
  finalizers:
  - resources-finalizer.argocd.argoproj.io
  name: victoria-metrics-agent
  namespace: argocd
spec:
  destination:
    namespace: victoria-metrics-agent
    server: https://kubernetes.default.svc
  ignoreDifferences:
  - group: apps
    jsonPointers:
    - /spec/replicas
    kind: Deployment   
  - group: apps
    jqPathExpressions:
    - .spec.volumeClaimTemplates[]?
    - .spec.volumeClaimTemplates[].spec.resources.requests.storage
    - .spec.volumeClaimTemplates[].metadata
    kind: StatefulSet
  - group: '*'   
    jsonPointers:
    - /metadata/managedFields
    kind: '*'  
  - jqPathExpressions:
    - .metadata.annotations | select(has("cloud.google.com"))
    kind: Service   
  - group: autoscaling
    jqPathExpressions:
    - .status
    kind: HorizontalPodAutoscaler
  project: victoria-metrics-agent
  source:
    helm:
      releaseName: victoria-metrics-agent
      valueFiles:                                                                                         
      - values.yaml
      values: |
        victoria-metrics-agent:
          remoteWriteUrls:
            - "http://victoria-metrics-proxy.victoria-metrics-proxy.svc.cluster.local:8429/api/v1/write"
    path: victoria-metrics-agent-rc
    repoURL: <redacted>
    targetRevision: debug/vmagent-sts
  syncPolicy:
    automated:
      selfHeal: true
    retry:    
      backoff:
        duration: 5s
        factor: 2          
        maxDuration: 3m
      limit: 5
    syncOptions:
    - CreateNamespace=false
    - ServerSideApply=true

Chart.yaml

apiVersion: v2
name: victoria-metrics-agent
version: 1.0.1

dependencies:
  - name: victoria-metrics-agent
    version: "0.10.14"
    repository: https://victoriametrics.github.io/helm-charts/

values.yaml

---

victoria-metrics-agent:
  resources:
  deployment:
    enabled: false

  statefulset:
    enabled: true

  persistence:
    enabled: false
    size: 1Gi

  replicaCount: 2

  service:
    enabled: true

  config:
    scrape_configs: []

  rbac:
    create: false
    pspEnabled: false

  podAnnotations:
    relabelConfig.revision: v4
    prometheus.io/scrape: "true"
    prometheus.io/port: "8429"

  affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                    - victoria-metrics-agent
            topologyKey: kubernetes.io/hostname

Deploy the ArgoCD application, then upgrade the dependency in Chart.yaml to version 0.12.2. This will put the application into an unknown sync state, with:

ComparisonError  Failed to compare desired state to live state: failed to calculate diff: error calculating server side diff: serverSideDiff error: error removing non config mutations for resource StatefulSet/victoria-metrics-agent: error reverting webhook removed fields in predicted live resource: .spec.template.spec.containers: element 0: associative list with keys may not have a null element  2024-09-18 19:26:40 -0400 EDT

Note: the above error appears only when both ServerSideApply and ServerSideDiff are enabled. If the application is created with ServerSideApply=false and ServerSideDiff=true, then there is no ComparisonError.

Disable both ServerSideDiff and ServerSideApply, and when the new version of the StatefulSet is created after a new sync, the new pods have two containers instead of one:

kubectl get all -n victoria-metrics-agent
NAME                            READY   STATUS    RESTARTS        AGE
pod/victoria-metrics-agent-0   1/1     Running   0               12m
pod/victoria-metrics-agent-1   1/2     Error     7 (5m19s ago)   11m

NAME                              TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/victoria-metrics-agent   ClusterIP   10.18.134.29   <none>        8429/TCP   64m

NAME                                       READY   AGE
statefulset.apps/victoria-metrics-agent   1/2     22m

Inspecting the StatefulSet shows two containers specified in the pod template, one with name vmagent (new name, expected), and one with name victoria-metrics-agent (old name, should no longer exist).

It is also possible to reproduce this by first disabling ServerSideApply and ServerSideDiff, and then changing the dependency version.

Expected behavior

ArgoCD removes parts of the spec that don't exist.

Version

argocd: v2.6.3+e05298b
  BuildDate: 2023-02-27T15:02:57Z
  GitCommit: e05298b9c6ab8610104271fa8491f019fee3c587
  GitTreeState: clean
  GoVersion: go1.18.10
  Compiler: gc
  Platform: linux/amd64
argocd-server: v2.12.3+6b9cd82
  BuildDate: 2024-08-27T11:57:48Z
  GitCommit: 6b9cd828c6e9807398869ad5ac44efd2c28422d6
  GitTreeState: clean
  GoVersion: go1.22.4
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v5.4.2 2024-05-22T15:19:38Z
  Helm Version: v3.15.2+g1a500d5
  Kubectl Version: v0.29.6
  Jsonnet Version: v0.20.0
cwstrommer commented 1 week ago

Note: I just ran into the same issue with a deployment. Same conditions: changed from server-side diff/apply to client-side, container name changes. The only difference is that this time both containers were up to spec, and the only reason we caught it is that the pods were unschedulable because each container was requesting more than half a node's available resources.