VMCluster reports app healthy and synced on argocd when the status says otherwise

nonproto commented 4 weeks ago

Describe the bug

Deploy the VictoriaMetrics Operator and a VMCluster whose vmselect runs as a StatefulSet. Make a change to the StatefulSet declaration that is not allowed, then sync.

Argo shows the app as healthy, but the VMCluster status shows:

status.clusterStatus: failed to create or update vmcluster cannot perform update on sts

To Reproduce

Don't have a config available

Version

VMOperator v0.33.4 App Version from chart is v0.46.4

Logs

No response

Screenshots

No response

Used command-line flags

No response

Additional information

No response

zekker6 commented 3 weeks ago

Moving this issue to operator repository as it seems like an operator issue.

f41gh7 commented 3 weeks ago

Hello, could you please check actual operator logs for the full error text?

nonproto commented 3 weeks ago

Here is the sample vmselect section of the VMCluster:

vmselect:
    replicaCount: 1
    revisionHistoryLimitCount: 5
    cacheMountPath: "/select-cache"
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: my-storage-class
          resources:
            requests:
              storage: 1Gi

I deploy the cluster and everything is up and healthy. Then I change revisionHistoryLimitCount from 5 to 3.

ArgoCD shows the VictoriaMetrics App is healthy and synced.

The VMCluster is stuck in the expanding status and reports this error:

  Normal   ReconcileEvent      <unknown>  victoria-metrics-operator  starting object update
  Warning  ReconcilationError  <unknown>  victoria-metrics-operator  failed create or update vmcluster: cannot perform update on sts: vmselect-victoria-metrics, err: StatefulSet.apps "vmselect-victoria-metrics" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy', 'persistentVolumeClaimRetentionPolicy' and 'minReadySeconds' are forbidden

The operator logs show:

victoria-metrics-controller-vm-7b54fd596b-66jn4 vm {"level":"error","ts":"2024-08-16T12:36:39Z","logger":"manager","msg":"Reconciler error","controller":"vmcluster","controllerGroup":"operator.victoriametrics.com","controllerKind":"VMCluster","VMCluster":{"name":"victoria-metrics","namespace":"monitoring"},"namespace":"monitoring","name":"victoria-metrics","reconcileID":"14c5cb26-246f-401d-b6ac-730da082cc18","error":"failed create or update vmcluster: cannot perform update on sts: vmselect-victoria-metrics, err: StatefulSet.apps \"vmselect-victoria-metrics\" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy', 'persistentVolumeClaimRetentionPolicy' and 'minReadySeconds' are forbidden","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:222"}

f41gh7 commented 3 weeks ago

It's a bug in the operator: changes to revisionHistoryLimitCount are not handled properly. When this field changes, the operator must delete and recreate the StatefulSet. PTAL @Haleygo

The current workaround is to run this command manually:

kubectl delete statefulset vmselect-victoria-metrics --cascade=false

This removes only the StatefulSet and keeps the pods running. The operator recreates the StatefulSet on the next reconcile loop.
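
The API error above lists the only StatefulSet spec fields that could be updated in place on this cluster. A minimal Python sketch of the resulting decision (update in place versus delete and recreate) is shown here. It only illustrates the rule from the error message and is not the operator's actual code; the function names and sample specs are hypothetical, and the set of allowed fields may differ between Kubernetes versions.

```python
from __future__ import annotations

# Fields the API server accepted for in-place updates, copied from the error text.
MUTABLE_STS_FIELDS = {
    "replicas",
    "template",
    "updateStrategy",
    "persistentVolumeClaimRetentionPolicy",
    "minReadySeconds",
}


def immutable_changes(old_spec: dict, new_spec: dict) -> list[str]:
    """Return the sorted top-level spec fields that differ and cannot be updated in place."""
    changed = {
        key
        for key in old_spec.keys() | new_spec.keys()
        if old_spec.get(key) != new_spec.get(key)
    }
    return sorted(changed - MUTABLE_STS_FIELDS)


def needs_recreate(old_spec: dict, new_spec: dict) -> bool:
    """True when an update would be rejected, so the StatefulSet must be recreated."""
    return bool(immutable_changes(old_spec, new_spec))


# Hypothetical specs: revisionHistoryLimit is the StatefulSet field that
# revisionHistoryLimitCount in the VMCluster maps to.
old = {"replicas": 1, "revisionHistoryLimit": 5, "serviceName": "vmselect"}
new = {"replicas": 2, "revisionHistoryLimit": 3, "serviceName": "vmselect"}
print(immutable_changes(old, new))
print(needs_recreate(old, new))
```

In this sketch the replicas change alone would pass, while the revisionHistoryLimit change forces the delete-and-recreate path described above.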

Argo CD lets you configure custom health checks - https://argo-cd.readthedocs.io/en/stable/operator-manual/health/#custom-health-checks

For VMCluster, the health check should test obj.status.status == "operational"

nonproto commented 3 weeks ago

Yea that's what I did to get by it. I was more concerned that the operator didn't report to argocd that it was unhealthy

f41gh7 commented 2 weeks ago

The issue with revisionHistoryLimitCount was fixed in the https://github.com/VictoriaMetrics/operator/releases/tag/v0.47.0 release

For Argo CD health checks, it's better to use a custom health check Lua script for all VM CRD components:

data:
  resource.customizations: |
    operator.victoriametrics.com/*:
      health.lua: |
        -- Default until the operator reports a status.
        hs = {}
        hs.status = "Progressing"
        hs.message = "Waiting for reconcile"
        if obj.status ~= nil then
          if obj.status.status ~= nil then
            if obj.status.status == "operational" then
              hs.status = "Healthy"
              hs.message = ""
              return hs
            end
            if obj.status.status == "expanding" then
              return hs
            end
            if obj.status.status == "failed" then
              hs.status = "Degraded"
              if obj.status.reason ~= nil then
                hs.message = obj.status.reason
              end
              -- lastSyncError, when present, overrides reason.
              if obj.status.lastSyncError ~= nil then
                hs.message = obj.status.lastSyncError
              end
              return hs
            end
          end
        end
        return hs
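
The status-to-health mapping in the script above can be checked outside Argo CD before deploying it. Below is a rough Python mirror of the same branches, assuming the status fields used in the script (status.status, status.reason, status.lastSyncError); the function name vm_health and the sample objects are hypothetical.

```python
def vm_health(obj: dict) -> dict:
    """Mirror of the Lua health check: map a VM resource to an Argo CD health result."""
    hs = {"status": "Progressing", "message": "Waiting for reconcile"}
    status = obj.get("status") or {}
    state = status.get("status")

    if state == "operational":
        return {"status": "Healthy", "message": ""}

    if state == "failed":
        hs["status"] = "Degraded"
        # Same order as the Lua script: lastSyncError overrides reason.
        for field in ("reason", "lastSyncError"):
            if status.get(field) is not None:
                hs["message"] = status[field]
        return hs

    # "expanding", a missing status, or an unknown state all stay Progressing.
    return hs


# Hypothetical sample objects.
print(vm_health({}))
print(vm_health({"status": {"status": "operational"}}))
print(vm_health({"status": {"status": "failed", "lastSyncError": "cannot perform update on sts"}}))
```

A resource stuck in a failed state, as in the original report, would surface as Degraded with the sync error as the message instead of Healthy.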

passie commented 2 weeks ago

After upgrading to the latest version 0.25.5 with operator 0.47.2, ArgoCD is stuck on all StatefulSets. After the pods are created, they are automatically deleted and ArgoCD hangs. This happens even after running kubectl delete sts vmselect-vm --cascade=false

Argocd message:

 Pending deletion

Event logs filtered on VM

0m         Normal    SuccessfulCreate         statefulset/vmstorage-vm        create Pod vmstorage-vm-0 in StatefulSet vmstorage-vm successful
30m         Normal    Started                  pod/vmstorage-vm-1              Started container vmstorage
30m         Normal    Pulled                   pod/vmstorage-vm-1              Container image "repo.example.com/victoriametrics/vmstorage:v1.102.1-cluster" already present on machine
30m         Normal    Created                  pod/vmstorage-vm-1              Created container vmstorage
19m         Normal    Killing                  pod/vmstorage-vm-1              Stopping container vmstorage
18m         Normal    InjectionSkipped         statefulset/vmstorage-vm        Linkerd sidecar proxy injection skipped: neither the namespace nor the pod have the annotation "linkerd.io/inject:enabled"
18m         Warning   Unhealthy                pod/vmstorage-vm-1              Readiness probe failed: Get "http://192.168.4.135:8482/health": dial tcp 192.168.4.135:8482: connect: connection refused
18m         Normal    SuccessfulCreate         statefulset/vmstorage-vm        create Pod vmstorage-vm-0 in StatefulSet vmstorage-vm successful
18m         Normal    Scheduled                pod/vmstorage-vm-1              Successfully assigned victoria-metrics/vmstorage-vm-1 to shared-services-acc-shared-services-acc-zm4lp-55dcfd8d9d-vnbnr
18m         Normal    SuccessfulCreate         statefulset/vmstorage-vm        create Pod vmstorage-vm-1 in StatefulSet vmstorage-vm successful
18m         Normal    SuccessfulAttachVolume   pod/vmstorage-vm-1              AttachVolume.Attach succeeded for volume "pvc-920f4574-fa4e-46a3-a7d6-8d007b437b95"
18m         Normal    Pulled                   pod/vmstorage-vm-1              Container image "repo.example.com/victoriametrics/vmstorage:v1.102.1-cluster" already present on machine
18m         Normal    Created                  pod/vmstorage-vm-1              Created container vmstorage
5m27s       Warning   Unhealthy                pod/vmstorage-vm-1              Readiness probe failed: Get "http://192.168.4.137:8482/health": dial tcp 192.168.4.137:8482: connect: connection refused
18m         Normal    Started                  pod/vmstorage-vm-1              Started container vmstorage
5m50s       Normal    Killing                  pod/vmstorage-vm-1              Stopping container vmstorage
5m29s       Normal    InjectionSkipped         statefulset/vmstorage-vm        Linkerd sidecar proxy injection skipped: neither the namespace nor the pod have the annotation "linkerd.io/inject:enabled"
5m9s        Normal    InjectionSkipped         statefulset/vmstorage-vm        Linkerd sidecar proxy injection skipped: neither the namespace nor the pod have the annotation "linkerd.io/inject:enabled"
5m9s        Normal    Scheduled                pod/vmstorage-vm-1              Successfully assigned victoria-metrics/vmstorage-vm-1 to shared-services-acc-shared-services-acc-zm4lp-55dcfd8d9d-vnbnr
5m9s        Normal    SuccessfulCreate         statefulset/vmstorage-vm        create Pod vmstorage-vm-0 in StatefulSet vmstorage-vm successful
5m9s        Normal    SuccessfulCreate         statefulset/vmstorage-vm        create Pod vmstorage-vm-1 in StatefulSet vmstorage-vm successful
5m8s        Normal    SuccessfulAttachVolume   pod/vmstorage-vm-1              AttachVolume.Attach succeeded for volume "pvc-920f4574-fa4e-46a3-a7d6-8d007b437b95"
5m          Normal    Pulled                   pod/vmstorage-vm-1              Container image "repo.example.com/victoriametrics/vmstorage:v1.102.1-cluster" already present on machine
5m          Normal    Created                  pod/vmstorage-vm-1              Created container vmstorage
5m          Normal    Started                  pod/vmstorage-vm-1              Started container vmstorage
4m59s       Warning   Unhealthy                pod/vmstorage-vm-1              Readiness probe failed: Get "http://192.168.4.139:8482/health": dial tcp 192.168.4.139:8482: connect: connection refused
0s          Warning   ReconcilationError       vmalertmanager/vm               deletionTimestamp is not zero="2024-08-28 11:18:24 +0000 UTC" for object=victoria-metrics/vmalertmanager-vm kind=apps/v1, Kind=StatefulSet, recreating it at next reconcile loop. Warning never delete object manually
1s          Warning   ReconcilationError       vmalertmanager/vm               deletionTimestamp is not zero="2024-08-28 11:18:24 +0000 UTC" for object=victoria-metrics/vmalertmanager-vm kind=apps/v1, Kind=StatefulSet, recreating it at next reconcile loop. Warning never delete object manually
0s          Warning   ReconcilationError       vmalertmanager/vm               deletionTimestamp is not zero="2024-08-28 11:18:24 +0000 UTC" for object=victoria-metrics/vmalertmanager-vm kind=apps/v1, Kind=StatefulSet, recreating it at next reconcile loop. Warning never delete object manually
61m         Normal    InjectionSkipped         deployment/vmagent-vm           Linkerd sidecar proxy injection skipped: neither the namespace nor the pod have the annotation "linkerd.io/inject:enabled"

Would it be possible to reopen this issue?

f41gh7 commented 2 weeks ago

Could you please share statefulset definition for vmalertmanager/vmstorage?

Most probably the issue is related to the Kubernetes version and the default values for statefulset.spec.

f41gh7 commented 2 weeks ago

Looks like the recent change to revisionHistoryLimitCount handling brings more harm than good, especially since Kubernetes doesn't even use this field.

f41gh7 commented 2 weeks ago

The issue exists for Kubernetes versions < 1.27.

We're going to create a patch release today.

f41gh7 commented 2 weeks ago

The issue should be fixed in the v0.47.3 release

passie commented 1 week ago

Maybe I missed it. When will the operator be released together with the helm chart? The latest operator update I could find in the changelog is 0.47.2.
