Moving this issue to operator repository as it seems like an operator issue.
Hello, could you please check actual operator logs for the full error text?
Here is the sample vmselect portion under the vmcluster:
vmselect:
  replicaCount: 1
  revisionHistoryLimitCount: 5
  cacheMountPath: "/select-cache"
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: my-storage-class
        resources:
          requests:
            storage: 1Gi
I deploy the cluster, everything is up and healthy. I change the revisionHistoryLimitCount from 5 -> 3
ArgoCD shows the VictoriaMetrics App is healthy and synced.
The vmcluster is stuck in an expanding status and has the error:
Normal ReconcileEvent <unknown> victoria-metrics-operator starting object update
Warning ReconcilationError <unknown> victoria-metrics-operator failed create or update vmcluster: cannot perform update on sts: vmselect-victoria-metrics, err: StatefulSet.apps "vmselect-victoria-metrics" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy', 'persistentVolumeClaimRetentionPolicy' and 'minReadySeconds' are forbidden
The operator log shows:
victoria-metrics-controller-vm-7b54fd596b-66jn4 vm {"level":"error","ts":"2024-08-16T12:36:39Z","logger":"manager","msg":"Reconciler error","controller":"vmcluster","controllerGroup":"operator.victoriametrics.com","controllerKind":"VMCluster","VMCluster":{"name":"victoria-metrics","namespace":"monitoring"},"namespace":"monitoring","name":"victoria-metrics","reconcileID":"14c5cb26-246f-401d-b6ac-730da082cc18","error":"failed create or update vmcluster: cannot perform update on sts: vmselect-victoria-metrics, err: StatefulSet.apps \"vmselect-victoria-metrics\" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy', 'persistentVolumeClaimRetentionPolicy' and 'minReadySeconds' are forbidden","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:222"}
It's an operator bug: changes to revisionHistoryLimitCount are not properly handled. In case of any change to it, the operator must perform a statefulset delete operation. PTAL @Haleygo
The current workaround is to run this command manually:
kubectl delete statefulset vmselect-victoria-metrics --cascade=false
It should remove only the statefulset and keep the pods running. The operator will recreate the statefulset at the next reconcile loop.
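The rule behind this workaround can be sketched in a few lines. This is only an illustration, not the operator's actual code: the set of mutable fields is copied from the error message quoted above (other Kubernetes versions may allow a different set), the helper names `changed_fields` and `needs_recreate` are hypothetical, and it assumes that `revisionHistoryLimitCount` in the VMCluster maps to `revisionHistoryLimit` in the StatefulSet spec.

```python
# Fields the API server accepted changes to, copied from the error message
# quoted in this thread. Other Kubernetes versions may allow a different set.
MUTABLE_STS_SPEC_FIELDS = frozenset({
    "replicas",
    "template",
    "updateStrategy",
    "persistentVolumeClaimRetentionPolicy",
    "minReadySeconds",
})


def changed_fields(old_spec: dict, new_spec: dict) -> set:
    """Return the top-level spec keys whose values differ."""
    keys = set(old_spec) | set(new_spec)
    return {k for k in keys if old_spec.get(k) != new_spec.get(k)}


def needs_recreate(old_spec: dict, new_spec: dict) -> bool:
    """True if the change touches a field an in-place update would reject."""
    return bool(changed_fields(old_spec, new_spec) - MUTABLE_STS_SPEC_FIELDS)


# Hypothetical specs reduced to the two fields relevant here.
old = {"replicas": 1, "revisionHistoryLimit": 5}
print(needs_recreate(old, {"replicas": 1, "revisionHistoryLimit": 3}))  # True
print(needs_recreate(old, {"replicas": 2, "revisionHistoryLimit": 5}))  # False
```

When such a check returns true, an in-place update is rejected, which is why the StatefulSet has to be deleted (leaving the pods in place) and recreated.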
Argo CD allows configuring custom health checks: https://argo-cd.readthedocs.io/en/stable/operator-manual/health/#custom-health-checks
With VMCluster, health should test obj.status.status == "operational"
Yea that's what I did to get by it. I was more concerned that the operator didn't report to argocd that it was unhealthy
The issue with revisionHistoryLimitCount was fixed in the https://github.com/VictoriaMetrics/operator/releases/tag/v0.47.0 release.
For Argo CD health checks, it's better to use a custom health check Lua script for all VM CRD components:
data:
  resource.customizations: |
    operator.victoriametrics.com/*:
      health.lua: |
        hs = {}
        hs.status = "Progressing"
        hs.message = "Waiting for reconcile"
        if obj.status ~= nil then
          if obj.status.status ~= nil then
            if obj.status.status == "operational" then
              hs.status = "Healthy"
              hs.message = ""
              return hs
            end
            if obj.status.status == "expanding" then
              return hs
            end
            if obj.status.status == "failed" then
              hs.status = "Degraded"
              if obj.status.reason ~= nil then
                hs.message = obj.status.reason
              end
              if obj.status.lastSyncError ~= nil then
                hs.message = obj.status.lastSyncError
              end
              return hs
            end
          end
        end
        return hs
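To sanity-check the status mapping outside Argo CD, the same decision logic can be mirrored in a small Python function. This is a sketch for local experimentation only: the function name `vm_health` and the sample objects are hypothetical, and Argo CD itself only runs the Lua script.

```python
def vm_health(obj: dict) -> dict:
    """Mirror of the Lua health check: map status.status to an Argo CD health."""
    hs = {"status": "Progressing", "message": "Waiting for reconcile"}
    status = obj.get("status") or {}
    state = status.get("status")
    if state == "operational":
        return {"status": "Healthy", "message": ""}
    if state == "failed":
        hs["status"] = "Degraded"
        # As in the Lua script, lastSyncError overrides reason when both are set.
        if status.get("reason") is not None:
            hs["message"] = status["reason"]
        if status.get("lastSyncError") is not None:
            hs["message"] = status["lastSyncError"]
    # "expanding", unknown values, and a missing status stay Progressing.
    return hs


# Hypothetical objects, reduced to the fields the check reads.
print(vm_health({"status": {"status": "operational"}}))
print(vm_health({"status": {"status": "expanding"}}))
print(vm_health({"status": {"status": "failed", "reason": "cannot perform update on sts"}}))
```

With this mapping, a VMCluster stuck in the expanding status shows as Progressing instead of Healthy, and a failed one shows as Degraded with the operator's error text.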
After upgrading to the latest version 0.25.5 with operator 0.47.2, ArgoCD is stuck on all statefulsets. After the pods are created, they are automatically deleted and ArgoCD hangs. This happens even after deleting the statefulset with kubectl delete sts vmselect-vm --cascade=false
Argocd message:
Pending deletion
Event logs filtered on VM
0m Normal SuccessfulCreate statefulset/vmstorage-vm create Pod vmstorage-vm-0 in StatefulSet vmstorage-vm successful
30m Normal Started pod/vmstorage-vm-1 Started container vmstorage
30m Normal Pulled pod/vmstorage-vm-1 Container image "repo.example.com/victoriametrics/vmstorage:v1.102.1-cluster" already present on machine
30m Normal Created pod/vmstorage-vm-1 Created container vmstorage
19m Normal Killing pod/vmstorage-vm-1 Stopping container vmstorage
18m Normal InjectionSkipped statefulset/vmstorage-vm Linkerd sidecar proxy injection skipped: neither the namespace nor the pod have the annotation "linkerd.io/inject:enabled"
18m Warning Unhealthy pod/vmstorage-vm-1 Readiness probe failed: Get "http://192.168.4.135:8482/health": dial tcp 192.168.4.135:8482: connect: connection refused
18m Normal SuccessfulCreate statefulset/vmstorage-vm create Pod vmstorage-vm-0 in StatefulSet vmstorage-vm successful
18m Normal Scheduled pod/vmstorage-vm-1 Successfully assigned victoria-metrics/vmstorage-vm-1 to shared-services-acc-shared-services-acc-zm4lp-55dcfd8d9d-vnbnr
18m Normal SuccessfulCreate statefulset/vmstorage-vm create Pod vmstorage-vm-1 in StatefulSet vmstorage-vm successful
18m Normal SuccessfulAttachVolume pod/vmstorage-vm-1 AttachVolume.Attach succeeded for volume "pvc-920f4574-fa4e-46a3-a7d6-8d007b437b95"
18m Normal Pulled pod/vmstorage-vm-1 Container image "repo.example.com/victoriametrics/vmstorage:v1.102.1-cluster" already present on machine
18m Normal Created pod/vmstorage-vm-1 Created container vmstorage
5m27s Warning Unhealthy pod/vmstorage-vm-1 Readiness probe failed: Get "http://192.168.4.137:8482/health": dial tcp 192.168.4.137:8482: connect: connection refused
18m Normal Started pod/vmstorage-vm-1 Started container vmstorage
5m50s Normal Killing pod/vmstorage-vm-1 Stopping container vmstorage
5m29s Normal InjectionSkipped statefulset/vmstorage-vm Linkerd sidecar proxy injection skipped: neither the namespace nor the pod have the annotation "linkerd.io/inject:enabled"
5m9s Normal InjectionSkipped statefulset/vmstorage-vm Linkerd sidecar proxy injection skipped: neither the namespace nor the pod have the annotation "linkerd.io/inject:enabled"
5m9s Normal Scheduled pod/vmstorage-vm-1 Successfully assigned victoria-metrics/vmstorage-vm-1 to shared-services-acc-shared-services-acc-zm4lp-55dcfd8d9d-vnbnr
5m9s Normal SuccessfulCreate statefulset/vmstorage-vm create Pod vmstorage-vm-0 in StatefulSet vmstorage-vm successful
5m9s Normal SuccessfulCreate statefulset/vmstorage-vm create Pod vmstorage-vm-1 in StatefulSet vmstorage-vm successful
5m8s Normal SuccessfulAttachVolume pod/vmstorage-vm-1 AttachVolume.Attach succeeded for volume "pvc-920f4574-fa4e-46a3-a7d6-8d007b437b95"
5m Normal Pulled pod/vmstorage-vm-1 Container image "repo.example.com/victoriametrics/vmstorage:v1.102.1-cluster" already present on machine
5m Normal Created pod/vmstorage-vm-1 Created container vmstorage
5m Normal Started pod/vmstorage-vm-1 Started container vmstorage
4m59s Warning Unhealthy pod/vmstorage-vm-1 Readiness probe failed: Get "http://192.168.4.139:8482/health": dial tcp 192.168.4.139:8482: connect: connection refused
0s Warning ReconcilationError vmalertmanager/vm deletionTimestamp is not zero="2024-08-28 11:18:24 +0000 UTC" for object=victoria-metrics/vmalertmanager-vm kind=apps/v1, Kind=StatefulSet, recreating it at next reconcile loop. Warning never delete object manually
1s Warning ReconcilationError vmalertmanager/vm deletionTimestamp is not zero="2024-08-28 11:18:24 +0000 UTC" for object=victoria-metrics/vmalertmanager-vm kind=apps/v1, Kind=StatefulSet, recreating it at next reconcile loop. Warning never delete object manually
0s Warning ReconcilationError vmalertmanager/vm deletionTimestamp is not zero="2024-08-28 11:18:24 +0000 UTC" for object=victoria-metrics/vmalertmanager-vm kind=apps/v1, Kind=StatefulSet, recreating it at next reconcile loop. Warning never delete object manually
61m Normal InjectionSkipped deployment/vmagent-vm Linkerd sidecar proxy injection skipped: neither the namespace nor the pod have the annotation "linkerd.io/inject:enabled"
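The ReconcilationError events above report a non-zero deletionTimestamp on the StatefulSet, which matches the "Pending deletion" message. A minimal sketch of that check on a manifest loaded as a dict follows; the helper name `is_pending_deletion` and the sample manifests are hypothetical, and the timestamp is the one from the events rewritten in RFC 3339 form.

```python
def is_pending_deletion(obj: dict) -> bool:
    """True if the object carries a deletionTimestamp, i.e. deletion has started."""
    metadata = obj.get("metadata") or {}
    return bool(metadata.get("deletionTimestamp"))


# Hypothetical manifests reduced to the metadata that matters here.
deleting = {
    "kind": "StatefulSet",
    "metadata": {
        "name": "vmalertmanager-vm",
        "namespace": "victoria-metrics",
        "deletionTimestamp": "2024-08-28T11:18:24Z",
    },
}
live = {"kind": "StatefulSet", "metadata": {"name": "vmstorage-vm"}}

print(is_pending_deletion(deleting))  # True
print(is_pending_deletion(live))      # False
```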
Would it be possible to reopen this issue?
Could you please share the statefulset definition for vmalertmanager/vmstorage?
Most probably the issue is related to the Kubernetes version and default values for statefulset.spec.
Looks like the recent change to revisionHistoryLimitCount brings more harm than good, especially since Kubernetes doesn't even use this field.
The issue exists for Kubernetes versions < 1.27.
We're going to create a patch release today.
The issue should be fixed in the v0.47.3 release.
Maybe I missed it. When will the operator be released together with the helm chart? The latest operator update I could find in the changelog is 0.47.2.
Describe the bug
Deploy a VictoriaMetrics Operator and a VMCluster with STS VmSelect. Make a change to the statefulset declaration that is not allowed and sync.
Argo shows the app is healthy. But the VMCluster status shows the
To Reproduce
Don't have a config available
Version
VMOperator
v0.33.4
App Version from chart is v0.46.4
Logs
No response
Screenshots
No response
Used command-line flags
No response
Additional information
No response