tman5 opened 5 months ago
Thanks. Would it be possible to attach the operator log as a file to this case? I would like to see if there is an issue with operator reconciliation. If you can access rook logs, please attach those as well.
@tman5, could you show your storage classes?
kubectl get storageclasses -o wide
And it would be useful to see one of the PVCs created by the operator.
NAME                          PROVISIONER                     RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
ceph-bucket                   rook-ceph.ceph.rook.io/bucket   Delete          Immediate           false                  208d
ceph-filesystem               rook-ceph.cephfs.csi.ceph.com   Delete          Immediate           true                   208d
rook-ceph-block (default)     rook-ceph.rbd.csi.ceph.com      Delete          Immediate           true                   208d
sc-smb-mssql-database-repos   smb.csi.k8s.io                  Retain          Immediate           false                  182d
sc-smb-mssql-deploy-scripts   smb.csi.k8s.io                  Retain          Immediate           false                  182d
sc-smb-mssql-wss              smb.csi.k8s.io                  Retain          Immediate           false                  182d
This is one of the PVCs that stays in a Terminating state indefinitely:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: rook-ceph.rbd.csi.ceph.com
    volume.kubernetes.io/storage-provisioner: rook-ceph.rbd.csi.ceph.com
  creationTimestamp: "2024-04-02T12:05:28Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2024-04-02T12:05:34Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    argocd.argoproj.io/instance: featbit-clickhouse-dev2
    clickhouse.altinity.com/app: chop
    clickhouse.altinity.com/chi: clickhouse
    clickhouse.altinity.com/cluster: replicated
    clickhouse.altinity.com/namespace: clark-developer-featbit
    clickhouse.altinity.com/object-version: 241ccf05924775f258c440aecb86eecc549bb3ce
    clickhouse.altinity.com/reclaimPolicy: Retain
    clickhouse.altinity.com/replica: "0"
    clickhouse.altinity.com/shard: "0"
  name: default-chi-clickhouse-replicated-0-0-0
  namespace: clark-developer-featbit
  resourceVersion: "298826497"
  uid: f9ea50da-82a6-47b9-9231-8a53022d5d03
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 60Gi
  storageClassName: rook-ceph-block
  volumeMode: Filesystem
  volumeName: pvc-f9ea50da-82a6-47b9-9231-8a53022d5d03
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 60Gi
  phase: Bound
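The PVC above already has a deletionTimestamp but is held by the kubernetes.io/pvc-protection finalizer, which only clears once no pod mounts the claim. A rough diagnostic sketch (the commands assume the names from this thread; forcibly stripping the finalizer is a last resort and can leave the backing PV orphaned):

```shell
# See events and "Used By" to find out what still references the PVC
kubectl -n clark-developer-featbit describe pvc default-chi-clickhouse-replicated-0-0-0

# List pods in the namespace together with the claims they mount
kubectl -n clark-developer-featbit get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.volumes[*].persistentVolumeClaim.claimName}{"\n"}{end}'

# Last resort: clear the finalizers so the pending delete can complete
kubectl -n clark-developer-featbit patch pvc default-chi-clickhouse-replicated-0-0-0 \
  --type=merge -p '{"metadata":{"finalizers":null}}'
```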
E0402 12:08:00.875175 1 creator.go:175] updatePersistentVolumeClaim():clark-developer-featbit/default-chi-clickhouse-replicated-0-1-0:unable to Update PVC err: Operation cannot be fulfilled on persistentvolumeclaims "default-chi-clickhouse-replicated-0-1-0": the object has been modified; please apply your changes to the latest version and try again
E0402 12:08:00.875219 1 worker-chi-reconciler.go:1000] reconcilePVCFromVolumeMount():ERROR unable to reconcile PVC(clark-developer-featbit/default-chi-clickhouse-replicated-0-1-0) err: Operation cannot be fulfilled on persistentvolumeclaims "default-chi-clickhouse-replicated-0-1-0": the object has been modified; please apply your changes to the latest version and try again
This means something, most likely ArgoCD, modified the PVC concurrently.
Could you try deploying the CHI without ArgoCD and then rescaling?
Is there a way to make it work with argo?
These errors alone cannot lead to PVC deletion. I wonder if it was actually ArgoCD that deleted it?
@tman5 Assuming you are using Argo CD, can you describe how you have configured CI/CD and exactly what steps you take to change the volume size? It seems possible that multiple actors are trying to manage the CHI resources, or at least the underlying volume.
P.S. Argo CD is normally fine with changes to storage size; I've done it many times on AWS EBS volumes.
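If the conflict turns out to be Argo CD reverting fields the operator (or the CSI resizer) mutates, one common approach is to tell Argo CD to ignore those differences. A sketch, not tested against this setup, of an addition to the Application spec below:

```yaml
# Hypothetical addition to the Argo CD Application: stop Argo CD from
# fighting over PVC storage requests changed by the operator/CSI resizer.
spec:
  ignoreDifferences:
  - group: ""
    kind: PersistentVolumeClaim
    jsonPointers:
    - /spec/resources/requests/storage
  syncPolicy:
    syncOptions:
    - RespectIgnoreDifferences=true
```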
This is my argo-cd config:
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: clickhouse
  namespace: argo-cd
spec:
  destination:
    namespace: clickhouse
    server: https://kube-server
  project: dev
  source:
    path: ./overlays/dev1/clickhouse
    repoURL: https://repo.local
    targetRevision: master
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    retry:
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m0s
      limit: 2
    syncOptions:
    - CreateNamespace=true
    - PruneLast=true
    - PrunePropagationPolicy=foreground
    - ServerSideApply=true
    - --sync-hook-timeout=60s
    - --sync-wait=60s
It points to a repo that has a kustomize file:
---
kind: Kustomization
apiVersion: kustomize.config.k8s.io/v1beta1
resources:
- ../../../base/clickhouse-keeper/
- ../clickhouse-operator/
- manifest.yml
- clickhouse-backup-rw-password.yml
namespace: clickhouse
...
Then the manifest file is what I posted above. I edit the PVC size in that manifest, commit it to the repo, and then let Argo do its thing.
In the clickhouse-operator directory, this is the kustomize file:
---
kind: Kustomization
apiVersion: kustomize.config.k8s.io/v1beta1
helmCharts:
- name: altinity-clickhouse-operator
  releaseName: clickhouse-operator
  namespace: clickhouse
  repo: https://docs.altinity.com/clickhouse-operator/
  version: 0.22.2
  valuesInline:
    configs:
      configdFiles:
        01-clickhouse-02-logger.xml: |
          <!-- IMPORTANT -->
          <!-- This file is auto-generated -->
          <!-- Do not edit this file - all changes would be lost -->
          <!-- Edit appropriate template in the following folder: -->
          <!-- deploy/builder/templates-config -->
          <!-- IMPORTANT -->
          <yandex>
            <logger>
              <!-- Possible levels: https://github.com/pocoproject/poco/blob/develop/Foundation/include/Poco/Logger.h#L105 -->
              <level>warning</level>
              <log>/var/log/clickhouse-server/clickhouse-server.log</log>
              <errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
              <size>1000M</size>
              <count>10</count>
              <!-- Default behavior is autodetection (log to console if not daemon mode and is tty) -->
              <console>1</console>
            </logger>
          </yandex>
...
@tman5, it is possible that there is a conflict between ArgoCD and the operator. Try altering the operator configuration to remove labels from dependent objects, including PVCs:
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseOperatorConfiguration"
metadata:
  name: "exclude-argocd-label"
spec:
  label:
    exclude:
    - argocd.argoproj.io/instance
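Assuming the operator runs in the clickhouse namespace (as in the kustomization above), applying this ClickHouseOperatorConfiguration there should let the operator pick it up; the deployment name is an assumption from the Helm release name, so adjust as needed:

```shell
# Apply the config object in the operator's namespace
kubectl -n clickhouse apply -f exclude-argocd-label.yaml

# Check the operator log to confirm the configuration was reloaded
kubectl -n clickhouse logs deploy/clickhouse-operator | grep -i config
```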
When trying to expand the PVC via the volume template, the operator deletes and re-creates the PVCs instead of just resizing them. We are using Rook-Ceph as the storage provider and have successfully resized PVCs there without delete/re-create; we can also manually edit the PVC itself and it will expand. We are using version 0.22.2 of the operator, and I've reproduced this in multiple clusters.
We have also tried it without the storageManagement options, and that just results in a loop where the operator continually tries to delete and re-create the PVCs.
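For reference, the storageManagement knob mentioned above lives in the CHI spec; my understanding is that setting the provisioner to StorageClass delegates the PVC lifecycle (including resize) to the StorageClass/CSI driver rather than the operator. A minimal sketch, with names assumed from the PVC posted in this thread:

```yaml
apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: clickhouse
spec:
  defaults:
    storageManagement:
      provisioner: StorageClass   # let the CSI driver handle resize
      reclaimPolicy: Retain
  templates:
    volumeClaimTemplates:
    - name: default
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 60Gi
        storageClassName: rook-ceph-block
```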