Closed Fenrur closed 8 months ago
I wanted to provide an additional note: there is no problem when you use a ReadWriteMany storage class, such as the ceph-filesystem storage class.
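For anyone wanting to try that, a hedged example (assuming the chart honours the usual Bitnami global.storageClass override; the PVC access modes may also need adjusting in values.yaml) would be:
$ helm install influxdb influxdb --repo https://charts.bitnami.com/bitnami \
    -f values.yaml --set global.storageClass=ceph-filesystem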
Hi @Fenrur
I don't exactly know the internals of Rook Ceph block storage, but to me it all points in that direction. As you stated, the chart is designed so that the backup process is executed correctly even if the volumes used are ReadWriteOnce.
On my side, I am also using a 3-node cluster (without Rook, bear this in mind) and everything is executing smoothly. Can we get the output of the following commands to try and see what is happening?
# Get the list of PV, PVC and Bindings before the installation happens
$ kubectl get volumeattachments
$ kubectl get pv
$ kubectl get pvc
# Install the chart using your custom params and verify the job is scheduled
$ helm install influxdb influxdb --repo https://charts.bitnami.com/bitnami -f values.yaml
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
influxdb-55c9d64f8-sdv55 0/1 ContainerCreating 0 9s
influxdb-backup-28371687-gs6c8 0/1 Init:0/1 0 8s
# Get the list of PV, PVC and Bindings again WHEN THE JOB IS SCHEDULED to run.
$ kubectl get volumeattachments
Warning: Use tokens from the TokenRequest API or manually created secret-based tokens instead of auto-generated secret-based tokens.
NAME ATTACHER PV NODE ATTACHED AGE
csi-c3621afd609527601412cc91ea6038629c6d1a4512a42bf74ec03f6360e2bcd3 pd.csi.storage.gke.io pvc-eaadd109-b70f-488a-a5a1-3aa1f793bb9f gke-environment-0prc-nodepool-workers-5dcee8f3-md7f true 11s
csi-ce36e0c5c8c6f0ddcc72e423a0d1d4f41ccce2e1f289296a97a13bb9c2ce57ae pd.csi.storage.gke.io pvc-e648e7dd-69f0-4638-981f-ed986636a065 gke-environment-0prc-nodepool-workers-5dcee8f3-rq01 true 4m6s
$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-e648e7dd-69f0-4638-981f-ed986636a065 8Gi RWO Delete Bound test/influxdb standard-rwo 67s
pvc-eaadd109-b70f-488a-a5a1-3aa1f793bb9f 8Gi RWO Delete Bound test/influxdb-backups standard-rwo 67s
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
influxdb Bound pvc-e648e7dd-69f0-4638-981f-ed986636a065 8Gi RWO standard-rwo 86s
influxdb-backups Bound pvc-eaadd109-b70f-488a-a5a1-3aa1f793bb9f 8Gi RWO standard-rwo 86s
@joancafom
$ kubectl get volumeattachments
NAME ATTACHER PV NODE ATTACHED AGE
csi-6947c506366e2c62840a1be8a880d604e780889a635060f1ea4cca5e2579625c rook-ceph.rbd.csi.ceph.com pvc-086f8381-176a-4239-9c7c-9fc34e111d98 k8s-3 true 2m58s
csi-9be9cb9db1e513e04c4a487676f6553d6a9f4c0f6382e94da33f525c7c604a8b rook-ceph.rbd.csi.ceph.com pvc-3f352d2a-a92e-4d71-af77-1c2943a0a993 k8s-1 true 2m10s
$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-086f8381-176a-4239-9c7c-9fc34e111d98 8Gi RWO Delete Bound gtb-supervision/influxdb ceph-block 3m25s
pvc-3f352d2a-a92e-4d71-af77-1c2943a0a993 8Gi RWO Delete Bound gtb-supervision/influxdb-backups ceph-block 3m25s
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
influxdb Bound pvc-086f8381-176a-4239-9c7c-9fc34e111d98 8Gi RWO ceph-block 4m2s
influxdb-backups Bound pvc-3f352d2a-a92e-4d71-af77-1c2943a0a993 8Gi RWO ceph-block 4m2s
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
influxdb-748bc7d557-nfm5b 1/1 Running 0 4m32s
influxdb-backup-28371719-69kdr 0/2 Init:CrashLoopBackOff 5 (50s ago) 3m43s
$ kubectl describe pod influxdb-backup-28371724-blxj2
Name: influxdb-backup-28371724-blxj2
Namespace: gtb-supervision
Priority: 0
Service Account: default
Node: k8s-2/172.18.8.102
Start Time: Mon, 11 Dec 2023 15:04:45 +0100
Labels: app.kubernetes.io/instance=influxdb
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=influxdb
app.kubernetes.io/version=2.7.4
batch.kubernetes.io/controller-uid=abe774b5-a8cd-4b37-9970-a8205962b5bc
batch.kubernetes.io/job-name=influxdb-backup-28371724
controller-uid=abe774b5-a8cd-4b37-9970-a8205962b5bc
helm.sh/chart=influxdb-5.11.0
job-name=influxdb-backup-28371724
kuik.enix.io/images-rewritten=true
Annotations: original-image-aws-cli: docker.io/bitnami/aws-cli:2.13.30-debian-11-r0
original-image-influxdb-backup-dummy-container: docker.io/bitnami/influxdb:2.7.4-debian-11-r0
original-init-image-influxdb-backup: docker.io/bitnami/influxdb:2.7.4-debian-11-r0
Status: Pending
IP: 10.233.65.184
IPs:
IP: 10.233.65.184
Controlled By: Job/influxdb-backup-28371724
Init Containers:
influxdb-backup:
Container ID: containerd://2666f3a8a9a6b21f91c457a0d3f21993d6fcac3999c7afd64dd9f8cdc16d5255
Image: localhost:7439/docker.io/bitnami/influxdb:2.7.4-debian-11-r0
Image ID: docker.io/bitnami/influxdb@sha256:cdf692c3adf2a7fa7e6aa4f45300596696a13de51b07b02bf2a85d22ff909d09
Port: <none>
Host Port: <none>
SeccompProfile: RuntimeDefault
Command:
/tmp/backup.sh
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Mon, 11 Dec 2023 15:05:12 +0100
Finished: Mon, 11 Dec 2023 15:05:12 +0100
Ready: False
Restart Count: 2
Environment:
INFLUXDB_ADMIN_USER_PASSWORD: <set to the key 'admin-user-password' in secret 'influxdb'> Optional: false
INFLUXDB_ADMIN_USER_TOKEN: <set to the key 'admin-user-token' in secret 'influxdb'> Optional: false
Mounts:
/backups from influxdb-backups (rw)
/tmp/backup.sh from backup-scripts (rw,path="backup.sh")
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dghhk (ro)
Containers:
influxdb-backup-dummy-container:
Container ID:
Image: localhost:7439/docker.io/bitnami/influxdb:2.7.4-debian-11-r0
Image ID:
Port: <none>
Host Port: <none>
SeccompProfile: RuntimeDefault
Command:
/bin/true
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dghhk (ro)
aws-cli:
Container ID:
Image: localhost:7439/docker.io/bitnami/aws-cli:2.13.30-debian-11-r0
Image ID:
Port: <none>
Host Port: <none>
SeccompProfile: RuntimeDefault
Command:
/tmp/upload-aws.sh
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
AWS_ACCESS_KEY_ID: <set to the key 'accessKeyID' in secret 'influxdb-backup-bucket-s3'> Optional: false
AWS_SECRET_ACCESS_KEY: <set to the key 'secretAccessKey' in secret 'influxdb-backup-bucket-s3'> Optional: false
AWS_DEFAULT_REGION: <set to the key 'region' in secret 'influxdb-backup-bucket-s3'> Optional: false
AWS_ENDPOINT_URL_S3: <set to the key 'endpoint' in secret 'influxdb-backup-bucket-s3'> Optional: false
Mounts:
/backups from influxdb-backups (rw)
/tmp/upload-aws.sh from backup-scripts (rw,path="upload-aws.sh")
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dghhk (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
backup-scripts:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: influxdb-backup
Optional: false
influxdb-backups:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: influxdb-backups
ReadOnly: false
kube-api-access-dghhk:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 46s default-scheduler Successfully assigned gtb-supervision/influxdb-backup-28371724-blxj2 to k8s-2
Warning FailedAttachVolume 46s attachdetach-controller Multi-Attach error for volume "pvc-3f352d2a-a92e-4d71-af77-1c2943a0a993" Volume is already exclusively attached to one node and can't be attached to another
Normal SuccessfulAttachVolume 40s attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-3f352d2a-a92e-4d71-af77-1c2943a0a993"
Normal Pulled 19s (x3 over 35s) kubelet Container image "localhost:7439/docker.io/bitnami/influxdb:2.7.4-debian-11-r0" already present on machine
Normal Created 19s (x3 over 35s) kubelet Created container influxdb-backup
Normal Started 19s (x3 over 35s) kubelet Started container influxdb-backup
Warning BackOff 7s (x3 over 33s) kubelet Back-off restarting failed container influxdb-backup in pod influxdb-backup-28371724-blxj2_gtb-supervision(4d0acec4-94cd-49d2-9165-05aa53e68fa8)
Hi @Fenrur
As per the logs, it seems the PVC is attached to node k8s-3 even though the pod that will use it is scheduled on node k8s-2. This leads to the error you mentioned: the pod cannot attach a volume that is already exclusively attached to another node.
I believe this has something to do with Rook, and there is even a specific entry for it under the Volume Attachment section in their common problems and troubleshooting guide:
If any issue exists in attaching the PVC to the application pod first check the volumeattachment object created and also log from csi-attacher sidecar container in provisioner pod
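For reference, following that advice would look roughly like this, assuming the default rook-ceph namespace and the usual csi-rbdplugin-provisioner deployment name (adjust to your cluster):
# Inspect the VolumeAttachment objects for the stuck PVC
$ kubectl get volumeattachments
# Check the csi-attacher sidecar logs in the RBD provisioner pod
$ kubectl -n rook-ceph logs deploy/csi-rbdplugin-provisioner -c csi-attacher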
I figured out where the problem comes from:
There are permission problems in the cronjob's init container:
finding buckets in org primary
backuping _monitoring bucket to /backups/primary/_monitoring
mkdir: cannot create directory '/backups/primary': Permission denied
or
deleting old backups
find: '/backups/lost+found': Permission denied
I imagine there is a Kubernetes mechanism where, after the job fails to start a certain number of times, it gets rescheduled onto another node, and Rook Ceph does not like that node change when attaching the volume.
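This can be cross-checked by comparing the node that currently holds the attachment of the backups volume with the node the job pod was scheduled on, for example:
# Which node currently holds the attachment of each PV
$ kubectl get volumeattachments
# Which node the backup job pod landed on (the job-name label is set on job pods)
$ kubectl -n gtb-supervision get pods -l batch.kubernetes.io/job-name -o wide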
Now, why do I have permission problems?
✅ With the storageclass ceph-filesystem there are no permission problems.
✅ With the storageclass standard (from minikube) there are no permission problems.
✅ With the storageclass local-path (from Rancher) there are no permission problems.
❌ With the storageclass ceph-block there are permission problems.
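One way to double-check this per storage class (all names below are illustrative, and the image is simply reused from the chart) is to mount a scratch PVC from the class under test and list its root as the chart's non-root user:
$ kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: perm-test
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ceph-block
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: perm-test
spec:
  securityContext:
    runAsUser: 1001
  containers:
    - name: probe
      image: docker.io/bitnami/influxdb:2.7.4-debian-11-r0
      command: ["sh", "-c", "ls -ld /mnt; id; touch /mnt/probe && echo writable || echo 'not writable'; sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /mnt
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: perm-test
EOF
$ kubectl logs perm-test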
However, in my Rook Ceph operator chart I have the same permissions configuration for both:
csi:
# -- Policy for modifying a volume's ownership or permissions when the RBD PVC is being mounted.
# supported values are documented at https://kubernetes-csi.github.io/docs/support-fsgroup.html
rbdFSGroupPolicy: "File"
# -- Policy for modifying a volume's ownership or permissions when the CephFS PVC is being mounted.
# supported values are documented at https://kubernetes-csi.github.io/docs/support-fsgroup.html
cephFSFSGroupPolicy: "File"
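Note that rbdFSGroupPolicy: "File" only tells Kubernetes that the driver supports adjusting ownership via fsGroup; the adjustment itself only happens when the pod declares an fsGroup in its securityContext. A minimal illustration of the pod-level field it acts on (a hypothetical pod, not a chart value) looks like this:
apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-example            # hypothetical name, for illustration only
spec:
  securityContext:
    fsGroup: 1001                  # kubelet chgrps the volume to this GID and adds group rw
  containers:
    - name: app
      image: docker.io/bitnami/influxdb:2.7.4-debian-11-r0
      volumeMounts:
        - name: backups
          mountPath: /backups
  volumes:
    - name: backups
      persistentVolumeClaim:
        claimName: influxdb-backups
Whether the chart exposes such a pod-level securityContext for the backup cronjob is worth checking in its values.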
Hi @Fenrur
Now, why do I have permission problems?
Let's try to see what the permissions of that volume are and figure out what we can do 😁 Please apply the following modifications to configmap-backup.yaml:
...
data:
  backup.sh: |-
    #!/bin/bash
    set -e
    . /opt/bitnami/scripts/libinfluxdb.sh

    DATE="$(date +%Y%m%d_%H%M%S)"
    host="{{ include "common.names.fullname" . }}.{{ .Release.Namespace }}.svc"

    get_orgs() {
        INFLUX_TOKEN="${INFLUXDB_ADMIN_USER_TOKEN}" influx org list --host "http://${host}:{{ coalesce .Values.influxdb.service.ports.http .Values.influxdb.service.port }}" 2> /dev/null | grep -v 'ID' | awk -F '\t' 'BEGIN{ORS=" "} {print $2}'
    }

    get_databases() {
        local org_name="${1:-}"
        INFLUX_TOKEN="${INFLUXDB_ADMIN_USER_TOKEN}" influx bucket list --host "http://${host}:{{ coalesce .Values.influxdb.service.ports.http .Values.influxdb.service.port }}" --org "${org_name}" 2> /dev/null | grep -v 'ID' | awk -F '\t' 'BEGIN{ORS=" "} {print $2}'
    }

    for ORG in $(get_orgs); do
        echo "finding buckets in org ${ORG}"
        for BUCKET in $(get_databases "${ORG}"); do
            backup_dir="{{ .Values.backup.directory }}/${ORG}/${BUCKET}"
+           echo "####### DEBUG #######"
+           ls -la "{{ .Values.backup.directory }}"
+           ls -la "{{ .Values.backup.directory }}/${ORG}"
+           ls -la "$backup_dir"
+           id -a
+           echo "####### END #######"
            echo "backuping ${BUCKET} bucket to ${backup_dir}"
            mkdir -p "${backup_dir}"
            INFLUX_TOKEN="${INFLUXDB_ADMIN_USER_TOKEN}" influx backup --host "http://${host}:{{ coalesce .Values.influxdb.service.ports.http .Values.influxdb.service.port }}" --bucket "${BUCKET}" "${backup_dir}/${DATE}"
        done
    done

    echo "deleting old backups"
    find {{ .Values.backup.directory }} -mindepth 3 -maxdepth 3 -not -name ".snapshot" -not -name "lost+found" -type d -mtime +{{ .Values.backup.retentionDays }} -exec rm -r {} \;
...
Share the logs of the init-container after applying the modifications.
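Once the job fires again, the init container output can be collected with something like this (the pod name is the one from the describe above; yours will differ):
$ kubectl -n gtb-supervision logs influxdb-backup-28371724-blxj2 -c influxdb-backup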
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.
Hello @joancafom, I was unavailable for a month (can you reopen the issue?). I'm getting back to the InfluxDB subject. This is what I get from what you asked me:
I modified the file with:
echo "####### DEBUG #######"
ls -la "{{ .Values.backup.directory }}" || true
ls -la "{{ .Values.backup.directory }}/${ORG}" || true
ls -la "$backup_dir" || true
id -a
echo "####### END #######"
Influx backup logs:
####### DEBUG #######
total 24
drwxr-xr-x 3 root root 4096 Jan 15 11:35 .
drwxr-xr-x 1 root root 4096 Jan 15 11:35 ..
drwx------ 2 root root 16384 Jan 15 11:35 lost+found
ls: cannot access '/backups/primary': No such file or directory
ls: cannot access '/backups/primary/_monitoring': No such file or directory
uid=1001 gid=0(root) groups=0(root)
####### END #######
Hi @Fenrur
I do not know the reason why using ceph-block sets different permissions, but based on your outputs it seems the folder is not writable by the root group. This is what ultimately leads to the error in your case.
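To spell out the failing check from that output: the volume root is owned by root:root with mode 755, and the script runs as uid 1001 whose only group is gid 0, so it matches the group bits (r-x) and has no write permission, which is exactly why mkdir fails:
$ ls -ld /backups
drwxr-xr-x 3 root root 4096 Jan 15 11:35 /backups
$ id
uid=1001 gid=0(root) groups=0(root)
$ mkdir -p /backups/primary
mkdir: cannot create directory '/backups/primary': Permission denied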
A quick fix for this is to use the containerSecurityContext feature to make the initContainer run as root, if that is an option for you:
containerSecurityContext:
  enabled: true
  seLinuxOptions: {}
- runAsUser: 1001
- runAsNonRoot: true
+ runAsUser: 0
+ runAsNonRoot: false
  privileged: false
- readOnlyRootFilesystem: true
+ readOnlyRootFilesystem: false
Could you please try it?
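Applying it should just be a values change followed by an upgrade of the release, for example:
$ helm upgrade influxdb influxdb --repo https://charts.bitnami.com/bitnami -f values.yaml
$ kubectl -n gtb-supervision get pods -w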
Hello @joancafom, Thanks for your help, your changes work! I'm going to look into configuring Rook for permissions.
I wish you a very happy day from France and thank you for giving me your time!
Glad to see it worked! Closing this thread now 😁
Name and Version
bitnami/influxdb 5.11.0
What architecture are you using?
What steps will reproduce the bug?
Are you using any custom parameters or values?
values.yaml
script.bash
What is the expected behavior?
As per the chart's "Important Backup Process Consideration", the backup process should work even when the volumes are ReadWriteOnce, i.e. for both the InfluxDB and InfluxDB backups volumes.
What do you see instead?
Event Pod: influxdb-backup-28365756-t6qgh
The backup pod indicates that it mounts the volume of influxdb; in Lens we observe the opposite:
pvc-7da5daa0-e703-405c-8c6f-e59415884864 associated to the PersistentVolumeClaim influxdb
pvc-81bcf421-652b-470d-bd28-9708f602d303 associated to the PersistentVolumeClaim influxdb-backups
Additional information
No response