Closed: nickatnceas closed this issue 3 months ago
After spending some time Googling, Velero appears to be a frequently recommended, free, open-source backup option for K8s.
I would like to try this option instead of backend RBD snapshot mirroring and rsync, since it appears to solve the initial problem of backing up and deleting unused pods from the cluster, and it has other useful-looking features, such as snapshotting the entire K8s state before an upgrade and being able to restore back to that state.
Some additional information:
Velero uses object storage to store backups and associated artifacts. It also optionally integrates with supported block storage systems to snapshot your persistent volumes. Before beginning the installation process, you should identify the object storage provider and optional block storage provider(s) you’ll be using from the list of compatible providers.
For the block storage provider, it looks like we can use the Container Storage Interface (CSI), with a fallback of File System Backup if that doesn't work.
For the object storage provider, we can set up a Ceph Object Gateway on the Anacapa Ceph cluster. This is another project, but it will be useful for learning how to set up a Ceph Object Gateway for production use on the DataONE Ceph cluster later on.
Object storage is now available on the Anacapa Ceph cluster: https://github.nceas.ucsb.edu/NCEAS/Computing/issues/254
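With object storage available, the install presumably looks something like the following. This is a hedged sketch: the plugin version, bucket name, endpoint URL, and credentials file are placeholders, not the values actually used:

```shell
# Sketch only: bucket, s3Url, and credentials file are hypothetical placeholders.
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket velero-backups \
  --secret-file ./credentials-velero \
  --features=EnableCSI \
  --use-node-agent \
  --default-volumes-to-fs-backup \
  --backup-location-config region=default,s3ForcePathStyle="true",s3Url=https://objects.example.org
```

`--features=EnableCSI` turns on CSI snapshot support, while `--use-node-agent` plus `--default-volumes-to-fs-backup` enables File System Backup; which flags are set determines whether FSB is the default or only the fallback.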
I installed Velero on k8s-dev, and after some issues have a partially successful backup:
outin@halt:~/velero/velero-v1.12.0-darwin-amd64$ velero backup describe backup-fsb-6
Name: backup-fsb-6
Namespace: velero
Labels: velero.io/storage-location=default
Annotations: velero.io/resource-timeout=10m0s
velero.io/source-cluster-k8s-gitversion=v1.22.0
velero.io/source-cluster-k8s-major-version=1
velero.io/source-cluster-k8s-minor-version=22
Phase: PartiallyFailed (run `velero backup logs backup-fsb-6` for more information)
Warnings:
Velero: <none>
Cluster: <none>
Namespaces:
hwitw: resource: /pods name: /cdstool-job--1-n7n6n error: /backup for volume hwitw-data is skipped: pod is not in the expected status, name=cdstool-job--1-n7n6n, namespace=hwitw, phase=Failed: pod is not running
resource: /pods name: /cdstool-job--1-q5hj2 error: /backup for volume hwitw-data is skipped: pod is not in the expected status, name=cdstool-job--1-q5hj2, namespace=hwitw, phase=Failed: pod is not running
resource: /pods name: /cdstool-job--1-sdv4x error: /backup for volume hwitw-data is skipped: pod is not in the expected status, name=cdstool-job--1-sdv4x, namespace=hwitw, phase=Failed: pod is not running
resource: /pods name: /cdstool-job--1-sqlrk error: /backup for volume hwitw-data is skipped: pod is not in the expected status, name=cdstool-job--1-sqlrk, namespace=hwitw, phase=Failed: pod is not running
resource: /pods name: /cdstool-job--1-st98r error: /backup for volume hwitw-data is skipped: pod is not in the expected status, name=cdstool-job--1-st98r, namespace=hwitw, phase=Failed: pod is not running
resource: /pods name: /cdstool-job--1-xqp78 error: /backup for volume hwitw-data is skipped: pod is not in the expected status, name=cdstool-job--1-xqp78, namespace=hwitw, phase=Failed: pod is not running
resource: /pods name: /cdstool-job--1-xxnrv error: /backup for volume hwitw-data is skipped: pod is not in the expected status, name=cdstool-job--1-xxnrv, namespace=hwitw, phase=Failed: pod is not running
polder: resource: /pods name: /setup-gleaner--1-5qvwv error: /backup for volume gleaner-context is skipped: pod is not in the expected status, name=setup-gleaner--1-5qvwv, namespace=polder, phase=Failed: pod is not running
resource: /pods name: /setup-gleaner--1-8fwt9 error: /backup for volume gleaner-context is skipped: pod is not in the expected status, name=setup-gleaner--1-8fwt9, namespace=polder, phase=Failed: pod is not running
resource: /pods name: /setup-gleaner--1-bdbzr error: /backup for volume gleaner-context is skipped: pod is not in the expected status, name=setup-gleaner--1-bdbzr, namespace=polder, phase=Failed: pod is not running
resource: /pods name: /setup-gleaner--1-lp2lx error: /backup for volume gleaner-context is skipped: pod is not in the expected status, name=setup-gleaner--1-lp2lx, namespace=polder, phase=Failed: pod is not running
resource: /pods name: /setup-gleaner--1-pd9dm error: /backup for volume gleaner-context is skipped: pod is not in the expected status, name=setup-gleaner--1-pd9dm, namespace=polder, phase=Failed: pod is not running
resource: /pods name: /setup-gleaner--1-swmmc error: /backup for volume gleaner-context is skipped: pod is not in the expected status, name=setup-gleaner--1-swmmc, namespace=polder, phase=Failed: pod is not running
resource: /pods name: /setup-gleaner--1-vffcn error: /backup for volume gleaner-context is skipped: pod is not in the expected status, name=setup-gleaner--1-vffcn, namespace=polder, phase=Failed: pod is not running
Errors:
Velero: name: /hwitw-7c7b669857-lz7vc error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
name: /d1index-idxworker-5fd67f856b-hdbvx error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
name: /metadig-controller-595d76dc6c-j6ht7 error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
name: /dev-gleaner-74bf949f4f-555nm error: /pod volume backup failed: data path backup failed: Failed to run kopia backup: Error when processing data/repositories/polder/storage/pos: ConcatenateObjects is not supported
Error when processing data/repositories/polder/storage/pso: ConcatenateObjects is not supported
name: /dev-gleaner-74bf949f4f-555nm error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
Cluster: <none>
Namespaces: <none>
Namespaces:
Included: *
Excluded: <none>
Resources:
Included: *
Excluded: <none>
Cluster-scoped: auto
Label selector: <none>
Storage Location: default
Velero-Native Snapshot PVs: auto
TTL: 720h0m0s
CSISnapshotTimeout: 10m0s
ItemOperationTimeout: 4h0m0s
Hooks: <none>
Backup Format Version: 1.1.0
Started: 2023-09-29 13:28:45 -0700 PDT
Completed: 2023-09-29 14:22:19 -0700 PDT
Expiration: 2023-10-29 13:28:45 -0700 PDT
Total items to be backed up: 2206
Items backed up: 2206
Velero-Native Snapshots: <none included>
kopia Backups (specify --details for more information):
Completed: 66
Failed: 5
I started docs at https://github.com/DataONEorg/k8s-cluster/blob/main/admin/backup.md
I ran new backups and the previous errors are gone, however some new pods are reporting errors:
Errors:
Velero: name: /hwitw-5d876bf94c-khh26 error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
name: /d1index-idxworker-5fd67f856b-hdbvx error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
name: /metadig-controller-595d76dc6c-j6ht7 error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
Cluster: <none>
Namespaces: <none>
I ran two backups, and the same three pods failed in both. When viewing the pods with kubectl they show as Running, for example:
outin@halt:~/velero/velero-v1.12.0-darwin-amd64$ kubectl get pod -n hwitw hwitw-5d876bf94c-khh26 -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
hwitw-5d876bf94c-khh26 1/1 Running 0 3d17h 192.168.108.38 k8s-dev-node-1 <none> <none>
While I don't yet understand why these backups are failing, it's possible these are transient errors; I'm going to try the backup again in a few days to see if they clear.
I also set up a nightly backup schedule:
velero schedule create k8s-dev-daily --schedule="0 1 * * *"
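The cron spec `0 1 * * *` runs the backup daily at 01:00. Backups created by a schedule get Velero's default 720h (30-day) TTL unless one is set explicitly; a hedged sketch of making the retention explicit and checking the result (the TTL value is illustrative):

```shell
# TTL value is illustrative; 720h0m0s is Velero's default retention anyway.
velero schedule create k8s-dev-daily --schedule="0 1 * * *" --ttl 720h0m0s

# Confirm the schedule exists and watch for the first scheduled backup:
velero schedule get
velero backup get
```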
I need to reconfigure Velero K8s backups, as the Anacapa Ceph cluster has been shut down. My current plan is to install MinIO on host-ucsb-26 at Anacapa and use it to receive backups from Velero (Velero requires backing up to object storage). I started setting up host-ucsb-26 today, and hope to have K8s backups running next week.
After spending the time to enable RBD and CephFS snapshots on the two K8s clusters, I set up snapshot-based backups with Velero. I tested these successfully using the csi-rbd-sc storage class; however, PVs without a storage class (i.e., any that were created manually, such as gnis/cephfs-gnis-pvc) do not have snapshot support and were not backed up. Adding snapshot support to all manually created PVs seems time-consuming and impractical.
Errors:
Velero: name: /gnis-74c6f7c6df-8vzf2 message: /Error backing up item error: /error executing custom action (groupResource=persistentvolumeclaims, namespace=gnis, name=cephfs-gnis-pvc): rpc error: code = Unknown desc = Cannot snapshot PVC gnis/cephfs-gnis-pvc, PVC has no storage class.
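A quick way to see which PVCs would hit this error is to list each PVC's storage class and look for empty entries; a sketch (the column labels are just names I chose):

```shell
# PVCs whose STORAGECLASS prints <none> have no storage class and
# cannot be backed up via CSI snapshots.
kubectl get pvc --all-namespaces \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,STORAGECLASS:.spec.storageClassName
```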
I switched to Velero FSB backups, which can back up all types of PVs but don't have the consistency of snapshots. I have not figured out a way to get Velero to use both types of backups, but that may become an option in the future. With FSB instead of CSI snapshots, the backup of the GNIS namespace completed successfully. I updated the docs with both CSI and FSB install commands.
I started a full backup of the k8s-dev cluster, which succeeded except for the following:
Errors:
Velero: name: /ekbrooke-elasticsearch-data-1 message: /Error backing up item error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
name: /metacatbrooke-dataone-indexer-845fd4c5f5-cnwc2 message: /Error backing up item error: /pod volume backup failed: error exposing host path for pod volume: error identifying unique volume path on host for volume metacatbrooke-temp-tripledb-volume in pod metacatbrooke-dataone-indexer-845fd4c5f5-cnwc2: expected one matching path: /host_pods/e23f6cc5-acfe-411f-8ea9-69fb2bc075c8/volumes/*/metacatbrooke-temp-tripledb-volume, got 0
name: /metacatbrooke-dataone-indexer-845fd4c5f5-rzzcw message: /Error backing up item error: /pod volume backup failed: error exposing host path for pod volume: error identifying unique volume path on host for volume metacatbrooke-temp-tripledb-volume in pod metacatbrooke-dataone-indexer-845fd4c5f5-rzzcw: expected one matching path: /host_pods/0bb2f50c-04a0-4a1d-9559-49aa7f73a80e/volumes/*/metacatbrooke-temp-tripledb-volume, got 0
name: /metacatbrooke-dataone-indexer-845fd4c5f5-vkbcx message: /Error backing up item error: /pod volume backup failed: error exposing host path for pod volume: error identifying unique volume path on host for volume metacatbrooke-temp-tripledb-volume in pod metacatbrooke-dataone-indexer-845fd4c5f5-vkbcx: expected one matching path: /host_pods/204f09ca-7e9f-402d-b0e2-12f5b62257b9/volumes/*/metacatbrooke-temp-tripledb-volume, got 0
name: /hwitw-67ccd577-rltrp message: /Error backing up item error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
name: /d1index-idxworker-5fd67f856b-hdbvx message: /Error backing up item error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
name: /metadig-controller-5ddff7d9fb-jxx77 message: /Error backing up item error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
I started checking on hwitw-67ccd577-rltrp and found OOM-kill error messages on the host server. This led me to the docs, where I discovered that Velero's default memory configuration is sized for about 100 GiB of data, while cephfs-hwitw-0 maps to /volumes/hwitw-subvol-group/hwitw-subvol/7cb7d655-7ba9-49d2-8dd6-c83a47ff38a1, which contains about 14 TiB of data.
I increased the default memory limits 8x and restarted the backup for hwitw. It appears to be working; it has not run out of memory quickly like it did during the test runs. I'm going to let it run over the weekend out of curiosity, but will probably end up skipping this volume in Velero and backing it up with rsync instead (if it even needs to be backed up).
Increasing the pod memory settings for velero and its agents appears to have fixed the error with large backups, and I was able to finish a backup of the hwitw namespace, of about 14 TiB:
outin@halt:~/velero$ velero backup get hwitw-backup-5
NAME STATUS ERRORS WARNINGS CREATED EXPIRES STORAGE LOCATION SELECTOR
hwitw-backup-5 Completed 0 0 2024-02-26 09:22:05 -0800 PST 29d default <none>
outin@halt:~$ mc du -r velero/k8s-dev
...
257KiB 12 objects k8s-dev/backups/hwitw-backup-5
14TiB 710551 objects k8s-dev/kopia/hwitw
...
I started a second incremental backup run of the entire cluster.
A full backup of k8s-dev completed except for three resources:
Failed:
brooke/metacatbrooke-dataone-indexer-845fd4c5f5-cnwc2: metacatbrooke-temp-tripledb-volume
brooke/metacatbrooke-dataone-indexer-845fd4c5f5-rzzcw: metacatbrooke-temp-tripledb-volume
brooke/metacatbrooke-dataone-indexer-845fd4c5f5-vkbcx: metacatbrooke-temp-tripledb-volume
Longer error messages are:
Errors:
Velero: name: /metacatbrooke-dataone-indexer-845fd4c5f5-cnwc2 message: /Error backing up item error: /pod volume backup failed: error exposing host path for pod volume: error identifying unique volume path on host for volume metacatbrooke-temp-tripledb-volume in pod metacatbrooke-dataone-indexer-845fd4c5f5-cnwc2: expected one matching path: /host_pods/e23f6cc5-acfe-411f-8ea9-69fb2bc075c8/volumes/*/metacatbrooke-temp-tripledb-volume, got 0
name: /metacatbrooke-dataone-indexer-845fd4c5f5-rzzcw message: /Error backing up item error: /pod volume backup failed: error exposing host path for pod volume: error identifying unique volume path on host for volume metacatbrooke-temp-tripledb-volume in pod metacatbrooke-dataone-indexer-845fd4c5f5-rzzcw: expected one matching path: /host_pods/0bb2f50c-04a0-4a1d-9559-49aa7f73a80e/volumes/*/metacatbrooke-temp-tripledb-volume, got 0
name: /metacatbrooke-dataone-indexer-845fd4c5f5-vkbcx message: /Error backing up item error: /pod volume backup failed: error exposing host path for pod volume: error identifying unique volume path on host for volume metacatbrooke-temp-tripledb-volume in pod metacatbrooke-dataone-indexer-845fd4c5f5-vkbcx: expected one matching path: /host_pods/204f09ca-7e9f-402d-b0e2-12f5b62257b9/volumes/*/metacatbrooke-temp-tripledb-volume, got 0
Cluster: <none>
Namespaces: <none>
I'm looking into fixing these errors.
It appears that the pod volumes have an issue with their storage:
$ kubectl describe pod metacatbrooke-dataone-indexer-845fd4c5f5-cnwc2 -n brooke
...
metacatbrooke-temp-tripledb-volume:
Type: EphemeralVolume (an inline specification for a volume that gets created and deleted with the pod)
StorageClass: csi-cephfs-sc-ephemeral
Volume:
Labels: <none>
Annotations: <none>
Capacity:
Access Modes:
VolumeMode: Filesystem
...
I'll check with @artntek before proceeding...
@nickatnceas - those volumes are ephemeral, basically acting as a short-term local cache - by definition, they are non-critical and will be regenerated as needed. If there's a way of excluding them from the backups, that would be the best bet, I think.
I excluded the three pod volumes from backup, and the namespace backup for brooke now completes successfully.
velero backup create brooke-backup-1 --include-namespaces brooke
kubectl -n brooke annotate pod/metacatbrooke-dataone-indexer-845fd4c5f5-cnwc2 backup.velero.io/backup-volumes-excludes=metacatbrooke-temp-tripledb-volume
velero backup create brooke-backup-2 --include-namespaces brooke
kubectl -n brooke annotate pod/metacatbrooke-dataone-indexer-845fd4c5f5-rzzcw backup.velero.io/backup-volumes-excludes=metacatbrooke-temp-tripledb-volume
kubectl -n brooke annotate pod/metacatbrooke-dataone-indexer-845fd4c5f5-vkbcx backup.velero.io/backup-volumes-excludes=metacatbrooke-temp-tripledb-volume
velero backup create brooke-backup-3 --include-namespaces brooke
NAME STATUS ERRORS WARNINGS CREATED EXPIRES STORAGE LOCATION SELECTOR
brooke-backup-1 PartiallyFailed 3 0 2024-02-27 16:11:08 -0800 PST 29d default <none>
brooke-backup-2 PartiallyFailed 2 0 2024-02-27 16:15:33 -0800 PST 29d default <none>
brooke-backup-3 Completed 0 0 2024-02-27 16:21:39 -0800 PST 29d default <none>
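One caveat with annotating pods directly: the annotation dies with the pod, and the hash-suffixed names change on every restart. The Velero FSB docs suggest putting the annotation in the Deployment's pod template instead, so replacement pods inherit it; a hedged sketch, where the Deployment name is inferred from the pod names and should be verified:

```shell
# Durable exclusion: annotate the pod template, not the individual pods.
# Deployment name is an assumption inferred from the pod names.
kubectl -n brooke patch deployment metacatbrooke-dataone-indexer --type merge \
  -p '{"spec":{"template":{"metadata":{"annotations":{"backup.velero.io/backup-volumes-excludes":"metacatbrooke-temp-tripledb-volume"}}}}}'
```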
I added the exclusion instructions to the backup docs, and started another full namespace backup.
A full backup run reported it completed!
There are a couple of warnings in the backup log, but they appear to be for broken pods, which are probably safe to ignore:
$ velero backup create full-backup-4
Backup request "full-backup-4" submitted successfully.
Run `velero backup describe full-backup-4` or `velero backup logs full-backup-4` for more details.
$ velero get backups
NAME STATUS ERRORS WARNINGS CREATED EXPIRES STORAGE LOCATION SE
full-backup-1 PartiallyFailed 7 0 2024-02-23 11:51:44 -0800 PST 25d default <none>
full-backup-2 PartiallyFailed 3 2 2024-02-26 09:41:42 -0800 PST 28d default <none>
full-backup-3 PartiallyFailed 3 2 2024-02-27 10:54:16 -0800 PST 29d default <none>
full-backup-4 Completed 0 2 2024-02-27 16:24:51 -0800 PST 29d default <none>
$ velero backup describe full-backup-4
Name: full-backup-4
Namespace: velero
Labels: velero.io/storage-location=default
Annotations: velero.io/resource-timeout=10m0s
velero.io/source-cluster-k8s-gitversion=v1.22.0
velero.io/source-cluster-k8s-major-version=1
velero.io/source-cluster-k8s-minor-version=22
Phase: Completed
Warnings:
Velero: <none>
Cluster: <none>
Namespaces:
pdgrun: resource: /pods name: /parsl-worker-1708968600101 message: /Skip pod volume pdgrun-dev-0 error: /pod is not in the expected status, name=parsl-worker-1708968600101, namespace=pdgrun, phase=Pending: pod is not running
resource: /pods name: /parsl-worker-1708968600248 message: /Skip pod volume pdgrun-dev-0 error: /pod is not in the expected status, name=parsl-worker-1708968600248, namespace=pdgrun, phase=Pending: pod is not running
Namespaces:
Included: *
Excluded: <none>
Resources:
Included: *
Excluded: <none>
Cluster-scoped: auto
Label selector: <none>
Or label selector: <none>
Storage Location: default
Velero-Native Snapshot PVs: auto
Snapshot Move Data: false
Data Mover: velero
TTL: 720h0m0s
CSISnapshotTimeout: 10m0s
ItemOperationTimeout: 4h0m0s
Hooks: <none>
Backup Format Version: 1.1.0
Started: 2024-02-27 16:24:51 -0800 PST
Completed: 2024-02-27 16:39:35 -0800 PST
Expiration: 2024-03-28 17:24:51 -0700 PDT
Total items to be backed up: 2654
Items backed up: 2654
Backup Volumes:
Velero-Native Snapshots: <none included>
CSI Snapshots: <none included>
Pod Volume Backups - kopia (specify --details for more information):
Completed: 85
HooksAttempted: 0
HooksFailed: 0
$ kubectl get pods -n pdgrun
NAME READY STATUS RESTARTS AGE
parsl-worker-1708968600101 0/1 InvalidImageName 0 31h
parsl-worker-1708968600248 0/1 InvalidImageName 0 31h
I made the following changes:
I updated the first comment to include new ticket requirements.
Received the following errors after running the full backup of k8s-prod:
Errors:
Velero: name: /gnis-6c7f9d9bb7-8mg4j message: /Error backing up item error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
name: /metadig-controller-7db96b7585-zb2dk message: /Error backing up item error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
name: /metadig-solr-0 message: /Error backing up item error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
name: /prod-gleaner-76df9dfc54-rp9x8 message: /Error backing up item error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
name: /prod-gleaner-76df9dfc54-rp9x8 message: /Error backing up item error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
Failed:
gnis/gnis-6c7f9d9bb7-8mg4j: gnis-volume
metadig/metadig-controller-7db96b7585-zb2dk: metadig-pv
metadig/metadig-solr-0: data
polder/prod-gleaner-76df9dfc54-rp9x8: s3system-volume, triplestore-volume
I was able to fix the failing backups by increasing the memory limits on Velero and its pods to double what I used on k8s-dev:
kubectl patch daemonset node-agent -n velero --patch '{"spec":{"template":{"spec":{"containers":[{"name": "node-agent", "resources": {"limits":{"cpu": "4", "memory": "16384Mi"}, "requests": {"cpu": "2", "memory": "8192Mi"}}}]}}}}'
kubectl patch deployment velero -n velero --patch '{"spec":{"template":{"spec":{"containers":[{"name": "velero", "resources": {"limits":{"cpu": "4", "memory": "8192Mi"}, "requests": {"cpu": "2", "memory": "2048Mi"}}}]}}}}'
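The applied limits can be verified with jsonpath before re-running the backup, e.g.:

```shell
# Check the resources actually applied to the velero deployment and node-agent.
kubectl -n velero get deployment velero \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'
kubectl -n velero get daemonset node-agent \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'
```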
I also set a nightly backup schedule with 90 days of retention, and the first scheduled backup ran successfully last night:
outin@halt:~$ velero schedule get
NAME STATUS CREATED SCHEDULE BACKUP TTL LAST BACKUP SELECTOR PAUSED
full-backup Enabled 2024-03-20 17:12:47 -0700 PDT 0 1 * * * 2160h0m0s 22h ago <none> false
outin@halt:~$ velero backup get
NAME STATUS ERRORS WARNINGS CREATED EXPIRES STORAGE LOCATION SELECTOR
full-backup-20240321010020 Completed 0 33 2024-03-20 18:00:20 -0700 PDT 89d default <none>
...
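Reconstructed from the `velero schedule get` output above, the schedule was presumably created with something along these lines (2160h0m0s being 90 days); treat it as a sketch:

```shell
velero schedule create full-backup --schedule="0 1 * * *" --ttl 2160h0m0s
```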
I set up backup monitoring on optimal-squirrel.nceas.ucsb.edu to alert when Velero fails to complete a backup, or when the nightly backups of k8s-prod and k8s-dev silently stop running.
The backup scripts are in https://github.nceas.ucsb.edu/outin/check_velero_backups
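The core of such a check is presumably along these lines: parse `velero backup get` and flag anything whose status is not Completed. A minimal sketch with sample output inlined so the logic is runnable here; the real script in the repo above may differ:

```shell
# Flag any backup whose STATUS column is not "Completed".
check_backups() {
  awk 'NR > 1 && $2 != "Completed" { print $1 }'
}

# Sample `velero backup get` output, inlined for illustration;
# a real check would pipe the live command into check_backups.
sample='NAME            STATUS            ERRORS   WARNINGS
full-backup-1   PartiallyFailed   7        0
full-backup-4   Completed         0        2'

failed=$(printf '%s\n' "$sample" | check_backups)
if [ -n "$failed" ]; then
  echo "ALERT: failed backups: $failed"
else
  echo "OK: all backups Completed"
fi
```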
Check_MK alerts:
We would like to have both K8s clusters, k8s-prod and k8s-dev, backed up to prevent data loss in the event of hardware failure, human error, malicious actors, etc.
Since we are using Velero in File System Backup mode, the underlying storage does not matter, and this earlier requirements list can be ignored:
libvirt-pool
k8s-pool-ec42-*
k8sdev-pool-ec42-*
cephfs
We are currently backing up the VM images in libvirt-pool. Before moving forward on some other issues, like #1, we should be able to restore a broken cluster from a backup.
Backups: