Lirt / velero-plugin-for-openstack

Openstack Cinder, Manila and Swift plugin for Velero backups
MIT License

Unable to create a backup in openstack - temporary URL/volume in-use issue #104

Closed jguipi closed 7 months ago

jguipi commented 7 months ago

Describe the bug

I'm deploying Velero using the Helm chart, and I'm not able to successfully create a backup. Backups always end up in the PartiallyFailed state.

Steps to reproduce the behavior

  1. Deploy using the Helm chart
  2. Create a backup using:

     velero backup create test4 --include-namespaces=default --default-volumes-to-fs-backup --snapshot-volumes --ttl 30m

Expected behavior

A successful backup is created and uploaded to OpenStack Shared Object Storage.

Used versions

Link to velero or backup log

time="2024-02-18T03:28:42Z" level=info msg="Authentication will be done for cloud main" cmd=/plugins/velero-plugin-for-openstack controller=download-request downloadRequest=velero/test4-4036bc95-faac-472e-ac17-f16b167c98cc logSource="/go/src/github.com/Lirt/velero-plugin-for-openstack/src/utils/auth.go:33" pluginName=velero-plugin-for-openstack
time="2024-02-18T03:28:42Z" level=info msg="Trying to authenticate against OpenStack using environment variables (including application credentials) or using files ~/.config/openstack/clouds.yaml, /etc/openstack/clouds.yaml and ./clouds.yaml" cmd=/plugins/velero-plugin-for-openstack controller=download-request downloadRequest=velero/test4-4036bc95-faac-472e-ac17-f16b167c98cc logSource="/go/src/github.com/Lirt/velero-plugin-for-openstack/src/utils/auth.go:68" pluginName=velero-plugin-for-openstack
time="2024-02-18T03:28:42Z" level=info msg="Authentication against identity endpoint https://identity-3.eu-nl-1.cloud.sap/v3/ was successful" cmd=/plugins/velero-plugin-for-openstack controller=download-request downloadRequest=velero/test4-4036bc95-faac-472e-ac17-f16b167c98cc logSource="/go/src/github.com/Lirt/velero-plugin-for-openstack/src/utils/auth.go:113" pluginName=velero-plugin-for-openstack
time="2024-02-18T03:28:42Z" level=info msg="Successfully created object storage service client" cmd=/plugins/velero-plugin-for-openstack controller=download-request downloadRequest=velero/test4-4036bc95-faac-472e-ac17-f16b167c98cc logSource="/go/src/github.com/Lirt/velero-plugin-for-openstack/src/swift/object_store.go:66" pluginName=velero-plugin-for-openstack region=eu-nl-1
time="2024-02-18T03:28:42Z" level=info msg="ObjectStore.CreateSignedURL called" cmd=/plugins/velero-plugin-for-openstack container=velero-backups-can controller=download-request downloadRequest=velero/test4-4036bc95-faac-472e-ac17-f16b167c98cc logSource="/go/src/github.com/Lirt/velero-plugin-for-openstack/src/swift/object_store.go:252" object=backups/test4/test4-results.gz pluginName=velero-plugin-for-openstack ttl=6e+11
time="2024-02-18T03:28:42Z" level=warning msg="fail to get Backup metadata file's download URL {BackupResults test4}, retry later: rpc error: code = Unknown desc = failed to create temporary URL for \"backups/test4/test4-results.gz\" object in \"velero-backups-can\" container: Unable to obtain the Temp URL key." controller=download-request downloadRequest=velero/test4-4036bc95-faac-472e-ac17-f16b167c98cc logSource="pkg/controller/download_request_controller.go:206"
time="2024-02-18T03:28:43Z" level=error msg="Reconciler error" controller=downloadrequest controllerGroup=velero.io controllerKind=DownloadRequest downloadRequest="{\"name\":\"test4-4036bc95-faac-472e-ac17-f16b167c98cc\",\"namespace\":\"velero\"}" error="rpc error: code = Unknown desc = failed to create temporary URL for \"backups/test4/test4-results.gz\" object in \"velero-backups-can\" container: Unable to obtain the Temp URL key." error.file="/go/src/github.com/vmware-tanzu/velero/pkg/controller/download_request_controller.go:207" error.function="github.com/vmware-tanzu/velero/pkg/controller.(*downloadRequestReconciler).Reconcile" logSource="/go/pkg/mod/github.com/bombsimon/logrusr/v3@v3.0.0/logrusr.go:123" name=test4-4036bc95-faac-472e-ac17-f16b167c98cc namespace=velero reconcileID="\"1041cd96-1e2c-4ad0-823d-2d0c1438b387\""
time="2024-02-18T03:28:43Z" level=info msg="ObjectStore.Init called" cmd=/plugins/velero-plugin-for-openstack config="map[bucket:velero-backups-can cloud:main prefix: region:eu-nl-1]" controller=download-request downloadRequest=velero/test4-4036bc95-faac-472e-ac17-f16b167c98cc logSource="/go/src/github.com/Lirt/velero-plugin-for-openstack/src/swift/object_store.go:38" pluginName=velero-plugin-for-openstack
time="2024-02-18T03:28:43Z" level=info msg="Authentication will be done for cloud main" cmd=/plugins/velero-plugin-for-openstack controller=download-request downloadRequest=velero/test4-4036bc95-faac-472e-ac17-f16b167c98cc logSource="/go/src/github.com/Lirt/velero-plugin-for-openstack/src/utils/auth.go:33" pluginName=velero-plugin-for-openstack
time="2024-02-18T03:28:43Z" level=info msg="Trying to authenticate against OpenStack using environment variables (including application credentials) or using files ~/.config/openstack/clouds.yaml, /etc/openstack/clouds.yaml and ./clouds.yaml" cmd=/plugins/velero-plugin-for-openstack controller=download-request downloadRequest=velero/test4-4036bc95-faac-472e-ac17-f16b167c98cc logSource="/go/src/github.com/Lirt/velero-plugin-for-openstack/src/utils/auth.go:68" pluginName=velero-plugin-for-openstack

My guess is that it's related to that step, but I'm not able to exec into the container to run those commands.

jguipi commented 7 months ago

I ended up finding another way to do the backup, but now I'm unable to create a successful backup because the volume is in-use:

Command: velero backup create test5 --include-namespaces=default

time="2024-02-18T05:02:15Z" level=info msg="Waiting for volumesnapshotcontents snapcontent-70c18c78-6e80-443d-a967-ab24bc97c370 to have snapshot handle. Retrying in 5s" backup=velero/test5 cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/util/util.go:259" pluginName=velero-plugin-for-csi
time="2024-02-18T05:02:15Z" level=warning msg="Volumesnapshotcontent snapcontent-70c18c78-6e80-443d-a967-ab24bc97c370 has error: Failed to check and update snapshot content: failed to take snapshot of the volume c91638ec-6362-4e15-b3cd-e2f957e3e6b5: \"rpc error: code = Internal desc = CreateSnapshot failed with error Bad request with: [POST https://volume-3.eu-nl-1.cloud.sap:443/v3/ddcde3b2cae24ea0a85ad5b608b7ac97/snapshots], error message: {\\\"badRequest\\\": {\\\"code\\\": 400, \\\"message\\\": \\\"Invalid volume: Volume c91638ec-6362-4e15-b3cd-e2f957e3e6b5 status must be available, but current status is: in-use.\\\"}}\"" backup=velero/test5 cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/util/util.go:261" pluginName=velero-plugin-for-csi
time="2024-02-18T05:02:20Z" level=info msg="Waiting for volumesnapshotcontents snapcontent-70c18c78-6e80-443d-a967-ab24bc97c370 to have snapshot handle. Retrying in 5s" backup=velero/test5 cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/util/util.go:259" pluginName=velero-plugin-for-csi
time="2024-02-18T05:02:20Z" level=warning msg="Volumesnapshotcontent snapcontent-70c18c78-6e80-443d-a967-ab24bc97c370 has error: Failed to check and update snapshot content: failed to take snapshot of the volume c91638ec-6362-4e15-b3cd-e2f957e3e6b5: \"rpc error: code = Internal desc = CreateSnapshot failed with error Bad request with: [POST https://volume-3.eu-nl-1.cloud.sap:443/v3/ddcde3b2cae24ea0a85ad5b608b7ac97/snapshots], error message: {\\\"badRequest\\\": {\\\"code\\\": 400, \\\"message\\\": \\\"Invalid volume: Volume c91638ec-6362-4e15-b3cd-e2f957e3e6b5 status must be available, but current status is: in-use.\\\"}}\"" backup=velero/test5 cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/util/util.go:261" pluginName=velero-plugin-for-csi

Lirt commented 7 months ago

Hello,

Thank you for opening the issue.

Regarding the first error, you need to follow the documentation for Swift container setup so that the plugin can successfully create a temporary URL:

time="2024-02-18T03:28:42Z" level=warning msg="fail to get Backup metadata file's download URL {BackupResults test4}, retry later: rpc error: code = Unknown desc = failed to create temporary URL for \"backups/test4/test4-results.gz\" object in \"velero-backups-can\" container: Unable to obtain the Temp URL key." controller=download-request downloadRequest=velero/test4-4036bc95-faac-472e-ac17-f16b167c98cc logSource="pkg/controller/download_request_controller.go:206"

Those swift commands don't need to be executed from inside a container. You can run them locally, but you need to download your OpenStack/Swift credentials, source them, and then run the commands. You also need to install python-swiftclient using pip to get the swift command.
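
To illustrate why the key matters: Swift signs each temporary URL with an HMAC over the request method, the expiry time and the object path, using the Temp URL key stored on the account or container. With no key set there is nothing to sign with, which matches the "Unable to obtain the Temp URL key" message. Below is a minimal Python sketch of that signing scheme, assuming HMAC-SHA1 (some deployments only accept SHA-256 or SHA-512). The function name, the key value and the `AUTH_demo` account in the path are made up for illustration; this is not the plugin's actual code.

```python
import hmac
from hashlib import sha1
from time import time
from typing import Optional


def swift_temp_url(key: str, method: str, path: str,
                   ttl_seconds: int, now: Optional[int] = None) -> str:
    """Return `path` with TempURL query parameters appended.

    `key` must equal the Temp-URL-Key metadata value set on the Swift
    account or container; otherwise Swift rejects the signature.
    """
    issued_at = int(time()) if now is None else int(now)
    expires = issued_at + ttl_seconds
    # Swift verifies an HMAC over "<METHOD>\n<expires>\n<path>".
    hmac_body = f"{method}\n{expires}\n{path}"
    signature = hmac.new(key.encode(), hmac_body.encode(), sha1).hexdigest()
    return f"{path}?temp_url_sig={signature}&temp_url_expires={expires}"


if __name__ == "__main__":
    # Hypothetical account name and key; container and object are from the log above.
    object_path = "/v1/AUTH_demo/velero-backups-can/backups/test4/test4-results.gz"
    print(swift_temp_url("example-key", "GET", object_path, ttl_seconds=600))
```

The 600 seconds here correspond to the `ttl=6e+11` (nanoseconds) shown in the `CreateSignedURL` log line.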

For the in-use volume, the snapshot should work, because the plugin uses --force, as mentioned in the docs:

The snapshots are done using flag --force. The reason is that volumes in state in-use cannot be snapshotted without it (they would need to be detached in advance). In some cases this can make snapshot contents inconsistent!
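
As a rough illustration of what forcing a snapshot means at the API level: the Cinder snapshot-create request accepts a boolean `force` field, and without it an attached volume can be rejected with the "status must be available" error seen in the log. The Python sketch below only builds the JSON body for such a request. The helper name and the snapshot name are hypothetical, and it does not reproduce how the plugin itself issues the call.

```python
import json


def snapshot_request_body(volume_id: str, name: str, force: bool = True) -> str:
    """Build the JSON body for POST /v3/{project_id}/snapshots.

    With force=True, Cinder is asked to snapshot the volume even while it
    is attached (status "in-use"); the result may be inconsistent.
    """
    body = {
        "snapshot": {
            "volume_id": volume_id,
            "name": name,
            "force": force,
        }
    }
    return json.dumps(body)


if __name__ == "__main__":
    # Volume ID taken from the log above; the snapshot name is made up.
    print(snapshot_request_body("c91638ec-6362-4e15-b3cd-e2f957e3e6b5", "example-snapshot"))
```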

Can you post the definition of your BackupStorageLocation? From the log message you pasted, it seems you are taking the snapshot using the CSI driver, not this plugin.

backup=velero/test5 cmd=/plugins/velero-plugin-for-csi
logSource="/go/src/velero-plugin-for-csi/internal/util/util.go:261"
pluginName=velero-plugin-for-csi
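
One way to check which plugin produced each message is to split the `key=value` log lines and count the `pluginName` field. The following is a small Python sketch of that idea, assuming the logs use the quoted `key="value"` style shown in this thread. The function names and the `(none)` label are my own choices, not part of Velero.

```python
import shlex
from collections import Counter
from typing import Dict, Iterable


def parse_logfmt(line: str) -> Dict[str, str]:
    """Split one key=value log line into a dict, honoring quoted values."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that carry no "="
            fields[key] = value
    return fields


def plugins_seen(lines: Iterable[str]) -> Counter:
    """Count log lines per pluginName; lines without the field count as "(none)"."""
    return Counter(parse_logfmt(line).get("pluginName", "(none)") for line in lines)


if __name__ == "__main__":
    sample = ('time="2024-02-18T05:02:15Z" level=info '
              'msg="Waiting for snapshot handle. Retrying in 5s" '
              'backup=velero/test5 cmd=/plugins/velero-plugin-for-csi '
              'pluginName=velero-plugin-for-csi')
    print(plugins_seen([sample]))
```

Run over a full backup log, lines tagged `velero-plugin-for-csi` indicate that the CSI plugin, not the Cinder volume snapshotter from this repository, handled the snapshot.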

jguipi commented 7 months ago

Thank you for the fast response.

Here's the full value.yaml file that I applied:

##
## Configuration settings related to Velero installation namespace
##

# Labels settings in namespace
namespace:
  labels: {}
    # Enforce Pod Security Standards with Namespace Labels
    # https://kubernetes.io/docs/tasks/configure-pod-container/enforce-standards-namespace-labels/
    # - key: pod-security.kubernetes.io/enforce
    #   value: privileged
    # - key: pod-security.kubernetes.io/enforce-version
    #   value: latest
    # - key: pod-security.kubernetes.io/audit
    #   value: privileged
    # - key: pod-security.kubernetes.io/audit-version
    #   value: latest
    # - key: pod-security.kubernetes.io/warn
    #   value: privileged
    # - key: pod-security.kubernetes.io/warn-version
    #   value: latest

##
## End of namespace-related settings.
##

##
## Configuration settings that directly affect the Velero deployment YAML.
##

# Details of the container image to use in the Velero deployment & daemonset (if
# enabling node-agent). Required.
image:
  repository: velero/velero
  tag: v1.13.0
  # Digest value example: sha256:d238835e151cec91c6a811fe3a89a66d3231d9f64d09e5f3c49552672d271f38.
  # If used, it will take precedence over the image.tag.
  # digest:
  pullPolicy: IfNotPresent
  # One or more secrets to be used when pulling images
  imagePullSecrets: []
  # - registrySecretName

nameOverride: ""
fullnameOverride: ""

# Annotations to add to the Velero deployment's. Optional.
#
# If you are using reloader use the following annotation with your VELERO_SECRET_NAME
annotations: {}
# secret.reloader.stakater.com/reload: "<VELERO_SECRET_NAME>"

# Annotations to add to secret
secretAnnotations: {}

# Labels to add to the Velero deployment's. Optional.
labels: {}

# Annotations to add to the Velero deployment's pod template. Optional.
#
# If using kube2iam or kiam, use the following annotation with your AWS_ACCOUNT_ID
# and VELERO_ROLE_NAME filled in:
podAnnotations: {}
  #  iam.amazonaws.com/role: "arn:aws:iam::<AWS_ACCOUNT_ID>:role/<VELERO_ROLE_NAME>"

# Additional pod labels for Velero deployment's template. Optional
# ref: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
podLabels: {}

# Number of old history to retain to allow rollback (If not set, default Kubernetes value is set to 10)
# revisionHistoryLimit: 1

# Resource requests/limits to specify for the Velero deployment.
# https://velero.io/docs/v1.6/customize-installation/#customize-resource-requests-and-limits
resources:
  requests:
    cpu: 500m
    memory: 128Mi
  limits:
    cpu: 1000m
    memory: 512Mi

# Resource requests/limits to specify for the upgradeCRDs job pod. Need to be adjusted by user accordingly.
upgradeJobResources: {}
# requests:
#     cpu: 50m
#     memory: 128Mi
#   limits:
#     cpu: 100m
#     memory: 256Mi
upgradeCRDsJob:
  # Extra volumes for the Upgrade CRDs Job. Optional.
  extraVolumes: []
  # Extra volumeMounts for the Upgrade CRDs Job. Optional.
  extraVolumeMounts: []
  # Extra key/value pairs to be used as environment variables. Optional.
  extraEnvVars: {}

# Configure the dnsPolicy of the Velero deployment
# See: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-policy
dnsPolicy: ClusterFirst

# Init containers to add to the Velero deployment's pod spec. At least one plugin provider image is required.
# If the value is a string then it is evaluated as a template.
initContainers:
  - name: velero-plugin-for-csi
    image: velero/velero-plugin-for-csi:v0.7.0
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - mountPath: /target
        name: plugins
  - name: velero-plugin-for-openstack
    image: lirt/velero-plugin-for-openstack:v0.6.0
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - mountPath: /target
        name: plugins

# SecurityContext to use for the Velero deployment. Optional.
# Set fsGroup for `AWS IAM Roles for Service Accounts`
# see more information at: https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html
podSecurityContext: {}
  # fsGroup: 1337

# Container Level Security Context for the 'velero' container of the Velero deployment. Optional.
# See: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#set-the-security-context-for-a-container
containerSecurityContext: {}
  # allowPrivilegeEscalation: false
  # capabilities:
  #   drop: ["ALL"]
  #   add: []
  # readOnlyRootFilesystem: true

# Container Lifecycle Hooks to use for the Velero deployment. Optional.
lifecycle: {}

# Pod priority class name to use for the Velero deployment. Optional.
priorityClassName: ""

# The number of seconds to allow for graceful termination of the pod. Optional.
terminationGracePeriodSeconds: 3600

# Liveness probe of the pod
livenessProbe:
  httpGet:
    path: /metrics
    port: http-monitoring
    scheme: HTTP
  initialDelaySeconds: 10
  periodSeconds: 30
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 5

# Readiness probe of the pod
readinessProbe:
  httpGet:
    path: /metrics
    port: http-monitoring
    scheme: HTTP
  initialDelaySeconds: 10
  periodSeconds: 30
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 5

# Tolerations to use for the Velero deployment. Optional.
tolerations: []

# Affinity to use for the Velero deployment. Optional.
affinity: {}

# Node selector to use for the Velero deployment. Optional.
nodeSelector: {}

# DNS configuration to use for the Velero deployment. Optional.
dnsConfig: {}

# Extra volumes for the Velero deployment. Optional.
extraVolumes: 
- name: cloud-config-velero
  secret:
    secretName: velero-credentials
    items:
    - key: clouds.yaml
      path: clouds.yaml

# Extra volumeMounts for the Velero deployment. Optional.
extraVolumeMounts:
- name: cloud-config-velero
  mountPath: /etc/openstack/clouds.yaml
  readOnly: true
  subPath: clouds.yaml

# Extra K8s manifests to deploy
extraObjects: []
  # - apiVersion: secrets-store.csi.x-k8s.io/v1
  #   kind: SecretProviderClass
  #   metadata:
  #     name: velero-secrets-store
  #   spec:
  #     provider: aws
  #     parameters:
  #       objects: |
  #         - objectName: "velero"
  #           objectType: "secretsmanager"
  #           jmesPath:
  #               - path: "access_key"
  #                 objectAlias: "access_key"
  #               - path: "secret_key"
  #                 objectAlias: "secret_key"
  #               - path: "region"
  #                 objectAlias: "region"
  #     secretObjects:
  #       - data:
  #         - key: access_key
  #           objectName: client-id
  #         - key: client-secret
  #           objectName: client-secret
  #         - key: region
  #           objectName: client-secret
  #         secretName: velero-secrets-store
  #         type: Opaque

# Settings for Velero's prometheus metrics. Enabled by default.
metrics:
  enabled: true
  scrapeInterval: 30s
  scrapeTimeout: 10s

  # service metadata if metrics are enabled
  service:
    annotations: {}
    labels: {}

  # Pod annotations for Prometheus
  podAnnotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8085"
    prometheus.io/path: "/metrics"

  serviceMonitor:
    autodetect: true
    enabled: false
    annotations: {}
    additionalLabels: {}

    # metrics.serviceMonitor.metricRelabelings Specify Metric Relabelings to add to the scrape endpoint
    # ref: https://github.com/coreos/prometheus-operator/blob/master/Documentation/api.md#relabelconfig
    # metricRelabelings: []
    # metrics.serviceMonitor.relabelings [array] Prometheus relabeling rules
    # relabelings: []
    # ServiceMonitor namespace. Default to Velero namespace.
    # namespace:
    # ServiceMonitor connection scheme. Defaults to HTTP.
    # scheme: ""
    # ServiceMonitor connection tlsConfig. Defaults to {}.
    # tlsConfig: {}
  nodeAgentPodMonitor:
    autodetect: true
    enabled: false
    annotations: {}
    additionalLabels: {}
    # ServiceMonitor namespace. Default to Velero namespace.
    # namespace:
    # ServiceMonitor connection scheme. Defaults to HTTP.
    # scheme: ""
    # ServiceMonitor connection tlsConfig. Defaults to {}.
    # tlsConfig: {}

  prometheusRule:
    autodetect: true
    enabled: false
    # Additional labels to add to deployed PrometheusRule
    additionalLabels: {}
    # PrometheusRule namespace. Defaults to Velero namespace.
    # namespace: ""
    # Rules to be deployed
    spec: []
    # - alert: VeleroBackupPartialFailures
    #   annotations:
    #     message: Velero backup {{ $labels.schedule }} has {{ $value | humanizePercentage }} partialy failed backups.
    #   expr: |-
    #     velero_backup_partial_failure_total{schedule!=""} / velero_backup_attempt_total{schedule!=""} > 0.25
    #   for: 15m
    #   labels:
    #     severity: warning
    # - alert: VeleroBackupFailures
    #   annotations:
    #     message: Velero backup {{ $labels.schedule }} has {{ $value | humanizePercentage }} failed backups.
    #   expr: |-
    #     velero_backup_failure_total{schedule!=""} / velero_backup_attempt_total{schedule!=""} > 0.25
    #   for: 15m
    #   labels:
    #     severity: warning

kubectl:
  image:
    repository: docker.io/bitnami/kubectl
    # Digest value example: sha256:d238835e151cec91c6a811fe3a89a66d3231d9f64d09e5f3c49552672d271f38.
    # If used, it will take precedence over the kubectl.image.tag.
    # digest:
    # kubectl image tag. If used, it will take precedence over the cluster Kubernetes version.
    # tag: 1.16.15
  # Container Level Security Context for the 'kubectl' container of the crd jobs. Optional.
  # See: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#set-the-security-context-for-a-container
  containerSecurityContext: {}
  # Resource requests/limits to specify for the upgrade/cleanup job. Optional
  resources: {}
  # Annotations to set for the upgrade/cleanup job. Optional.
  annotations: {}
  # Labels to set for the upgrade/cleanup job. Optional.
  labels: {}

# This job upgrades the CRDs.
upgradeCRDs: true

# This job is meant primarily for cleaning up CRDs on CI systems.
# Using this on production systems, especially those that have multiple releases of Velero, will be destructive.
cleanUpCRDs: false

##
## End of deployment-related settings.
##

##
## Parameters for the `default` BackupStorageLocation and VolumeSnapshotLocation,
## and additional server settings.
##
configuration:
  # Parameters for the BackupStorageLocation(s). Configure multiple by adding other element(s) to the backupStorageLocation slice.
  # See https://velero.io/docs/v1.6/api-types/backupstoragelocation/
  backupStorageLocation:
    # name is the name of the backup storage location where backups should be stored. If a name is not provided,
    # a backup storage location will be created with the name "default". Optional.
    # provider is the name for the backup storage location provider.
  - name: 
    provider: community.openstack.org/openstack
    # bucket is the name of the bucket to store backups in. Required.
    bucket: velero-backups-can
    # caCert defines a base64 encoded CA bundle to use when verifying TLS connections to the provider. Optional.
    caCert:
    # prefix is the directory under which all Velero data should be stored within the bucket. Optional.
    prefix:
    # default indicates this location is the default backup storage location. Optional.
    default: true
    # validationFrequency defines how frequently Velero should validate the object storage. Optional.
    validationFrequency:
    # accessMode determines if velero can write to this backup storage location. Optional.
    # default to ReadWrite, ReadOnly is used during migrations and restores.
    accessMode: ReadWrite
    credential:
      # name of the secret used by this backupStorageLocation.
      # name: velero-credentials
      # name of key that contains the secret data to be used.
      # key: clouds
      # auth_url: "https://identity.com/v3"
    # Additional provider-specific configuration. See link above
    # for details of required/optional fields for your provider.
    config:
     cloud: main
     auth_url: "https://identity.com/v3"
     region: "eu-nl-1"
    #  resticRepoPrefix: swift:containers:/restic
    #  domain-name: "cp"
    #  tenant-name: "gardener-canary-team"
    #  s3ForcePathStyle:
    #  s3Url:
    #  kmsKeyId:
    #  resourceGroup:
    #  The ID of the subscription containing the storage account, if different from the cluster’s subscription. (Azure only)
    #  subscriptionId:
    #  storageAccount:
    #  publicUrl:
    #  Name of the GCP service account to use for this backup storage location. Specify the
    #  service account here if you want to use workload identity instead of providing the key file.(GCP only)
    #  serviceAccount:
    #  Option to skip certificate validation or not if insecureSkipTLSVerify is set to be true, the client side should set the
    #  flag. For Velero client Command like velero backup describe, velero backup logs needs to add the flag --insecure-skip-tls-verify
    #  insecureSkipTLSVerify:

  # Parameters for the VolumeSnapshotLocation(s). Configure multiple by adding other element(s) to the volumeSnapshotLocation slice.
  # See https://velero.io/docs/v1.6/api-types/volumesnapshotlocation/
  volumeSnapshotLocation:
    # name is the name of the volume snapshot location where snapshots are being taken. Required.
  - name: cinder
    # optional snapshot method:
    # * "snapshot" is a default cinder snapshot method
    # * "clone" is for a full volume clone instead of a snapshot allowing the
    # source volume to be deleted
    # * "backup" is for a full volume backup uploaded to a Cinder backup
    # allowing the source volume to be deleted (EXPERIMENTAL)
    # * "image" is for a full volume backup uploaded to a Glance image
    # allowing the source volume to be deleted (EXPERIMENTAL)
    # requires the "enable_force_upload" Cinder option to be enabled on the server
    method: snapshot
    volumeTimeout: 5m
    snapshotTimeout: 5m
    cloneTimeout: 5m
    backupTimeout: 5m
    imageTimeout: 5m
    # provider is the name for the volume snapshot provider.
    provider: community.openstack.org/openstack-cinder
    credential:
      # name of the secret used by this volumeSnapshotLocation.
      name: velero-credentials2
      # name of key that contains the secret data to be used.
      key: clouds
    # Additional provider-specific configuration. See link above
    # for details of required/optional fields for your provider.
    config: 
     auth_url: "https://identity.com/v3"
     region: eu-nl-1 
     domain-name: "cp"
     tenant-name: "gardener-canary-team"
  #    apiTimeout:
  #    resourceGroup:
  #    The ID of the subscription where volume snapshots should be stored, if different from the cluster’s subscription. If specified, also requires `configuration.volumeSnapshotLocation.config.resourceGroup`to be set. (Azure only)
  #    subscriptionId:
  #    incremental:
  #    snapshotLocation:
  #    project:
  - name: manila
    provider: community.openstack.org/openstack-manila
    credential:
      # name of the secret used by this volumeSnapshotLocation.
      name: velero-credentials2
      # name of key that contains the secret data to be used.
      key: clouds
    # Additional provider-specific configuration. See link above
    # for details of required/optional fields for your provider.
    config:
      # optional snapshot method:
      # * "snapshot" is a default manila snapshot method
      # * "clone" is for a full share clone instead of a snapshot allowing the
      # source share to be deleted
      method: snapshot
      region: eu-nl-1
      auth_url: "https://identity.com/v3"
      domain-name: "cp"
      tenant-name: "gardener-canary-team"      
      # optional Manila CSI driver name (default: nfs.manila.csi.openstack.org)
      driver: ceph.manila.csi.openstack.org
      # optional resource readiness timeouts in Golang time format: https://pkg.go.dev/time#ParseDuration
      # (default: 5m)
      shareTimeout: 5m
      snapshotTimeout: 5m
      cloneTimeout: 5m
      replicaTimeout: 5m
      # ensures that the Manila share/snapshot/replica is removed
      # this is a workaround to the https://bugs.launchpad.net/manila/+bug/2025641 and
      # https://bugs.launchpad.net/manila/+bug/1960239 bugs
      # if the share/snapshot/replica is in "error_deleting" status, the plugin will try
      # to reset its status (usually extra admin permissions are required) and delete it
      # again within the defined "cloneTimeout", "snapshotTimeout" or "replicaTimeout"
      ensureDeleted: "true"
      # a delay to wait between delete/reset actions when "ensureDeleted" is enabled
      ensureDeletedDelay: 10s
      # deletes all dependent share resources (i.e. snapshots, replicas) before deleting
      # the clone share (works only, when a snapshot method is set to clone)
      cascadeDelete: "true"
      # enforces availability zone checks when the availability zone of a
      # snapshot/share differs from the Velero metadata
      enforceAZ: "true"
  # These are server-level settings passed as CLI flags to the `velero server` command. Velero
  # uses default values if they're not passed in, so they only need to be explicitly specified
  # here if using a non-default value. The `velero server` default values are shown in the
  # comments below.
  # --------------------
  # `velero server` default: restic
  uploaderType:
  # `velero server` default: 1m
  backupSyncPeriod:
  # `velero server` default: 4h
  fsBackupTimeout:
  # `velero server` default: 30
  clientBurst:
  # `velero server` default: 500
  clientPageSize:
  # `velero server` default: 20.0
  clientQPS:
  # Name of the default backup storage location. Default: default
  defaultBackupStorageLocation:
  # How long to wait by default before backups can be garbage collected. Default: 72h
  defaultBackupTTL:
  # Name of the default volume snapshot location.
  defaultVolumeSnapshotLocations:
  # `velero server` default: empty
  disableControllers:
  # `velero server` default: 1h
  garbageCollectionFrequency:
  # Set log-format for Velero pod. Default: text. Other option: json.
  logFormat:
  # Set log-level for Velero pod. Default: info. Other options: debug, warning, error, fatal, panic.
  logLevel:
  # The address to expose prometheus metrics. Default: :8085
  metricsAddress:
  # Directory containing Velero plugins. Default: /plugins
  pluginDir:
  # The address to expose the pprof profiler. Default: localhost:6060
  profilerAddress:
  # `velero server` default: false
  restoreOnlyMode:
  # `velero server` default: customresourcedefinitions,namespaces,storageclasses,volumesnapshotclass.snapshot.storage.k8s.io,volumesnapshotcontents.snapshot.storage.k8s.io,volumesnapshots.snapshot.storage.k8s.io,persistentvolumes,persistentvolumeclaims,secrets,configmaps,serviceaccounts,limitranges,pods,replicasets.apps,clusterclasses.cluster.x-k8s.io,clusters.cluster.x-k8s.io,clusterresourcesets.addons.cluster.x-k8s.io
  restoreResourcePriorities:
  # `velero server` default: 1m
  storeValidationFrequency:
  # How long to wait on persistent volumes and namespaces to terminate during a restore before timing out. Default: 10m
  terminatingResourceTimeout:
  # Bool flag to configure Velero server to move data by default for all snapshots supporting data movement. Default: false
  defaultSnapshotMoveData:
  # Comma separated list of velero feature flags. default: empty
  # features: EnableCSI
  features: EnableCSI
  # `velero server` default: velero
  namespace:

  # additional key/value pairs to be used as environment variables such as "AWS_CLUSTER_NAME: 'yourcluster.domain.tld'"
  extraEnvVars: {}

  # Set to true to back up all pod volumes with file system backup without having to annotate the pod. Default: false.
  defaultVolumesToFsBackup:

  # How often repository maintenance is run for repositories by default.
  defaultRepoMaintainFrequency:

##
## End of backup/snapshot location settings.
##

##
## Settings for additional Velero resources.
##

rbac:
  # Whether to create the Velero role and role binding to give all permissions to the namespace to Velero.
  create: true
  # Whether to create the cluster role binding to give administrator permissions to Velero
  clusterAdministrator: true
  # Name of the ClusterRole.
  clusterAdministratorName: cluster-admin

# Information about the Kubernetes service account Velero uses.
serviceAccount:
  server:
    create: true
    name:
    annotations:
    labels:

# Info about the secret to be used by the Velero deployment, which
# should contain credentials for the cloud provider IAM account you've
# set up for Velero.
credentials:
  # Whether a secret should be used. Set to false if, for examples:
  # - using kube2iam or kiam to provide AWS IAM credentials instead of providing the key file. (AWS only)
  # - using workload identity instead of providing the key file. (Azure/GCP only)
  useSecret: true
  # Name of the secret to create if `useSecret` is true and `existingSecret` is empty
  name:
  # Name of a pre-existing secret (if any) in the Velero namespace
  # that should be used to get IAM account credentials. Optional.
  existingSecret: velero-credentials
  # Data to be stored in the Velero secret, if `useSecret` is true and `existingSecret` is empty.
  # As of the current Velero release, Velero only uses one secret key/value at a time.
  # The key must be named `cloud`, and the value corresponds to the entire content of your IAM credentials file.
  # Note that the format will be different for different providers, please check their documentation.
  # Here is a list of documentation for plugins maintained by the Velero team:
  # [AWS] https://github.com/vmware-tanzu/velero-plugin-for-aws/blob/main/README.md
  # [GCP] https://github.com/vmware-tanzu/velero-plugin-for-gcp/blob/main/README.md
  # [Azure] https://github.com/vmware-tanzu/velero-plugin-for-microsoft-azure/blob/main/README.md
  secretContents: 
  # additional key/value pairs to be used as environment variables such as "DIGITALOCEAN_TOKEN: <your-key>". Values will be stored in the secret.
  extraEnvVars: {}
  # Name of a pre-existing secret (if any) in the Velero namespace
  # that will be used to load environment variables into velero and node-agent.
  # Secret should be in format - https://kubernetes.io/docs/concepts/configuration/secret/#use-case-as-container-environment-variables
  extraSecretRef: ""

# Whether to create backupstoragelocation crd, if false => do not create a default backup location
backupsEnabled: true
# Whether to create volumesnapshotlocation crd, if false => disable snapshot feature
snapshotsEnabled: true

# Whether to deploy the node-agent daemonset.
deployNodeAgent: true

nodeAgent:
  podVolumePath: /var/lib/kubelet/pods
  privileged: false
  # Pod priority class name to use for the node-agent daemonset. Optional.
  priorityClassName: ""
  # Resource requests/limits to specify for the node-agent daemonset deployment. Optional.
  # https://velero.io/docs/v1.6/customize-installation/#customize-resource-requests-and-limits
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 1000m
      memory: 1024Mi

  # Tolerations to use for the node-agent daemonset. Optional.
  tolerations: []

  # Annotations to set for the node-agent daemonset. Optional.
  annotations: {}

  # labels to set for the node-agent daemonset. Optional.
  labels: {}

  # will map /scratch to emptyDir. Set to false and specify your own volume
  # via extraVolumes and extraVolumeMounts that maps to /scratch
  # if you don't want to use emptyDir.
  useScratchEmptyDir: true

  # Extra volumes for the node-agent daemonset. Optional.
  extraVolumes: []

  # Extra volumeMounts for the node-agent daemonset. Optional.
  extraVolumeMounts: []

  # Key/value pairs to be used as environment variables for the node-agent daemonset. Optional.
  extraEnvVars: {}

  # Configure the dnsPolicy of the node-agent daemonset
  # See: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-policy
  dnsPolicy: ClusterFirst

  # SecurityContext to use for the Velero deployment. Optional.
  # Set fsGroup for `AWS IAM Roles for Service Accounts`
  # See more information at: https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html
  podSecurityContext:
    runAsUser: 0
    # fsGroup: 1337

  # Container Level Security Context for the 'node-agent' container of the node-agent daemonset. Optional.
  # See: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#set-the-security-context-for-a-container
  containerSecurityContext: {}

  # Container Lifecycle Hooks to use for the node-agent daemonset. Optional.
  lifecycle: {}

  # Node selector to use for the node-agent daemonset. Optional.
  nodeSelector: {}

  # Affinity to use with node-agent daemonset. Optional.
  affinity: {}

  # DNS configuration to use for the node-agent daemonset. Optional.
  dnsConfig: {}

# Backup schedules to create.
# Eg:
# schedules:
#   mybackup:
#     disabled: false
#     labels:
#       myenv: foo
#     annotations:
#       myenv: foo
#     schedule: "0 0 * * *"
#     useOwnerReferencesInBackup: false
#     template:
#       ttl: "240h"
#       storageLocation: default
#       includedNamespaces:
#       - foo
schedules: {}

# Velero ConfigMaps.
# Eg:
# configMaps:
#   # See: https://velero.io/docs/v1.11/file-system-backup/
#   fs-restore-action-config:
#     labels:
#       velero.io/plugin-config: ""
#       velero.io/pod-volume-restore: RestoreItemAction
#     data:
#       image: velero/velero-restore-helper:v1.10.2
#       cpuRequest: 200m
#       memRequest: 128Mi
#       cpuLimit: 200m
#       memLimit: 128Mi
#       secCtx: |
#         capabilities:
#           drop:
#           - ALL
#           add: []
#         allowPrivilegeEscalation: false
#         readOnlyRootFilesystem: true
#         runAsUser: 1001
#         runAsGroup: 999
configMaps: {}

##
## End of additional Velero resource settings.
##

install command

helm install velero vmware-tanzu/velero  -f values.yaml
jguipi commented 7 months ago

Extra information:

image
Lirt commented 7 months ago

Hello,

The config looks correct. You first need to apply the TMP URL configuration. Please get back to me after you have done that so we can check whether the issue is solved.

Lirt commented 7 months ago

Secondly, the Backup or Schedule has to reference storageLocation: default (or the name of your BackupStorageLocation).

jguipi commented 7 months ago

You might need to update your Swift command.

This one returns an error on macOS:

SWIFT_TMP_URL_KEY=$(dd if=/dev/urandom | tr -dc A-Za-z0-9 | head -c 40)
tr: Illegal byte sequence

An alternative that I found is:

SWIFT_TMP_URL_KEY=$(dd if=/dev/urandom | LC_ALL=C tr -dc A-Za-z0-9 | head -c 40)

I ran the command to do the Swift Container Setup and it went well.
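
A minimal sketch of a key generator that should behave the same on macOS and Linux. The function name `gen_tmp_url_key` and the container name `my-container` are placeholders, not names taken from the plugin documentation. The commented `swift post` line assumes the Swift CLI is installed and already authenticated.

```shell
#!/bin/sh
# Generate a random alphanumeric key for Swift temporary URLs.
# LC_ALL=C makes tr treat its input as raw bytes, which avoids
# "tr: Illegal byte sequence" on macOS.
gen_tmp_url_key() {
  length="${1:-40}"   # 40 matches the command quoted above
  # Read a bounded chunk of random bytes so the pipeline ends on its own
  # instead of relying on head closing an endless stream.
  head -c 1024 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9' | head -c "$length"
}

SWIFT_TMP_URL_KEY="$(gen_tmp_url_key 40)"
echo "generated key of length ${#SWIFT_TMP_URL_KEY}"

# Assumption: applying the key to a container (placeholder name) would look like
# swift post -m "Temp-URL-Key:${SWIFT_TMP_URL_KEY}" my-container
```

Reading 1024 random bytes leaves far more than 40 alphanumeric characters after filtering in practice, but the function makes no hard guarantee for much longer keys.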

I set up the BackupStorageLocation as you mentioned, but now the error I'm getting seems to come from backuprepositories.velero.io.

logs from Velero pod

time="2024-02-19T15:18:45Z" level=info msg="pod default/my-pod has volumes to backup: [my-volume]" backup=velero/test223 logSource="pkg/podvolume/backupper.go:172" name=my-pod namespace=default resource=pods
time="2024-02-19T15:18:45Z" level=info msg="No repository found, creating one" backupLocation=default logSource="pkg/repository/ensurer.go:89" repositoryType=restic volumeNamespace=default
time="2024-02-19T15:18:45Z" level=info msg="Initializing backup repository" backupRepo=velero/default-default-restic-wlv7b logSource="pkg/controller/backup_repository_controller.go:216"
time="2024-02-19T15:18:45Z" level=info msg="Set matainenance according to repository suggestion" frequency=168h0m0s logSource="pkg/controller/backup_repository_controller.go:263"
time="2024-02-19T15:18:45Z" level=error msg="Restic command fail with ExitCode: 1. Process ID is 297, Exit error is: exit status 1" logSource="pkg/util/exec/exec.go:66"
time="2024-02-19T15:18:45Z" level=error msg="Error checking repository for stale locks" backupRepo=velero/default-default-restic-wlv7b error="error running command=restic unlock --repo= --password-file=/tmp/credentials/velero/velero-repo-credentials-repository-password --cache-dir=/scratch/.cache/restic, stdout=, stderr=Fatal: Please specify repository location (-r or --repository-file)\n: exit status 1" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/repository/restic/repository.go:123" error.function="github.com/vmware-tanzu/velero/pkg/repository/restic.(*RepositoryService).exec" logSource="pkg/controller/backup_repository_controller.go:184"
time="2024-02-19T15:18:45Z" level=info msg="Checking backup repository for readiness" backupRepo=velero/default-default-restic-wlv7b logSource="pkg/controller/backup_repository_controller.go:307"
time="2024-02-19T15:18:46Z" level=info msg="1 errors encountered backup up item" backup=velero/test223 logSource="pkg/backup/backup.go:457" name=my-pod
time="2024-02-19T15:18:46Z" level=error msg="Error backing up item" backup=velero/test223 error="failed to wait BackupRepository: backup repository is not ready: error to get identifier for repo default-default-restic-wlv7b: invalid backend type community.openstack.org/openstack, provider community.openstack.org/openstack" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/repository/backup_repo_op.go:83" error.function=github.com/vmware-tanzu/velero/pkg/repository.GetBackupRepository logSource="pkg/backup/backup.go:461" name=my-pod

Thank you again for your help !

Lirt commented 7 months ago

Regarding this error, it looks like the Velero code only allows certain providers to do file-system backup (FSB). It's configured here in the code. For this I have to redirect you to the Velero FSB documentation, as restic and other file-system-backup providers are outside the scope of this plugin.

time="2024-02-19T15:18:46Z" 
level=error 
msg="Error backing up item" backup=velero/test223 
error="failed to wait BackupRepository: backup repository is not ready: 
             error to get identifier for repo default-default-restic-wlv7b: 
             invalid backend type community.openstack.org/openstack, 
             provider community.openstack.org/openstack" 
error.file="/go/src/github.com/vmware-tanzu/velero/pkg/repository/backup_repo_op.go:83"
error.function=github.com/vmware-tanzu/velero/pkg/repository.GetBackupRepository 
logSource="pkg/backup/backup.go:461" name=my-pod

Besides that, there is a restic error that you need to solve, but I'm not sure whether solving it will fix the error above.

time="2024-02-19T15:18:45Z" 
level=error 
msg="Error checking repository for stale locks" 
backupRepo=velero/default-default-restic-wlv7b 
error="error running 
             command=restic unlock \
                     --repo= \
                     --password-file=/tmp/credentials/velero/velero-repo-credentials-repository-password \
                     --cache-dir=/scratch/.cache/restic, 
stdout=, 
stderr=Fatal: Please specify repository location (-r or --repository-file)\n: 
             exit status 1" 
             error.file="/go/src/github.com/vmware-tanzu/velero/pkg/repository/restic/repository.go:123" 
             error.function="github.com/vmware-tanzu/velero/pkg/repository/restic.(*RepositoryService).exec" logSource="pkg/controller/backup_repository_controller.go:184"

If you don't want to do file-system-backup, you can disable restic and snapshots will work fine.
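
The provider restriction described above can be sketched as a pre-flight check to run before requesting a file-system backup. This is only an illustration: the function name `fsb_provider_supported` is hypothetical, and the accepted provider list (`velero.io/aws`, `velero.io/azure`, `velero.io/gcp`, `velero.io/fs`) is an assumption recalled from Velero's source, not something confirmed in this thread. It also ignores any fallback Velero may apply based on the BackupStorageLocation config. The only case the logs above confirm is that `community.openstack.org/openstack` is rejected.

```shell
#!/bin/sh
# Return 0 if a BackupStorageLocation provider is in the set this sketch
# ASSUMES Velero accepts as a restic backend, 1 otherwise.
# The accepted set may differ between Velero releases; verify it for yours.
fsb_provider_supported() {
  provider="$1"
  # Assumed normalization: bare names such as "aws" become "velero.io/aws".
  case "$provider" in
    */*) ;;
    *) provider="velero.io/$provider" ;;
  esac
  case "$provider" in
    velero.io/aws|velero.io/azure|velero.io/gcp|velero.io/fs) return 0 ;;
    *) return 1 ;;
  esac
}

for p in aws community.openstack.org/openstack; do
  if fsb_provider_supported "$p"; then
    echo "$p: file-system backup should be possible"
  else
    echo "$p: expect 'invalid backend type' from file-system backup"
  fi
done
```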

jguipi commented 7 months ago

looks like Velero code only allows certain providers to do filesystem backup

Nice catch.

Unfortunately, even when I disable restic and run a normal velero backup create test3 --include-namespaces=default, I get the same error as before, saying that the volume is in use.

time="2024-02-19T17:22:39Z" level=info msg="Waiting for volumesnapshotcontents snapcontent-c9c3c980-d38b-46c5-91ea-be3976aa035b to have snapshot handle. Retrying in 5s" backup=velero/test3 cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/util/util.go:259" pluginName=velero-plugin-for-csi
time="2024-02-19T17:22:39Z" level=warning msg="Volumesnapshotcontent snapcontent-c9c3c980-d38b-46c5-91ea-be3976aa035b has error: Failed to check and update snapshot content: failed to take snapshot of the volume c91638ec-6362-4e15-b3cd-e2f957e3e6b5: \"rpc error: code = Internal desc = CreateSnapshot failed with error Bad request with: [POST https://volume-3.eu-nl-1.cloud.sap:443/v3/ddcde3b2cae24ea0a85ad5b608b7ac97/snapshots], error message: {\\\"badRequest\\\": {\\\"code\\\": 400, \\\"message\\\": \\\"Invalid volume: Volume c91638ec-6362-4e15-b3cd-e2f957e3e6b5 status must be available, but current status is: in-use.\\\"}}\"" backup=velero/test3 cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/util/util.go:261" pluginName=velero-plugin-for-csi

Having previously achieved success backing up everything in AWS S3, I think returning to that option would provide ample support for various backup requirements.

Thank you for your assistance. If no other solutions address the ongoing "in use" issue, I suggest closing the ticket.

Lirt commented 7 months ago

:smile: This is still not a log from this plugin. You can see that from pluginName=velero-plugin-for-csi in the log line. Please read the error logs carefully, or you will be deceiving yourself!
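
One way to make the `pluginName` field mentioned above stand out is to count error and warning lines per plugin. This is a rough sketch: the function name `errors_by_plugin` and the `velero-core` label for lines without a `pluginName` field are invented for illustration, and the sample lines are abbreviated from the logs in this thread.

```shell
#!/bin/sh
# Count error/warning lines per plugin in Velero's logfmt output.
# Lines without a pluginName field are attributed to "velero-core"
# (a label made up for this sketch).
errors_by_plugin() {
  awk '
    /level=(error|warning)/ {
      plugin = "velero-core"
      if (match($0, /pluginName=[^ ]+/)) {
        # Skip the 11 characters of "pluginName=" to keep only the value.
        plugin = substr($0, RSTART + 11, RLENGTH - 11)
      }
      count[plugin]++
    }
    END { for (p in count) printf "%d %s\n", count[p], p }
  ' | sort -rn
}

# Demo with abbreviated lines from this thread.
errors_by_plugin <<'EOF'
time="2024-02-19T17:22:39Z" level=warning msg="Volumesnapshotcontent ... has error" backup=velero/test3 pluginName=velero-plugin-for-csi
time="2024-02-19T15:18:46Z" level=error msg="Error backing up item" backup=velero/test223 name=my-pod
time="2024-02-18T03:28:42Z" level=info msg="Authentication will be done for cloud main" pluginName=velero-plugin-for-openstack
EOF
```

Against a live cluster the input would typically come from something like `kubectl -n velero logs deployment/velero | errors_by_plugin`, assuming Velero runs as a deployment named `velero` in the `velero` namespace.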

In a simple standard setup you should have one BackupStorageLocation (BSL) and one VolumeSnapshotLocation (VSL).

---
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: my-awesome-bsl
  namespace: <NAMESPACE>
spec:
  objectStorage:
    bucket: <CONTAINER_NAME>
  provider: community.openstack.org/openstack

---
apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  name: my-awesome-vsl
  namespace: <NAMESPACE>
spec:
  provider: community.openstack.org/openstack-cinder
  config:
    volumeTimeout: 5m
    snapshotTimeout: 5m
    cloneTimeout: 5m
    backupTimeout: 5m
    imageTimeout: 5m
    ensureDeleted: "true"
    ensureDeletedDelay: 10s
    cascadeDelete: "true"

Then run the Velero backup with an explicitly defined storage location and volume snapshot location. This ensures that you don't back up using a different configuration (or provider):

velero backup create <BACKUP_NAME> \
            --namespace velero \
            --include-namespaces default \
            --snapshot-volumes=true \
            --storage-location my-awesome-bsl \
            --volume-snapshot-locations my-awesome-vsl \
            --wait

Can you try it this way?

(This example might miss something, I was writing it from my memory)

jguipi commented 7 months ago

OK, great. I was able to create a few successful backups by:

  1. Being more precise with the command, as you suggested in your previous comment (both of these work):

    velero backup create test2345678 --include-namespaces=default --snapshot-volumes=true --storage-location=default --volume-snapshot-locations=manila --wait

    velero backup create test2345678 --include-namespaces=default --snapshot-volumes=true --storage-location=default --volume-snapshot-locations=cinder --wait

  2. Disabling the `velero-plugin-for-csi` init container and keeping only `velero-plugin-for-openstack`:

```yaml
initContainers:
  # - name: velero-plugin-for-csi
  #   image: velero/velero-plugin-for-csi:v0.7.0
  #   imagePullPolicy: IfNotPresent
  #   volumeMounts:
  #     - mountPath: /target
  #       name: plugins
  - name: velero-plugin-for-openstack
    image: lirt/velero-plugin-for-openstack:v0.6.0
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - mountPath: /target
        name: plugins
```

  3. Keeping CSI enabled in configuration.features (it will not work without it; on my side, it was throwing an authentication error).

Thanks a lot for your help ! Hopefully it will be able to help other people as well.

Lirt commented 7 months ago

Great to hear that.