RamenDR / ShioRamen

Align VRG KubeObjectProtection API with Fusion recipe (low level design + implementation: sprint 12) #85

Closed tjanssen3 closed 1 year ago

tjanssen3 commented 2 years ago

Tasks

asdf

tjanssen3 commented 2 years ago

Hooks

General

Velero Backup Hooks

Use Cases/Examples

Freeze/Unfreeze

Freeze/unfreeze could likely be folded into disabling KubeObjectProtection.

Sample yaml snippet: Backup

kind: VolumeReplicationGroup
metadata:
  name: recipe-alignment-sample
spec: 
  kubeObjectProtection:
    captureOrder:
      - includedResources:  ["Pod", "ConfigMap", "Secret"]  # resources to back up
      # TODO: determine how to select between "once" and "every object" behavior
        hooks: 
          - name: checkpoint  # will show up in Velero backup logs with this label
            container: app-container  # container containing scripts used in hooks
            labelSelector:  # optional. LabelSelectors can be used to limit the scope of which Pods run Hooks
              matchLabels:
                app: my-app
            pre:  # pre-hook: run before backup
              command:  # run each command sequentially
                - /cpdbr-scripts/cpdbr/checkpoint_create.sh;  # use ';' to separate multiple commands - Velero will otherwise interpret this as a single command
                - /cpdbr-scripts/cpdbr/checkpoint_backup_prehooks.sh
              onError: Fail  # behavior on error; defaults to Fail - other option is "Continue"
              timeout: 1800  # seconds. Timeout for this pre-hook instance.
            post:  # post-hook: run after backup
              # The fields below behave the same as in the pre-hook
              command:
                - /cpdbr-scripts/cpdbr/checkpoint_backup_posthooks.sh
              onError: Fail
              timeout: 600  # seconds

Example Notes

The intention of the snippet above is to describe a Backup hook. The desired behavior follows:

  1. Run pre-hook "checkpoint". This contains two commands, checkpoint_create.sh and checkpoint_backup_prehooks.sh, which execute sequentially.
  2. Back up all Pods, ConfigMaps and Secrets in the current namespace.
  3. Run post-hook "checkpoint". This runs one command: checkpoint_backup_posthooks.sh.
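
For reference, the behavior above could map onto a Velero Backup roughly as sketched below. This is an assumption about the translation, not the implemented one. The Backup name and both namespaces are hypothetical. Each script gets its own `exec` entry because Velero passes a hook's `command` to the container as a single command with arguments, and Velero expects the timeout as a duration string rather than a number of seconds.

```yaml
# Sketch only: one possible Velero Backup equivalent of the VRG snippet above.
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: recipe-alignment-sample-backup   # hypothetical name
  namespace: velero                      # assumes Velero's usual install namespace
spec:
  includedNamespaces:
  - my-app-namespace                     # hypothetical application namespace
  includedResources:
  - pods
  - configmaps
  - secrets
  hooks:
    resources:
    - name: checkpoint                   # appears in Velero backup logs
      includedNamespaces:
      - my-app-namespace
      labelSelector:                     # limits which Pods run the hooks
        matchLabels:
          app: my-app
      pre:                               # entries run in order, before the Pod is backed up
      - exec:
          container: app-container
          command:
          - /cpdbr-scripts/cpdbr/checkpoint_create.sh
          onError: Fail
          timeout: 30m                   # 1800 seconds; here it applies per exec entry
      - exec:
          container: app-container
          command:
          - /cpdbr-scripts/cpdbr/checkpoint_backup_prehooks.sh
          onError: Fail
          timeout: 30m
      post:                              # runs after the Pod is backed up
      - exec:
          container: app-container
          command:
          - /cpdbr-scripts/cpdbr/checkpoint_backup_posthooks.sh
          onError: Fail
          timeout: 10m                   # 600 seconds
```

Splitting the two pre-hook scripts into separate entries changes the timeout semantics slightly: the VRG snippet gives one timeout to the whole pre-hook, while this sketch gives each script its own.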

Issues

  1. Velero only runs hooks on Pods, and only in specific containers. If the user does not include Pods in the Backup Spec, Velero removes the hooks and does not run them. This may be undesirable, since users may only want to back up Deployments, which can create Pods.
  2. A Hook produces an error in two cases: 1) the specified Container doesn't exist; 2) with onError=Fail, any runtime error in the Hook is reported as an error, even if the Hook partially succeeded.
  3. Velero runs Backup Hooks on every Pod, which is not always what the user wants. If a user needs "run once" or "only run on X pods" behavior, use label selectors. Label selectors scoped to the Hook do not apply to the Backup; the two are separate.
  4. If multiple Pods need Hooks but the behavior differs per Pod, the user has two options: 1) use a single Hook definition that runs a script at a path common to all of the Pods; 2) specify multiple Hooks, each targeting a unique Pod with its own behavior.
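
Option 2 of issue 4, combined with the label selectors from issue 3, might look like the sketch below. It reuses the field names from the Backup sample in this comment. The VRG name, hook names, labels and script paths are all hypothetical.

```yaml
# Sketch only: two hooks in one captureOrder entry, each scoped to different Pods.
kind: VolumeReplicationGroup
metadata:
  name: recipe-alignment-sample-multi-hook   # hypothetical name
spec:
  kubeObjectProtection:
    captureOrder:
      - includedResources: ["Pod", "ConfigMap", "Secret"]
        hooks:
          - name: quiesce-database            # hypothetical hook for database Pods only
            container: app-container
            labelSelector:
              matchLabels:
                component: database           # hypothetical label
            pre:
              command:
                - /scripts/quiesce_database.sh   # hypothetical script
              onError: Fail
              timeout: 600  # seconds
          - name: flush-cache                 # hypothetical hook for cache Pods only
            container: app-container
            labelSelector:
              matchLabels:
                component: cache              # hypothetical label
            pre:
              command:
                - /scripts/flush_cache.sh        # hypothetical script
              onError: Continue
              timeout: 120  # seconds
```

For "run once" behavior, the same pattern applies with a label that only a single Pod carries.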

Further details about Restores will be added in a subsequent comment.

tjanssen3 commented 2 years ago

Velero Restore Hooks

Sample yaml snippet: Restore with InitContainers

kind: VolumeReplicationGroup
metadata:
  name: recipe-alignment-sample-1
spec: 
  kubeObjectProtection:
    recoverOrder:
    - backupName: checkpoint
      includedResources: ["Pod", "ConfigMap", "Secret"]
      hooks:
      - name: restore-hook-1  # name of the restore hook (covers both initContainer and exec)
        labelSelector:  # optional label selector for filtering Pods
          matchLabels:
            app: my-app
        initContainers:  # adds the init container below to each matching Pod; it runs before the Pod's own containers start
        - name: restore-hook-init  # name of the initContainer to create
          image: my-container-image  # image to create the initContainer from
          volumeMounts:  # if mount paths are required, specify them here
          - mountPath: /restores
            name: pvc-restore-init-container
          command:  # commands to run in sequence. Separate them with ';'
          - /bin/bash
          - -c
          - 'echo -n "hook: init container" >> /restores/pvc-restore-init-container-log'  # quoted so YAML does not parse "hook: " as a mapping key

Example 1 Notes

The intention of the snippet above is to describe an InitContainers Restore Hook. The desired behavior follows:

  1. Before restoring a Pod that matches the hook (in this snippet, via its label selector), add an InitContainer named restore-hook-init with the specified mount path.
  2. Run the commands specified in the InitContainer Hook.
  3. Restore the matching Pod.
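
As a point of comparison, the steps above could be expressed as a Velero Restore with an `init` restore hook, sketched below. The mapping is an assumption, not the implemented translation. The Restore name, namespaces and the `timeout` value are hypothetical, and the volume named `pvc-restore-init-container` is assumed to already exist in the restored Pod's spec.

```yaml
# Sketch only: one possible Velero Restore equivalent of the InitContainers sample.
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: recipe-alignment-sample-1-restore   # hypothetical name
  namespace: velero                         # assumes Velero's usual install namespace
spec:
  backupName: checkpoint
  includedResources:
  - pods
  - configmaps
  - secrets
  hooks:
    resources:
    - name: restore-hook-1
      includedNamespaces:
      - my-app-namespace                    # hypothetical application namespace
      labelSelector:
        matchLabels:
          app: my-app
      postHooks:
      - init:
          timeout: 2m                       # hypothetical wait for the init container
          initContainers:
          - name: restore-hook-init
            image: my-container-image
            volumeMounts:
            - mountPath: /restores
              name: pvc-restore-init-container   # must match a volume in the Pod spec
            command:
            - /bin/bash
            - -c
            - 'echo -n "hook: init container" >> /restores/pvc-restore-init-container-log'
```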

Sample yaml snippet: Restore with Exec Hook

kind: VolumeReplicationGroup
metadata:
  name: recipe-alignment-sample-2
spec: 
  kubeObjectProtection:
    recoverOrder:
    - backupName: checkpoint
      includedResources: ["Pod", "ConfigMap", "Secret"]
      hooks:
      - name: restore-hook-2  # name of the restore hook (covers both initContainer and exec)
        labelSelector:  # optional label selector for filtering Pods
          matchLabels:
            app: my-app
        exec:  # runs on specified containers; sequencing similar to a "post hook"
          container: app-container  # run commands on specified container
          command:  # commands to run sequentially on specified container
            - /cpdbr-scripts/cpdbr/checkpoint_restore_preworkloadhooks.sh
          onError: Fail  # on error, either Fail (end) or Continue
          timeout: 600  # timeout duration

Example 2 Notes

The intention of the snippet above is to describe an Exec Restore Hook. The desired behavior follows:

  1. Pods, ConfigMaps and Secrets are restored.
  2. Once a Pod with a matching Container is running, run the command /cpdbr-scripts/cpdbr/checkpoint_restore_preworkloadhooks.sh in that Container.
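
The exec variant could map onto a Velero Restore as sketched below. Again, this mapping is an assumption. The Restore name, namespaces and the `waitTimeout` value are hypothetical. Velero splits the timeout in two: `waitTimeout` bounds how long to wait for the container to start, and `execTimeout` bounds the command itself.

```yaml
# Sketch only: one possible Velero Restore equivalent of the Exec Hook sample.
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: recipe-alignment-sample-2-restore   # hypothetical name
  namespace: velero                         # assumes Velero's usual install namespace
spec:
  backupName: checkpoint
  includedResources:
  - pods
  - configmaps
  - secrets
  hooks:
    resources:
    - name: restore-hook-2
      includedNamespaces:
      - my-app-namespace                    # hypothetical application namespace
      labelSelector:
        matchLabels:
          app: my-app
      postHooks:
      - exec:
          container: app-container
          command:
          - /cpdbr-scripts/cpdbr/checkpoint_restore_preworkloadhooks.sh
          onError: Fail
          waitTimeout: 5m                   # hypothetical; wait for the container to be running
          execTimeout: 10m                  # 600 seconds, from the VRG sample's timeout
```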

General notes

  1. The hook names are optional. They were added to give the additional debug information that Velero error reporting supports; they are not required for the design.

tjanssen3 commented 2 years ago

Update based on Recipe adoption, using the sample from Andy. initContainers are not part of that sample, but I believe we'll need them later. This example therefore shows hooks with hook.type=exec, which could be changed to hook.type=initContainer once additional support is added.

kind: VolumeReplicationGroup
metadata:
  name: recipe-sample-cpd-instance
spec: 
  kubeObjectProtection:
    recipe:
      groups:
      - name: cpd-instance-volumes
        labelSelector: icpdsupport/empty-on-nd-backup
        # backupName excluded - restore on
        includedResources:
        - pv
        - pvc
        isClusterScoped: false
      - name: cpd-instance-resources
        labelSelector: icpdsupport/ignore-on-nd-backup
        excludedResourceTypes:
        - pv
        - pvc
        - event
        - event.events.k8s.io
      - name: cpd-instance-pre-workload-resources
        backupName: cpd-instance-resources
        excludedResources: 
        - deployments.apps
        - statefulsets.apps
        - daemonsets.apps
        - replicasets.apps
        - controllerrevisions.apps
        - cronjobs.batch
        - pods
        - operandrequests.operator.ibm.com
        - clients
        - imagetags.openshift.io
      - name: cpd-instance-workload-resources
        backupName: cpd-instance-resources
        includedResourceTypes:
        - deployments.apps
        - statefulsets.apps
        - daemonsets.apps
        - replicasets.apps
        - controllerrevisions.apps
        - cronjobs.batch
        - jobs.batch 
      - name: cpd-instance-operator-resources
        backupName: cpd-instance-resources
        includedResourceTypes:
        - operandrequests.operator.ibm.com
      hooks:
      - name: checkpoint
        type: exec
        config:
          container: main
          timeout: 1800
          onError: Fail
          command:
          - /cpdbr-scripts/cpdbr/checkpoint_create.sh
      - name: pre-backup
        type: exec
        config:
          container: main
          timeout: 600
          command:
          - /cpdbr-scripts/cpdbr/checkpoint_backup_prehooks.sh
      - name: post-backup
        type: exec
        config:
          container: main
          timeout: 600
          command:
          - /cpdbr-scripts/cpdbr/checkpoint_backup_posthooks.sh
      - name: pre-workload
        type: exec
        config:
          container: main
          timeout: 600
          command:
          - /cpdbr-scripts/cpdbr/checkpoint_restore_preworkloadhooks.sh
      - name: post-workload
        type: exec
        config:
          container: main
          timeout: 3600
          command:
          - /cpdbr-scripts/cpdbr/checkpoint_restore_posthooks.sh
        initContainer:
        - name:
          image:
          volumeMounts:
          command:
      workflows:
      - name: backup
        sequence:  # format = type: name
        - hook: checkpoint
        - hook: pre-backup
        - group: cpd-instance-volumes
        - hook: post-backup
        - group: cpd-instance-resources
      - name: restore
        sequence:
        - group: cpd-instance-volumes
        - group: cpd-instance-pre-workload-resources
        - hook: pre-workload
        - group: cpd-instance-workload-resources
        - hook: post-workload
        - group: cpd-instance-operator-resources

hatfieldbrian commented 1 year ago

Fixed by https://github.com/RamenDR/ramen/pull/675