cloudnative-pg / cloudnative-pg

CloudNativePG is a comprehensive platform designed to seamlessly manage PostgreSQL databases within Kubernetes environments, covering the entire operational lifecycle from initial deployment to ongoing maintenance.
https://cloudnative-pg.io
Apache License 2.0

It should be possible to add a backup section to an existing cluster #2668

Open · tculp opened this issue 1 year ago

tculp commented 1 year ago

If you create a cluster without a backup section and later add one (along with an annotation on the ServiceAccount template), nothing happens to the cluster as far as I can tell. Backups do not start showing up in the object store, the pods don't get restarted to pick up the new config, and so on.

The ServiceAccount does get the new annotation, though.

I killed one of the replica pods to force a reload and then created a Backup object, which seemed to work, but the resulting backup didn't successfully load into a new cluster. Maybe it would have if I had let the cluster live long enough.

In any case, it would be nice if the operator detected that it should now start taking backups and went through whatever steps are required to make that happen.
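
For reference, the change in question is roughly this addition to the existing Cluster spec (a condensed sketch with placeholder bucket and role values; the full manifests are in my follow-up comments):

serviceAccountTemplate:
  metadata:
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::<account>:role/<role>"   # placeholder IAM role
backup:
  barmanObjectStore:
    destinationPath: "s3://<bucket>"   # placeholder bucket
    s3Credentials:
      inheritFromIAMRole: true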

sxd commented 1 year ago

Hi @tculp

Can you provide a set of steps to reproduce this issue? It looks like a bug to me, but I'm not sure how to reproduce it.

Regards!

gbartolini commented 1 year ago

Hang on, the backup section does not take any backups by itself. You need to create a ScheduledBackup object if you want to take them regularly, or a Backup object for on-demand backups.
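
For example, both could look roughly like this (a minimal sketch; cluster-example and the object names are placeholders, and the schedule uses the operator's six-field cron format):

apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: on-demand-backup          # placeholder name
spec:
  cluster:
    name: cluster-example         # placeholder cluster name
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: nightly-backup            # placeholder name
spec:
  schedule: "0 0 0 * * *"         # six-field cron: every day at midnight
  backupOwnerReference: self
  cluster:
    name: cluster-example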

tculp commented 1 year ago

By backup section, I mean for example:

  backup:
    barmanObjectStore:
      destinationPath: "s3://example-bucket"
      s3Credentials:
        inheritFromIAMRole: true
    retentionPolicy: "30d"

tculp commented 1 year ago

Step-by-step:

  1. Apply the following manifest:

    apiVersion: postgresql.cnpg.io/v1
    kind: Cluster
    metadata:
      name: example-cluster-1
    spec:
      instances: 3
      storage:
        size: 1Gi
      bootstrap:
        initdb:
          database: customdb

    A cluster is created, without performing backups.

  2. Update the manifest to include serviceAccountTemplate and backup sections:

    apiVersion: postgresql.cnpg.io/v1
    kind: Cluster
    metadata:
      name: example-cluster-1
    spec:
      instances: 3
      storage:
        size: 1Gi
      serviceAccountTemplate:
        metadata:
          annotations:
            eks.amazonaws.com/role-arn: "arn:aws:iam::<redacted>"
      backup:
        barmanObjectStore:
          destinationPath: "s3://example-<redacted>"
          s3Credentials:
            inheritFromIAMRole: true
      bootstrap:
        initdb:
          database: customdb

    The ServiceAccount is updated with the new annotation, as expected.

No pods are restarted to pick up the new backup config, etc.

No backups appear in the specified bucket.

This is the point where I would expect the controller to realize that the manifest has changed, restart pods, and do whatever is needed to get object storage backups working.
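
As a rough sketch of what I'd expect to see in the Cluster status once the operator had reconciled the new configuration (the exact condition and reason names here are from memory, so treat them as an assumption):

status:
  conditions:
    - type: ContinuousArchiving            # condition name assumed
      status: "True"
      reason: ContinuousArchivingSuccess   # reason name assumed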

However, to investigate behavior, I'll proceed.

  3. Update the manifest to include a Backup:

    apiVersion: postgresql.cnpg.io/v1
    kind: Cluster
    metadata:
      name: example-cluster-1
    spec:
      instances: 3
      storage:
        size: 1Gi
      serviceAccountTemplate:
        metadata:
          annotations:
            eks.amazonaws.com/role-arn: "arn:aws:iam::<redacted>"
      backup:
        barmanObjectStore:
          destinationPath: "s3://example-<redacted>"
          s3Credentials:
            inheritFromIAMRole: true
      bootstrap:
        initdb:
          database: customdb
    ---
    apiVersion: postgresql.cnpg.io/v1
    kind: Backup
    metadata:
      name: backup-example-1
    spec:
      cluster:
        name: example-cluster-1

    A Backup is created; it stays in the running phase for a while, but eventually ends up in the walArchivingFailing phase.

  4. Kill the cluster pods one at a time. Objects appear in the bucket in the expected location (bucket/example-cluster-1/wals).

  5. Create a new Backup:

    ...
    ---
    apiVersion: postgresql.cnpg.io/v1
    kind: Backup
    metadata:
      name: backup-example-2
    spec:
      cluster:
        name: example-cluster-1

    The new backup reaches the completed phase.

  6. Create a new cluster, using the object store of the previous one:

    ...
    ---
    apiVersion: postgresql.cnpg.io/v1
    kind: Cluster
    metadata:
      name: cluster-example-2
    spec:
      instances: 3
      storage:
        size: 1Gi
      serviceAccountTemplate:
        metadata:
          annotations:
            eks.amazonaws.com/role-arn: "arn:aws:iam::<redacted>"
      backup:
        barmanObjectStore:
          destinationPath: "s3://example-<redacted>"
          s3Credentials:
            inheritFromIAMRole: true
        retentionPolicy: "30d"
      bootstrap:
        recovery:
          source: example-cluster-1

      externalClusters:
        - name: example-cluster-1
          barmanObjectStore:
            destinationPath: "s3://example-<redacted>"
            serverName: example-cluster-1
            s3Credentials:
              inheritFromIAMRole: true

    I think this last step failed with an error the previous time I tried it, but this time it seemed to work fine. One potential difference is that last time I only killed some of the pods before WALs showed up in object storage, and then bootstrapping the new cluster from a backup failed. Also, the first time I was creating ScheduledBackups instead of Backups directly, but I don't see why that would make a difference. An alternative way to bootstrap from a specific backup is sketched below.
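
Instead of pointing the new cluster at the object store through externalClusters, the recovery bootstrap can also reference a specific Backup object by name (a sketch, assuming the Backup and the new Cluster live in the same namespace; cluster-example-3 is a placeholder name):

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-example-3          # placeholder name for another recovery test
spec:
  instances: 3
  storage:
    size: 1Gi
  bootstrap:
    recovery:
      backup:
        name: backup-example-2     # the Backup created above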

ferama commented 1 year ago

I have a similar issue.

How to reproduce:

Create a cluster using kubectl apply -f on a YAML file like this:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-example
  namespace: cluster1
spec:
  instances: 2
  imageName: ghcr.io/cloudnative-pg/postgresql:14.9
  primaryUpdateStrategy: unsupervised
  bootstrap:
    recovery:
      source: cluster-example
  externalClusters:
    - name: cluster-example
      barmanObjectStore:
        serverName: test-backup-3
        destinationPath: s3://lab-cnpg
        endpointURL: https://[REDACTED]
        s3Credentials:
          accessKeyId:
            name: s3-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: s3-creds
            key: ACCESS_SECRET_KEY
        wal:
          maxParallel: 8
  storage:
    size: 5Gi

The cluster comes up and loads the backup correctly.

Now edit the definition to add a backup section and a ScheduledBackup object, and reapply with kubectl apply -f edited_file.yaml:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-example
  namespace: cluster1
spec:
  instances: 2
  imageName: ghcr.io/cloudnative-pg/postgresql:14.9
  primaryUpdateStrategy: unsupervised
  backup:
    retentionPolicy: "3d"
    barmanObjectStore:
      serverName: test-backup-4
      destinationPath: s3://lab-cnpg
      endpointURL: https://[REDACTED]
      s3Credentials:
        accessKeyId:
          name: s3-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: s3-creds
          key: ACCESS_SECRET_KEY
  bootstrap:
    recovery:
      source: cluster-example
  externalClusters:
    - name: cluster-example
      barmanObjectStore:
        serverName: test-backup-3
        destinationPath: s3://lab-cnpg
        endpointURL: https://[REDACTED]
        s3Credentials:
          accessKeyId:
            name: s3-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: s3-creds
            key: ACCESS_SECRET_KEY
        wal:
          maxParallel: 8
  storage:
    size: 5Gi
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: backup-example
  namespace: cluster1
spec:
  immediate: true
  schedule: "0 0 0 * * *"
  backupOwnerReference: self
  cluster:
    name: cluster-example

After a while, test-backup-4 is created in the object store. Trying to recover from test-backup-4 then fails with:

"Error while restoring a backup","logging_pod":"cluster-example-1-full-recovery","error":"encountered an error while checking the presence of first needed WAL in the archive: file not found 000000040000000000000016: WAL not found"
tculp commented 1 year ago

When I got an error while loading, I also got an "encountered an error while checking the presence of first needed WAL in the archive" message.

LucaCominoli21 commented 10 months ago

I have also noticed the same problem:

If the database was created without the barmanObjectStore section, it is not possible to restore the database from a backup even after adding the section later.

If WALs were generated between the creation of the cluster and the addition of the barmanObjectStore section, they are lost, and this prevents a restore via bootstrap.recovery.source with the error:

encountered an error while checking the presence of first needed WAL in the archive: file not found 000000010000000000000010: WAL not found

On the source database, the missing file is logged as skipped because backup is not configured:

{"level":"info","ts":"2023-12-04T14:17:33Z","logger":"wal-archive","msg":"Backup not configured, skip WAL archiving","logging_pod":"[cluster-name]-1","walName":"pg_wal/000000010000000000000010","currentPrimary":"[cluster-name]-1","targetPrimary":"[cluster-name]1"}

Only the emergency backup procedure allowed me to complete the restore.

PrivatePuffin commented 9 months ago

@sxd This is a potentially huge data security issue, can this get a tad more attention please?