tculp opened this issue 1 year ago
Hi @tculp
Can you provide a set of steps to reproduce this issue? It looks like a bug to me, but I'm not sure how to reproduce it.
Regards!
Hang on: the backup section does not take any backups by itself. You need to create a ScheduledBackup object if you want to take them regularly, or a Backup object for on-demand backups.
By backup section, I mean, for example:
```yaml
backup:
  barmanObjectStore:
    destinationPath: "s3://example-bucket"
    s3Credentials:
      inheritFromIAMRole: true
  retentionPolicy: "30d"
```
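For reference, a minimal sketch of the two objects mentioned above (the object names here are placeholders, not from this report; the six-field cron schedule mirrors the one used later in this thread):

```yaml
# Hypothetical names; both objects reference the Cluster by name.
apiVersion: postgresql.cnpg.io/v1
kind: Backup                # on-demand backup
metadata:
  name: backup-on-demand
spec:
  cluster:
    name: example-cluster-1
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup       # recurring backups
metadata:
  name: backup-nightly
spec:
  schedule: "0 0 0 * * *"   # six-field cron (seconds first): daily at midnight
  cluster:
    name: example-cluster-1
```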
Step-by-step:
Apply the following manifest:
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-cluster-1
spec:
  instances: 3
  storage:
    size: 1Gi
  bootstrap:
    initdb:
      database: customdb
```
A cluster is created, without performing backups.
Update the manifest to include serviceAccountTemplate and backup sections:
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-cluster-1
spec:
  instances: 3
  storage:
    size: 1Gi
  serviceAccountTemplate:
    metadata:
      annotations:
        eks.amazonaws.com/role-arn: "arn:aws:iam::<redacted>"
  backup:
    barmanObjectStore:
      destinationPath: "s3://example-<redacted>"
      s3Credentials:
        inheritFromIAMRole: true
  bootstrap:
    initdb:
      database: customdb
```
The ServiceAccount is updated with the new annotation, as expected.
No pods are restarted to pick up the new backup config, etc.
No backups appear in the specified bucket.
This is the point where I would expect the controller to realize that the manifest has changed, restart pods, and do whatever is needed to get object storage backups working.
However, to investigate the behavior, I'll proceed.
Update the manifest to include a Backup:
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-cluster-1
spec:
  instances: 3
  storage:
    size: 1Gi
  serviceAccountTemplate:
    metadata:
      annotations:
        eks.amazonaws.com/role-arn: "arn:aws:iam::<redacted>"
  backup:
    barmanObjectStore:
      destinationPath: "s3://example-<redacted>"
      s3Credentials:
        inheritFromIAMRole: true
  bootstrap:
    initdb:
      database: customdb
---
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: backup-example-1
spec:
  cluster:
    name: example-cluster-1
```
A Backup is created, which stays in the running phase for a while but eventually moves to walArchivingFailing.
Kill the cluster pods one at a time. Objects appear in the bucket in the expected location (bucket/example-cluster-1/wals).
Create a new backup:
```yaml
...
---
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: backup-example-2
spec:
  cluster:
    name: example-cluster-1
```
The new backup reaches the completed phase.
Create a new cluster, using the objectStore of the previous one:
```yaml
...
---
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-example-2
spec:
  instances: 3
  storage:
    size: 1Gi
  serviceAccountTemplate:
    metadata:
      annotations:
        eks.amazonaws.com/role-arn: "arn:aws:iam::<redacted>"
  backup:
    barmanObjectStore:
      destinationPath: "s3://example-<redacted>"
      s3Credentials:
        inheritFromIAMRole: true
    retentionPolicy: "30d"
  bootstrap:
    recovery:
      source: example-cluster-1
  externalClusters:
    - name: example-cluster-1
      barmanObjectStore:
        destinationPath: "s3://example-<redacted>"
        serverName: example-cluster-1
        s3Credentials:
          inheritFromIAMRole: true
```
I think this last step failed with an error last time, but this time it seemed to work fine. One potential difference is that last time I killed only some of the pods before WALs showed up in object storage, and then bootstrapping the new cluster from a backup failed. Also, the first time I was creating ScheduledBackups instead of Backups directly, but I don't see why that would make a difference.
I have a similar issue.
How to reproduce:
Create a cluster by running kubectl apply -f on a YAML file like this:
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-example
  namespace: cluster1
spec:
  instances: 2
  imageName: ghcr.io/cloudnative-pg/postgresql:14.9
  primaryUpdateStrategy: unsupervised
  bootstrap:
    recovery:
      source: cluster-example
  externalClusters:
    - name: cluster-example
      barmanObjectStore:
        serverName: test-backup-3
        destinationPath: s3://lab-cnpg
        endpointURL: https://[REDACTED]
        s3Credentials:
          accessKeyId:
            name: s3-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: s3-creds
            key: ACCESS_SECRET_KEY
        wal:
          maxParallel: 8
  storage:
    size: 5Gi
```
The cluster comes up and loads the backup correctly.
Now edit the definition and add a ScheduledBackup object too, then reapply with kubectl apply -f edited_file.yaml:
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-example
  namespace: cluster1
spec:
  instances: 2
  imageName: ghcr.io/cloudnative-pg/postgresql:14.9
  primaryUpdateStrategy: unsupervised
  backup:
    retentionPolicy: "3d"
    barmanObjectStore:
      serverName: test-backup-4
      destinationPath: s3://lab-cnpg
      endpointURL: https://[REDACTED]
      s3Credentials:
        accessKeyId:
          name: s3-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: s3-creds
          key: ACCESS_SECRET_KEY
  bootstrap:
    recovery:
      source: cluster-example
  externalClusters:
    - name: cluster-example
      barmanObjectStore:
        serverName: test-backup-3
        destinationPath: s3://lab-cnpg
        endpointURL: https://[REDACTED]
        s3Credentials:
          accessKeyId:
            name: s3-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: s3-creds
            key: ACCESS_SECRET_KEY
        wal:
          maxParallel: 8
  storage:
    size: 5Gi
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: backup-example
  namespace: cluster1
spec:
  immediate: true
  schedule: "0 0 0 * * *"
  backupOwnerReference: self
  cluster:
    name: cluster-example
```
After a while, test-backup-4 is created in the object store. Trying to recover from test-backup-4 then fails with:

```
"Error while restoring a backup","logging_pod":"cluster-example-1-full-recovery","error":"encountered an error while checking the presence of first needed WAL in the archive: file not found 000000040000000000000016: WAL not found"
```
When I got an error while loading, I also got an `encountered an error while checking the presence of first needed WAL in the archive` message.
I have also noticed the same problem:
If the database was created without the barmanObjectStore section, it is not possible to restore the database from backup even by adding it later.
If WALs were generated between the creation of the cluster and the addition of the barmanObjectStore section, they will be lost, and this will prevent a restore via bootstrap.recovery.source with the error:

```
encountered an error while checking the presence of first needed WAL in the archive: file not found 000000010000000000000010: WAL not found
```
On the source database, the missing file is logged as skipped because backup is not configured:

```
{"level":"info","ts":"2023-12-04T14:17:33Z","logger":"wal-archive","msg":"Backup not configured, skip WAL archiving","logging_pod":"[cluster-name]-1","walName":"pg_wal/000000010000000000000010","currentPrimary":"[cluster-name]-1","targetPrimary":"[cluster-name]1"}
```
Only the emergency backup procedure allowed me to complete the restore.
@sxd This is a potentially huge data security issue, can this get a tad more attention please?
If you create a cluster without a backup section and later add one (plus an annotation on the service account template), nothing happens to the cluster as far as I can tell. Backups do not start showing up in the object store, the pods don't get restarted to pick up the new config, etc.
The ServiceAccount does get the new annotation, though.
I killed one of the replica pods to force a reload and then created a Backup object, which seemed to work, but the resulting backup didn't successfully load into a new cluster. Maybe it would have if I had let the cluster live long enough.
In any case, it would be nice if the operator detected that it should now start taking backups and went through whatever steps are required to make that happen.