CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

My pgbackrest-repo disk has filled... how do I delete some backups from it? #2111

Closed: alrooney closed this issue 3 years ago

alrooney commented 3 years ago

pgo 4.5.0

My pgbackrest-repo disk has filled up because I set my retention policy incorrectly. I tried running a backup with retention set to a lower number, hoping it would delete old backups before taking the new one, but that failed, presumably because it tries to take the backup first and the disk is already full. How can I safely remove old backups from the repo to free up some space?

jkatz commented 3 years ago

In an emergency situation, you can manually execute pgbackrest expire on the Pod. I would suggest expiring the oldest backup.

Please note all of the warnings associated with running this command.

alrooney commented 3 years ago

Thanks - so I should exec into the -backrest-shared-repo pod and then run the pgbackrest expire command there, or should I run it elsewhere?

alrooney commented 3 years ago

Looking through the pgbackrest docs, I'm confused about how to run that expire command. I did try running a backup as follows, but it did not work:

pgo backup retroelk-prod1-azure --backup-opts="--type=full --repo1-retention-full=7 expire" -n pgo

jkatz commented 3 years ago

Yes, you can execute it on the pgBackRest Pod.

Here is an example. In this case, I'm going to delete my latest backup.

🚨 As mentioned above, there are a ton of warnings associated with running pgbackrest expire, so please do so at your own discretion 🚨

1. Get a list of backups
pgo show backup lion

cluster: lion
storage type: local

stanza: db
    status: ok
    cipher: none

    db (current)
        wal archive min/max (13-1)

        full backup: 20201209-144519F
            timestamp start/stop: 2020-12-09 14:45:19 +0000 UTC / 2020-12-09 14:45:29 +0000 UTC
            wal start/stop: 000000010000000000000002 / 000000010000000000000002
            database size: 31.0MiB, backup size: 31.0MiB
            repository size: 3.8MiB, repository backup size: 3.8MiB
            backup reference list: 

        incr backup: 20201209-144519F_20201210-144040I
            timestamp start/stop: 2020-12-10 14:40:40 +0000 UTC / 2020-12-10 14:40:50 +0000 UTC
            wal start/stop: 000000020000000000000006 / 000000020000000000000006
            database size: 31.0MiB, backup size: 221.4KiB
            repository size: 3.8MiB, repository backup size: 28.0KiB
            backup reference list: 20201209-144519F
2. Expire the backup

    kubectl exec -it lion-backrest-shared-repo-84c5fc44d7-mzvms -- pgbackrest expire --set=20201209-144519F_20201210-144040I

WARN: option 'repo1-retention-full' is not set for 'repo1-retention-full-type=count', the repository may run out of space
      HINT: to retain full backups indefinitely (without warning), set option 'repo1-retention-full' to the maximum.
WARN: expiring latest backup 20201209-144519F_20201210-144040I - the ability to perform point-in-time-recovery (PITR) may be affected
      HINT: non-default settings for 'repo1-retention-archive'/'repo1-retention-archive-type' (even in prior expires) can cause gaps in the WAL.


3. See that it is expired:

pgo show backup lion

cluster: lion
storage type: local

stanza: db
    status: ok
    cipher: none

db (current)
    wal archive min/max (13-1)

    full backup: 20201209-144519F
        timestamp start/stop: 2020-12-09 14:45:19 +0000 UTC / 2020-12-09 14:45:29 +0000 UTC
        wal start/stop: 000000010000000000000002 / 000000010000000000000002
        database size: 31.0MiB, backup size: 31.0MiB
        repository size: 3.8MiB, repository backup size: 3.8MiB
        backup reference list: 
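
To find the repo pod name used in step 2, here is a minimal sketch assuming the default <cluster>-backrest-shared-repo naming; pgbackrest info is read-only, so it is a safe way to confirm you are on the right pod:

    # List pods in the operator namespace; the shared repo deployment
    # is named <cluster>-backrest-shared-repo by default.
    kubectl get pods -n pgo | grep backrest-shared-repo

    # Read-only sanity check that pgBackRest can see the repo from this pod.
    kubectl exec -it <repo-pod-name> -n pgo -- pgbackrest info
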
alrooney commented 3 years ago

Awesome!! Thanks much. Yes - I understand all the warnings :-) We also have full backups in S3, so we're OK with losing them in the backrest repo.

alrooney commented 3 years ago

Any suggestions?

 $ kk exec -ti retroelk-prod1-azure-backrest-shared-repo-9cd4f6464-4bcm8 -- pgbackrest expire --repo1-retention-full=7
ERROR: [041]: unable to open file '/backrestrepo/retroelk-backrest-shared-repo/backup/db/backup.info' for write: [28] No space left on device
command terminated with exit code 41

jkatz commented 3 years ago

Clear ephemeral files (not WAL logs) and/or resize the PVC.
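
If you go the resize route, a minimal sketch, assuming the repo PVC follows the default <cluster>-pgbr-repo naming and the underlying StorageClass has allowVolumeExpansion enabled (names and size here are illustrative):

    # Confirm the actual PVC name first.
    kubectl get pvc -n pgo

    # Request a larger size; this only takes effect if the StorageClass
    # supports volume expansion.
    kubectl patch pvc retroelk-prod1-azure-pgbr-repo -n pgo --type=merge \
        -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'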

alrooney commented 3 years ago

Yeah - I get nervous deleting stuff out of the backrest repo because I know it does checksums. So what are the ephemeral files that are safe to delete in the backrest repo?
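
Before deleting anything, it can help to see what is actually consuming the space. A read-only sketch, using the pod name and repo path from the error output above:

    # Overall usage on the repo volume.
    kubectl exec -it retroelk-prod1-azure-backrest-shared-repo-9cd4f6464-4bcm8 -- \
        df -h /backrestrepo

    # Break down usage between backups and archived WAL.
    kubectl exec -it retroelk-prod1-azure-backrest-shared-repo-9cd4f6464-4bcm8 -- \
        du -sh /backrestrepo/retroelk-backrest-shared-repo/backup \
               /backrestrepo/retroelk-backrest-shared-repo/archive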

alrooney commented 3 years ago

Again - I'm not worried about losing backups because I have full backups in S3. My main concern is that I don't want to mess up the cluster so that the backrest repo becomes non-functional. That is actually my current emergency: because the backrest repo is full, my primary db disk is rapidly filling up with WAL. I'd be fine blowing away the whole repo since I have backups in S3, but I don't want to have to rebuild the cluster or impact the production db in any way.

alrooney commented 3 years ago

Re resizing the PVC: I'm also working on that, but it would be simplest if I could just delete some data from the backup repo and fix my retention policy. Actually, funny story... the only reason the retention policy is so high on this repo is that I want to keep more backups in S3, and there is no way with the operator (that I know of) to set a different retention policy, or a different schedule (which would let me set a different retention policy), for local vs S3 backups. I'll have to file a separate feature request on that :-)

alrooney commented 3 years ago

OK, I managed to delete some files, and the backrest repo has disk space again, but WAL is still growing on the primary db. Is there anything we need to restart on the primary to make sure WAL files are being archived and cleared off the primary?
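
No restart should be needed: once archive_command can write to the repo again, Postgres removes completed WAL segments at subsequent checkpoints. One way to confirm archiving has resumed, sketched with an illustrative primary pod name:

    # pg_stat_archiver shows the last archived WAL and any recent failures;
    # add -c <container> if the primary pod runs multiple containers.
    kubectl exec -it <primary-pod-name> -n pgo -- psql -c \
        "SELECT archived_count, last_archived_wal, last_archived_time,
                failed_count, last_failed_wal, last_failed_time
         FROM pg_stat_archiver;"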

alrooney commented 3 years ago

Where can I look at the logs for the wal exporter?
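
In PGO 4.x the archive_command runs pgbackrest archive-push on the primary, so archiving errors surface in the primary pod's logs. A sketch with an illustrative pod name:

    # Recent logs from the primary database pod, where archive-push
    # failures appear; add -c <container> if the pod has several containers.
    kubectl logs <primary-pod-name> -n pgo --tail=100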

jkatz commented 3 years ago

Flagging for possible documentation additions around deleting backups as well as potentially adding ability to explicitly delete a backup from the Operator CLI.

jkatz commented 3 years ago

An explicit pgo delete backup command that deletes pgBackRest backups has now been added to the Operator and will be in the 4.6 release.
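
A sketch of the likely usage; the --target flag for selecting a specific backup by its pgBackRest label is an assumption here, so verify against pgo delete backup --help and the 4.6 docs:

    # Delete one pgBackRest backup by label (flag and label are illustrative).
    pgo delete backup retroelk-prod1-azure --target=20201209-144519F -n pgo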

vit-h commented 1 year ago

@jkatz please add pgo delete backup to the new 5+ client