CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

Stopping a pgBackRest Process #2409

Closed · kupachupa closed this 3 years ago

kupachupa commented 3 years ago

I am running a large instance (~2TB) of Postgres 13, on 4.6.1. pgBackRest seems to often place a lock on '/tmp/pgbackrest/db-backup.lock', which seemingly prevents other pgBackRest commands from running, including backup attempts & attempts to delete old backups.

The error text:

% pgo delete backup sample-db --target=20210401-215524F

WARNING - If you delete a backup that is *not* set to expire, you may be unable to meet your retention requirements. Proceed? (yes/no): yes
Error: ERROR: [050]: unable to acquire lock on file '/tmp/pgbackrest/db-backup.lock': Resource temporarily unavailable
       HINT: is another pgBackRest process running?

The backup has seemingly been running for 2 days now according to ps on the backrest repo pod, and the start time coincides with when I most recently ran pgo update cluster sample-db --shutdown, then pgo update cluster sample-db --startup to try to cancel a similarly long-running backup attempt.
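For reference, this is roughly how I'm checking (service name is from my cluster, and the lock path is the one from the error above -- adjust to your own setup):

```shell
# Look for pgBackRest processes still running in the repo pod:
kubectl exec -n pgo svc/sample-db-backrest-shared-repo -- ps -ef | grep '[p]gbackrest'

# Inspect the lock file referenced in the error:
kubectl exec -n pgo svc/sample-db-backrest-shared-repo -- ls -l /tmp/pgbackrest/
```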

How would you recommend debugging this? How can I go about ending a backrest process in a production backrest deployment, without potentially causing harm to the cluster's continued operation? I've read through other issues & "restarting pgBackRest" comes up as a suggestion -- but pgo doesn't seem to have a restart command for backrest (no worries at all -- can't expect the operator to do everything). Given the interop going on here, I wanted to inquire as to the proper troubleshooting path before I FUBAR my production db / backup solution.

Thank you! Love the project & intend on supporting to the best of my ability.

kupachupa commented 3 years ago

Oh - for the sake of context + understanding the instance & environment: this is v4.6.1 running on GKE, on a 6 CPU / 32Gi instance with limits of 7 CPU / 58Gi.

This command was used to create the cluster:

pgo create cluster sample-db \
    --pgbouncer \
    --pgbouncer-replicas=2 \
    --pvc-size="5Ti" \
    --storage-config="balancedcsireclaimable" \
    --replica-storage-config="balancedcsireclaimable" \
    --cpu="6" \
    --cpu-limit="7" \
    --memory="32Gi" \
    --memory-limit="58Gi" \
    --pgbackrest-storage-type="local" \
    --pgbackrest-pvc-size="5Ti" \
    --pgbackrest-storage-config="balancedcsireclaimable" \
    --toleration=pool=giants-lance:NoSchedule 

Here is the dcs-config:

postgresql:
  parameters:
    archive_command: source /opt/crunchy/bin/postgres-ha/pgbackrest/pgbackrest-set-env.sh && pgbackrest archive-push "%p"
    archive_mode: true
    archive_timeout: 60
    jit: false
    log_directory: pg_log
    log_min_duration_statement: 60000
    log_temp_files: 0
    log_statement: none
    max_wal_senders: 6
    shared_buffers: 8GB
    effective_cache_size: 24GB
    maintenance_work_mem: 2GB
    shared_preload_libraries: pgaudit.so,pg_stat_statements.so,pgnodemx.so
    temp_buffers: 8MB
    unix_socket_directories: /tmp
    wal_level: logical
    checkpoint_completion_target: 0.7
    default_statistics_target: 100
    random_page_cost: 1.1
    wal_buffers: 16MB
    effective_io_concurrency: 200
    work_mem: 41943kB
    min_wal_size: 1GB
    max_wal_size: 4GB
    max_worker_processes: 4
    max_parallel_workers_per_gather: 2
    max_parallel_workers: 4
    max_parallel_maintenance_workers: 2
    autovacuum_max_workers: 5
    vacuum_cost_limit: 2000
  recovery_conf:
    restore_command: source /opt/crunchy/bin/postgres-ha/pgbackrest/pgbackrest-set-env.sh && pgbackrest archive-get %f "%p"
  use_pg_rewind: true
jkatz commented 3 years ago

The pgBackRest docs cover how to stop pgBackRest processes, i.e. through the pgBackRest stop command. PGO does not provide an interface for that; you would have to exec into the pgBackRest Pod and run the command yourself. Shutting down your Postgres cluster with the commands you present would not stop the backup, and potentially fits into the category of "really bad things".
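A rough sketch of that approach, using the service and stanza names from the cluster above (pgBackRest's stop command with --force terminates any running processes for the stanza; start re-enables operations afterwards):

```shell
# Exec into the pgBackRest repo pod (service name from this cluster):
kubectl exec -n pgo -it svc/sample-db-backrest-shared-repo -- /bin/bash

# Inside the pod: disable pgBackRest for the stanza and terminate running processes:
pgbackrest stop --stanza=db --force

# Once the stuck process is gone, re-enable pgBackRest operations:
pgbackrest start --stanza=db
```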

If a backup is taking "too long" but is still running, you likely need to provide more CPU (and possibly memory) resources to the backup Job and increase the value of the pgBackRest process-max option. pgBackRest was specifically designed for TB-size systems, but at that scale it does require tuning some of its performance knobs.
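For example, a one-off backup with a higher parallelism setting might look like this (the values here are illustrative, not a recommendation):

```shell
# Illustrative values; tune process-max to the CPU actually available to the backup Job.
pgo backup sample-db \
  --backup-type=pgbackrest \
  --backup-opts="--type=full --process-max=4"
```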

However, if you need production support, you should really consult the Support page.

kupachupa commented 3 years ago

I don't think I said "too long" anywhere in my text, not sure who you are quoting. This isn't a qualitative critique.

Regarding backup & provisioning of resources on job objects -- this is part of the issue. There is no running Job object in the cluster as of now; whenever one is spun up, it is immediately killed due to conflict with that same long-running backup. This running backup does not correspond to any existing Job object active in the cluster.

I've read the stop command documentation & my understanding of it is that it is meant to shut down all running backrest processes -- my desire is to stop a single backup while retaining the WAL archival process, which still continues unimpeded & as intended. Is it your assertion that there wouldn't be notable consequences to issuing stop and start commands on the backrest-repo deployment, and that I might do so w/o harming the continual operation of the database, or of harm being done to the WAL archival process?

As for your advice on the process-max argument in my scheduled jobs -- I already do this & it works excellently; it was a really nice get in your most recent updates, many thanks for adding it (for those struggling w/ performance on large dbs, I use the following arguments: --backup-type=pgbackrest --backup-opts='--type=full --protocol-timeout=604000 --archive-timeout=86000 --db-timeout=600000 --io-timeout=3599 --process-max=3 --repo1-retention-full=2 --stop-auto'). Unfortunately this backup seems to have been issued at startup & not by one of my scheduled jobs, so none of these opts are applied on the default backup command.

jkatz commented 3 years ago

The backup's seemingly been running for 2 days now

is where I derived "too long" as IMHO that is a long time for a backup to run :wink:

After a downtime event, PGO will automatically take a full backup to ensure that the new archive timeline is represented in a backup.

I didn't interpret this as a "critique" rather was just attempting to provide helpful pointers. It seems like you have already derived most, if not all, of the helpful bits on your own.

kupachupa commented 3 years ago

Gotcha! I appreciate the pointers & clarification / information you've been able to provide.

Regarding a downtime event & subsequent auto-triggered backup - to your knowledge, is there a way to either postpone / cancel that initial backup, or manipulate the options which are present in that backup request?

Lastly - hoping to get your thoughts on this question from the last post:

Is it your assertion that there wouldn't be any notable consequences to issuing stop and start commands on the backrest-repo deployment, and that I might do so w/o harming the continual operation of the database, or of harm being done to the WAL archival process?

kupachupa commented 3 years ago

OK - to add another detail you might find interesting: I would expect the output of pgo show backup and pgbackrest info to be the same -- but they diverge, and it is on the status field:

Here is the output for the pgo command.

% pgo show backup sample-db

cluster: sample-db
storage type: posix

stanza: db
    status: ok
    cipher: none

    db (current)
    ...

and here's the output for the pgbackrest one:

% kubectl exec -n pgo -it svc/sample-db-backrest-shared-repo -- /bin/bash
% pgbackrest info
stanza: db
    status: ok (backup/expire running)
    cipher: none

    db (current)
    ...

I'm not entirely sure why pgo would see a different status, and given that any subsequent attempt to run a backup gets an unable to acquire lock on file '/tmp/pgbackrest/db-backup.lock' error, I would imagine backrest's status is accurate. I had assumed pgo essentially pipes the output of that same pgbackrest info command to the user - what is the difference between pgo show backup and pgbackrest info, and what might the divergence in output point to?