[Closed] kupachupa closed this issue 3 years ago
Oh - for the sake of context and understanding the instance/environment: this is running on GKE, on a 6 CPU / 32Gi instance with limits at 7 CPU / 58Gi, on v4.6.1.
This command was used to create the cluster:
pgo create cluster sample-db \
--pgbouncer \
--pgbouncer-replicas=2 \
--pvc-size="5Ti" \
--storage-config="balancedcsireclaimable" \
--replica-storage-config="balancedcsireclaimable" \
--cpu="6" \
--cpu-limit="7" \
--memory="32Gi" \
--memory-limit="58Gi" \
--pgbackrest-storage-type="local" \
--pgbackrest-pvc-size="5Ti" \
--pgbackrest-storage-config="balancedcsireclaimable" \
--toleration=pool=giants-lance:NoSchedule
Here is the dcs-config:
postgresql:
  parameters:
    archive_command: source /opt/crunchy/bin/postgres-ha/pgbackrest/pgbackrest-set-env.sh && pgbackrest archive-push "%p"
    archive_mode: true
    archive_timeout: 60
    jit: false
    log_directory: pg_log
    log_min_duration_statement: 60000
    log_temp_files: 0
    log_statement: none
    max_wal_senders: 6
    shared_buffers: 8GB
    effective_cache_size: 24GB
    maintenance_work_mem: 2GB
    shared_preload_libraries: pgaudit.so,pg_stat_statements.so,pgnodemx.so
    temp_buffers: 8MB
    unix_socket_directories: /tmp
    wal_level: logical
    checkpoint_completion_target: 0.7
    default_statistics_target: 100
    random_page_cost: 1.1
    wal_buffers: 16MB
    effective_io_concurrency: 200
    work_mem: 41943kB
    min_wal_size: 1GB
    max_wal_size: 4GB
    max_worker_processes: 4
    max_parallel_workers_per_gather: 2
    max_parallel_workers: 4
    max_parallel_maintenance_workers: 2
    autovacuum_max_workers: 5
    vacuum_cost_limit: 2000
  recovery_conf:
    restore_command: source /opt/crunchy/bin/postgres-ha/pgbackrest/pgbackrest-set-env.sh && pgbackrest archive-get %f "%p"
  use_pg_rewind: true
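Aside: the work_mem figure above isn't arbitrary -- it reproduces a PGTune-style formula if you assume the default max_connections of 100 (an assumption; max_connections isn't in the config shown):

```shell
# Sketch: derive the configured work_mem via a PGTune-style formula.
# max_connections=100 is an assumed default, not part of the config above.
total_ram_kb=$((32 * 1024 * 1024))       # 32Gi memory request
shared_buffers_kb=$((8 * 1024 * 1024))   # shared_buffers: 8GB
max_connections=100                      # assumed default
parallel_per_gather=2                    # max_parallel_workers_per_gather

work_mem_kb=$(( (total_ram_kb - shared_buffers_kb) / (max_connections * 3) / parallel_per_gather ))
echo "work_mem = ${work_mem_kb}kB"       # prints: work_mem = 41943kB
```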
The pgBackRest docs cover how to stop the pgBackRest processes, i.e. through the pgBackRest stop command. PGO does not provide an interface for that; you would have to exec into the pgBackRest Pod and run the command there. Shutting down your Postgres cluster with the command you present would not stop the backup, and potentially falls into the category of "really bad things".
If a backup is taking "too long" but is still running, you likely need to provide more CPU (and possibly memory) resources to the backup Job and increase the value of the pgBackRest process-max flag. pgBackRest was specifically designed for TB-scale systems, but at that scale it does require turning some of its performance-optimization knobs.
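As a rough back-of-the-envelope on why process-max matters at this scale -- the per-process throughput here is purely an assumed figure (compression-bound; real numbers depend heavily on hardware and compression settings):

```shell
# Estimate full-backup wall time for a ~2TB database at an assumed
# 75 MiB/s per pgBackRest process (a guess; measure your own throughput).
db_mib=$((2048 * 1024))
for procs in 1 3 6; do
  awk -v s="$db_mib" -v p="$procs" \
    'BEGIN { printf "process-max=%d: ~%.1f h\n", p, s / (p * 75) / 3600 }'
done
```

Under those assumptions, going from one process to three cuts a full backup from roughly 8 hours to under 3.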
However, if you need production support, you should really consult the Support page.
I don't think I said "too long" anywhere in my text, not sure who you are quoting. This isn't a qualitative critique.
Regarding backups & provisioning of resources on Job objects -- this is part of the issue. There is no running Job object in the cluster as of now; whenever one is spun up, it is immediately killed due to a conflict with that same long-running backup. The running backup does not correspond to any Job object active in the cluster.
I've read the stop command documentation, and my understanding is that it shuts down all running backrest processes -- my desire is to stop a single backup while retaining the WAL archival process, which still continues unimpeded & as intended. Is it your assertion that there wouldn't be notable consequences to issuing stop and start commands on the backrest-repo deployment, and that I might do so w/o harming the continual operation of the database, or harming the WAL archival process?
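To make the question concrete, the sequence I'm contemplating would look roughly like this -- a sketch only; the service name follows my cluster's v4 shared-repo naming, and the guard makes it a no-op where no cluster is reachable:

```shell
# Sketch of the stop/start sequence in question (names assume my cluster's
# backrest-shared-repo layout; --force terminates in-flight backups).
if command -v kubectl >/dev/null 2>&1; then
  REPO=svc/sample-db-backrest-shared-repo

  # Pause pgBackRest: the running backup is terminated. Note archive-push
  # also errors while stopped, so WAL accumulates on the primary meanwhile.
  kubectl exec -n pgo "$REPO" -- pgbackrest stop --force

  # ...inspect, wait, or clear a stale lock here...

  # Resume normal pgBackRest operation (backups and archiving).
  kubectl exec -n pgo "$REPO" -- pgbackrest start
else
  echo "kubectl not available; sketch only"
fi
```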
As for your advice on the process-max argument in my scheduled jobs -- I already do this & it works excellently; it was a really nice addition in your most recent updates, many thanks for adding it. (For those struggling with performance on large DBs, I use the following arguments: --backup-type=pgbackrest --backup-opts='--type=full --protocol-timeout=604000 --archive-timeout=86000 --db-timeout=600000 --io-timeout=3599 --process-max=3 --repo1-retention-full=2 --stop-auto'.) Unfortunately this backup seems to have been issued at startup and not by one of my scheduled jobs, so none of these opts are applied to the default backup command.
> The backup's seemingly been running for 2 days now

is where I derived "too long", as IMHO that is a long time for a backup to run :wink:
After a downtime event, PGO will automatically take a full backup to ensure that the new archive timeline is represented in a backup.
I didn't interpret this as a "critique"; rather, I was just attempting to provide helpful pointers. It seems like you have already derived most, if not all, of the helpful bits on your own.
Gotcha! I appreciate the pointers & clarification / information you've been able to provide.
Regarding a downtime event & the subsequent auto-triggered backup - to your knowledge, is there a way to either postpone or cancel that initial backup, or manipulate the options used in that backup request?
Lastly - hoping to get your thoughts on this question from the last post:

> Is it your assertion that there wouldn't be any notable consequences to issuing stop and start commands on the backrest-repo deployment, and that I might do so w/o harming the continual operation of the database, or of harm being done to the WAL archival process?
OK - to add another detail you might find interesting: I would expect the output of pgo show backup and pgbackrest info to be the same -- but there's a divergence, and it is in the status field:
Here is the output of the pgo command:
% pgo show backup sample-db
cluster: sample-db
storage type: posix
stanza: db
status: ok
cipher: none
db (current)
...
and here's the output of the pgbackrest one:
% kubectl exec -n pgo -it svc/sample-db-backrest-shared-repo -- /bin/bash
% pgbackrest info
stanza: db
status: ok (backup/expire running)
cipher: none
db (current)
...
I'm not entirely sure why pgo would see a different status, and given that any subsequent attempt to run a backup gets an unable to acquire lock on file '/tmp/pgbackrest/db-backup.lock' error, I would imagine backrest's status is accurate. I had assumed pgo essentially pipes the output of that same pgbackrest info command to the user - what is the difference between pgo show backup and pgbackrest info, and what might the divergence in output point to?
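One way to narrow this down is to check whether the lock file is actually held open by a live process or is merely stale -- a sketch to run inside the backrest-repo pod (assumes fuser is available there):

```shell
# Sketch: is the pgBackRest backup lock held by a live process, or stale?
# Intended to be run inside the backrest-shared-repo pod.
LOCK=/tmp/pgbackrest/db-backup.lock
if [ -e "$LOCK" ]; then
  # fuser exits 0 only if some process currently has the file open
  if fuser "$LOCK" >/dev/null 2>&1; then
    echo "lock held by a live pgbackrest process"
  else
    echo "lock file exists but no process holds it (stale)"
  fi
else
  echo "no lock file present"
fi
```

A stale lock would explain pgbackrest info still reporting "backup/expire running" long after the process died.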
I am running a large instance (~2TB) of Postgres 13, on 4.6.1. pgBackRest seems to often place a lock on '/tmp/pgbackrest/db-backup.lock', which seemingly prevents other pgBackRest commands from running, including backup attempts & attempts to delete old backups. The error text:

The backup's seemingly been running for 2 days now according to ps on the backrest repo pod, and the time coincides with when I most recently ran pgo update cluster sample-db --shutdown, then pgo update cluster sample-db --startup to try and cancel a similarly long-running backup attempt.

How would you recommend debugging this? How can I go about ending a backrest process in a production backrest deployment without potentially causing harm to the cluster's continued operation? I've read through other issues & "restarting pgBackRest" comes up as a suggestion -- but pgo doesn't seem to have a restart command for backrest (no worries at all -- can't expect the operator to do everything). Given the interop going on here, I wanted to inquire as to the proper troubleshooting path before I FUBAR my production db / backup solution.

Thank you! Love the project & intend on supporting it to the best of my ability.