EnterpriseDB / barman

Barman - Backup and Recovery Manager for PostgreSQL
https://www.pgbarman.org/
GNU General Public License v3.0

Backup from standby still waits when there is NO traffic on database #717

Open pru-anixe opened 1 year ago

pru-anixe commented 1 year ago

Hello, according to the documentation:

It is especially important that primary_conninfo is set if the standby is to be backed up when there is little or no write traffic on the primary. If primary_conninfo is not set then the backup will still run however it will wait at the stop backup stage until the current WAL segment on the primary is newer than the latest WAL required by the backup.

This actually works when there is at least a little traffic, but the backup waits endlessly when there is NO traffic at all. I have configured backup from a standby and it works great on instances with normal traffic, but on one cluster we have long periods with zero changes to the database, which causes pg_switch_wal() to do absolutely nothing.

According to the PostgreSQL documentation:

However, if there has been no activity which generates WAL since the last WAL file switch, a switch will not be carried out and the start location of the current WAL file will be returned.
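The interaction between the two quoted rules can be illustrated with a toy model. This is only a sketch of the documented behaviour, not PostgreSQL or Barman code; `ToyWal` and `backup_can_finish` are hypothetical names, and WAL segments are reduced to plain integers.

```python
class ToyWal:
    """Toy model of the documented pg_switch_wal() rule (not PostgreSQL code)."""

    def __init__(self):
        self.segment = 1          # number of the current WAL segment
        self.has_new_wal = False  # True once something was written since the last switch

    def write(self):
        """Simulate any activity that generates WAL."""
        self.has_new_wal = True

    def switch(self):
        """Switch segments only if WAL was generated since the last switch."""
        if not self.has_new_wal:
            return False
        self.segment += 1
        self.has_new_wal = False
        return True


def backup_can_finish(primary_segment, last_required_segment):
    """The wait condition quoted from the Barman documentation."""
    return primary_segment > last_required_segment


wal = ToyWal()
last_required = wal.segment

wal.switch()  # idle primary: nothing happens
print(backup_can_finish(wal.segment, last_required))  # False

wal.write()   # a single change on the primary
wal.switch()
print(backup_can_finish(wal.segment, last_required))  # True
```

In this model the backup stays blocked for as long as the primary is idle, and one write followed by a switch is enough to release it.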

The same applies with archive_timeout set to a non-zero value:

When this parameter is greater than zero, the server will switch to a new segment file whenever this amount of time has elapsed since the last segment file switch, and there has been any database activity

The backup hangs in the STARTED state.

When I force a change on the database, the backup completes.

executed command

barman backup standby_host --reuse-backup=link

configuration file

[standby_host]
active = "True"
description = "Backup of sql prod"
ssh_command = ssh postgres@standby_host
conninfo = "host=standby_host user=barman dbname=postgres"
primary_conninfo = "host=primary_host user=barman dbname=postgres"
backup_method = "rsync"
reuse_backup = "link"
backup_options = "concurrent_backup"
archiver = on
minimum_redundancy = "1"
retention_policy = "REDUNDANCY 7"

PostgreSQL configuration on standby

archive_command = 'barman-wal-archive barman standby_host %p'
archive_mode = always
wal_level = replica

mikewallace1979 commented 1 year ago

@pru-anixe thanks for the detailed bug report - I think Barman could do better here by calling checkpoint; before it starts trying to switch the WAL on the primary. That would cause a checkpoint to be created even if there has been no activity - the subsequent pg_switch_wal() call would then switch to a new WAL and allow the backup to complete.

This should probably be an optional behaviour enabled by a new server option, since forcing a checkpoint is unlikely to be the right thing to do for a busy primary.
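The proposal above can be sketched as follows. This is an illustration only, not Barman's code: `checkpoint_then_switch` and its `force_checkpoint` flag are hypothetical names, and `cursor` is assumed to be any Python DB-API cursor connected to the primary with sufficient privileges to run both statements.

```python
def checkpoint_then_switch(cursor, force_checkpoint=False):
    """Switch the WAL on the primary, optionally forcing a checkpoint first.

    `force_checkpoint` stands in for the optional server option proposed in
    this thread; it is off by default so a busy primary is left alone.
    """
    if force_checkpoint:
        # Issue a checkpoint before asking for the switch, as suggested above.
        cursor.execute("CHECKPOINT")
    cursor.execute("SELECT pg_switch_wal()")
    # pg_switch_wal() returns a single row with a single LSN column.
    return cursor.fetchone()[0]
```

Keeping the flag off by default follows the point made above that forcing a checkpoint is unlikely to suit a busy primary.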

pru-anixe commented 1 year ago

@mikewallace1979 that's fine for me. I'd see this as a kind of timeout option: force a checkpoint if a specified time has passed and no new WAL has arrived. This could come with a suggestion that the value should be greater than the archive_timeout value in the PostgreSQL config.
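A minimal sketch of that timeout rule, assuming times are given in seconds. The function and parameter names are hypothetical and do not correspond to any existing Barman option.

```python
def should_force_checkpoint(seconds_waited, checkpoint_timeout, wal_arrived):
    """Decide whether to force a checkpoint while waiting for WAL.

    checkpoint_timeout=None means the (hypothetical) option is disabled.
    """
    if checkpoint_timeout is None:
        return False
    if wal_arrived:
        # New WAL has arrived, so there is nothing to force.
        return False
    return seconds_waited >= checkpoint_timeout


def timeout_is_sensible(checkpoint_timeout, archive_timeout):
    """Check the suggestion that the timeout should exceed archive_timeout."""
    return checkpoint_timeout > archive_timeout
```

For example, with a timeout of 300 seconds, `should_force_checkpoint(120, 300, False)` returns `False` and `should_force_checkpoint(300, 300, False)` returns `True`.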

pru-anixe commented 1 year ago

Also, I now see that if I force a change on the database, the backup goes into the WAITING_FOR_WALS state and, I guess, stays there forever.

barman list-backup all
standby_host 20221215T133240 - Thu Dec 15 12:38:55 2022 - Size: 15.4 GiB - WAL Size: 96.0 MiB
standby_host 20221216T130948 - Fri Dec 16 13:09:50 2022 - Size: 2.8 GiB - WAL Size: 0 B - WAITING_FOR_WALS
standby_host 20221216T110848 - Fri Dec 16 11:08:50 2022 - Size: 2.8 GiB - WAL Size: 48.0 MiB - WAITING_FOR_WALS

mikewallace1979 commented 1 year ago

That's not expected behaviour - the state should change to DONE when the next WAL is archived, so at the very least the previous backup should enter a DONE state when a new backup is taken.

Can you run barman cron with debug logging enabled? There should be some lines in the log such as Checking backup 20221216T110848 of server standby_host and Check finished: the status of backup ... which should help figure out what is going on here.

pru-anixe commented 1 year ago

Here's what it looks like after a few days on a database with no traffic:

standby_host 20221218T093103 - STARTED
standby_host 20221217T103059 - STARTED
standby_host 20221216T130948 - Fri Dec 16 13:09:50 2022 - Size: 2.8 GiB - WAL Size: 0 B - WAITING_FOR_WALS
standby_host 20221216T110848 - Fri Dec 16 11:08:50 2022 - Size: 2.8 GiB - WAL Size: 48.0 MiB - WAITING_FOR_WALS

But after running the cron command, the statuses changed:

2022-12-18 20:08:00,813 [74528] barman.wal_archiver INFO: No xlog segments found from file archival for standby_host.
2022-12-18 20:08:00,818 [74529] barman.server DEBUG: Check finished: the status of backup 20221216T130948 of server standby_host changed from WAITING_FOR_WALS to DONE

mikewallace1979 commented 1 year ago

Ok that does sound like expected behaviour - the barman cron job checks the status of all backups against the WALs in the archive and updates it accordingly.
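To confirm from the logs that barman cron is doing these updates, the status transitions can be extracted with a short script. This is a sketch that assumes the message wording shown in the log excerpt in this thread; `parse_status_change` is a hypothetical helper, not part of Barman.

```python
import re

# Matches the "Check finished" debug message quoted earlier in this thread.
STATUS_CHANGE = re.compile(
    r"the status of backup (?P<backup_id>\S+) of server (?P<server>\S+) "
    r"changed from (?P<old>\w+) to (?P<new>\w+)"
)


def parse_status_change(line):
    """Return the fields of a status-change log line, or None if it is not one."""
    match = STATUS_CHANGE.search(line)
    return match.groupdict() if match else None
```

Lines that do not report a status change, such as the "No xlog segments found" message, yield `None`.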

The Barman rpm and deb packages install a cron job which runs barman cron every 60 seconds so you might want to set up something similar for your installation, potentially with a longer interval than 60s.

pru-anixe commented 1 year ago

Thanks for the clarification. The initial installation was definitely via apt, but some upgrades happened via pip. Maybe this caused the cron entry to go missing. I reinstalled it via apt yesterday and today I don't see any WAITING_FOR_WALS backups.

martinmarques commented 1 month ago

I think that executing a checkpoint is not the right solution in general. I don't know how we could fix this without harming other workloads. In general, I would suggest "not" taking a backup of an inactive server. There is nothing to back up. Maybe we can add an option that checks LSNs and skips a backup if there's been zero activity since the last backup.
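A minimal sketch of such an LSN check, assuming LSNs in PostgreSQL's textual `X/Y` hexadecimal form. Both function names are hypothetical, and which LSN of the previous backup to compare against (its end LSN is assumed here) would be a design decision for Barman.

```python
def lsn_to_int(lsn):
    """Convert a textual LSN such as '0/3000060' to an integer byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def should_skip_backup(current_lsn, last_backup_end_lsn):
    """Skip when the server has not moved past the end of the last backup."""
    return lsn_to_int(current_lsn) <= lsn_to_int(last_backup_end_lsn)
```

For example, `should_skip_backup("0/3000060", "0/3000060")` returns `True`, while `should_skip_backup("0/3000100", "0/3000060")` returns `False`.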