Open pru-anixe opened 1 year ago
@pru-anixe thanks for the detailed bug report - I think Barman could do better here by calling checkpoint; before it starts trying to switch the WAL on the primary. That would cause a checkpoint to be created even if there has been no activity - the subsequent pg_switch_wal() call would then switch to a new WAL and allow the backup to complete.
This should probably be an optional behaviour enabled by a new server option, since forcing a checkpoint is unlikely to be the right thing to do for a busy primary.
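A minimal sketch of the proposed ordering, for illustration only. The helper name checkpoint_then_switch_wal is hypothetical and is not part of Barman; it assumes a DB-API cursor (for example from psycopg2) connected to the primary with sufficient privileges to run both statements.

```python
def checkpoint_then_switch_wal(cursor):
    """Force a checkpoint, then ask PostgreSQL to switch WAL segments.

    `cursor` is assumed to be a DB-API cursor connected to the primary.
    Returns the value reported by pg_switch_wal(), or None if no row came back.
    This is a hypothetical helper, not Barman's implementation.
    """
    # Per the proposal above, the checkpoint is issued first so that the
    # following pg_switch_wal() call has something to switch away from.
    cursor.execute("CHECKPOINT")
    cursor.execute("SELECT pg_switch_wal()")
    row = cursor.fetchone()
    return row[0] if row else None
```

Whether this should run at all would be governed by the optional server setting suggested above, since a forced checkpoint may be unwelcome on a busy primary.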
@mikewallace1979 that's fine for me. I'd see this as a kind of timeout option: force a checkpoint if the specified time has passed and no new WAL has arrived. This could come with a suggestion that the value should be greater than the archive_timeout value in the PostgreSQL config.
Also, I now see that if I force a change on the database, the backup goes into the WAITING_FOR_WALS state and, I guess, stays there forever:
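The timeout idea could be sketched roughly as follows. The function and parameter names are hypothetical (no such Barman option exists in this thread), and treating a value at or below archive_timeout as an error, rather than a warning, is an assumption made for the example.

```python
def should_force_checkpoint(waited_s, new_wal_arrived, force_after_s, archive_timeout_s):
    """Decide whether to force a checkpoint while waiting for a WAL switch.

    waited_s          -- seconds spent waiting so far
    new_wal_arrived   -- True if a new WAL file has been received meanwhile
    force_after_s     -- hypothetical timeout option discussed above
    archive_timeout_s -- PostgreSQL's archive_timeout, in seconds (0 = disabled)
    """
    # Assumption: enforce the suggestion that the timeout exceeds archive_timeout,
    # so PostgreSQL gets a chance to switch WAL on its own first.
    if force_after_s <= archive_timeout_s:
        raise ValueError("force_after_s should be greater than archive_timeout_s")
    # Only force a checkpoint when nothing has arrived and the timeout has passed.
    return (not new_wal_arrived) and waited_s >= force_after_s
```

For example, with archive_timeout at 60 seconds and the hypothetical option at 90 seconds, a checkpoint would only be forced after 90 seconds without any new WAL.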
barman list-backup all
standby_host 20221215T133240 - Thu Dec 15 12:38:55 2022 - Size: 15.4 GiB - WAL Size: 96.0 MiB
standby_host 20221216T130948 - Fri Dec 16 13:09:50 2022 - Size: 2.8 GiB - WAL Size: 0 B - WAITING_FOR_WALS
standby_host 20221216T110848 - Fri Dec 16 11:08:50 2022 - Size: 2.8 GiB - WAL Size: 48.0 MiB - WAITING_FOR_WALS
That's not expected behaviour - the state should change to DONE when the next WAL is archived, so at the very least the previous backup should enter a DONE state when a new backup is taken.
Can you run barman cron with debug logging enabled? There should be some lines in the log such as Checking backup 20221216T110848 of server standby_host and Check finished: the status of backup ... which should help figure out what is going on here.
Here's what it looks like after a few days on a database with no traffic:
standby_host 20221218T093103 - STARTED
standby_host 20221217T103059 - STARTED
standby_host 20221216T130948 - Fri Dec 16 13:09:50 2022 - Size: 2.8 GiB - WAL Size: 0 B - WAITING_FOR_WALS
standby_host 20221216T110848 - Fri Dec 16 11:08:50 2022 - Size: 2.8 GiB - WAL Size: 48.0 MiB - WAITING_FOR_WALS
But after running the cron command, the statuses changed:
2022-12-18 20:08:00,813 [74528] barman.wal_archiver INFO: No xlog segments found from file archival for standby_host.
2022-12-18 20:08:00,818 [74529] barman.server DEBUG: Check finished: the status of backup 20221216T130948 of server standby_host changed from WAITING_FOR_WALS to DONE
Ok that does sound like expected behaviour - the barman cron job checks the status of all backups against the WALs in the archive and updates it accordingly.
The Barman rpm and deb packages install a cron job which runs barman cron every 60 seconds, so you might want to set up something similar for your installation, potentially with a longer interval than 60s.
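If barman cron is not running regularly, backups can linger in STARTED or WAITING_FOR_WALS, as in the listings earlier in this thread. A rough sketch for spotting them from barman list-backup output follows; the function name is hypothetical and the parsing assumes the " - " separated line layout shown above, which may differ between Barman versions.

```python
def unfinished_backups(list_backup_output):
    """Return (server, backup_id, status) for backups that are not finished.

    Assumes each line looks like the `barman list-backup` output quoted in
    this thread: "<server> <backup_id> - ... - <STATUS>", where finished
    backups carry no trailing status field.
    """
    pending = []
    for line in list_backup_output.splitlines():
        fields = [field.strip() for field in line.split(" - ")]
        if len(fields) < 2:
            continue  # blank or unrecognised line
        head = fields[0].split()
        if len(head) != 2:
            continue  # expected exactly "<server> <backup_id>"
        status = fields[-1]
        if status in ("STARTED", "WAITING_FOR_WALS"):
            pending.append((head[0], head[1], status))
    return pending
```

A monitoring script could feed the command's output to this function and alert when the returned list stays non-empty for longer than expected.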
Thanks for the clarification. The initial installation was definitely done via apt, but some upgrades were done via pip. Maybe this caused the cron entry to go missing. I reinstalled via apt yesterday and today I don't see any WAITING_FOR_WALS backups.
I think that executing a checkpoint is not the right solution in general. I don't know how we could fix this without harming other workloads. In general, I would suggest not taking a backup of an inactive server: there is nothing to back up. Maybe we can add an option that checks LSNs and skips a backup if there has been zero activity since the last backup.
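The LSN check could look something like the sketch below. The function names are hypothetical, and a real implementation would need to decide what counts as zero activity (taking a backup may itself write some WAL); the sketch only shows how two LSNs in PostgreSQL's textual "hi/lo" hexadecimal form can be compared.

```python
def parse_lsn(lsn):
    """Convert a PostgreSQL LSN such as "0/3000028" to an integer.

    The text form is two hexadecimal numbers separated by a slash: the
    high 32 bits and the low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def has_activity_since(last_backup_lsn, current_lsn):
    """Return True if the WAL position has advanced past the last backup.

    Hypothetical helper: how the two LSNs are obtained, and how much
    advance should count as real activity, is left open here.
    """
    return parse_lsn(current_lsn) > parse_lsn(last_backup_lsn)
```

A backup could then be skipped when has_activity_since() returns False, which avoids forcing a checkpoint on the server at all.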
Hello, according to documentation
This actually works when there is at least a little traffic. The backup waits endlessly when there is NO traffic at all. I have configured backup from a standby and it works great for instances with normal traffic, but on one cluster we have long periods with zero changes on the database, which causes pg_switch_wal() to do absolutely nothing. According to the PostgreSQL documentation:
Also, with archive_timeout set to a non-zero value:
The backup hangs in the STARTED state.
When I force a change on the database, the backup completes.
executed command
configuration file
PostgreSQL configuration on standby