galaxyproject / ansible-postgresql

An Ansible role for managing a PostgreSQL server
https://galaxy.ansible.com/galaxyproject/postgresql
122 stars 58 forks source link

Overhaul and modernize backups #30

Closed natefoo closed 2 years ago

natefoo commented 2 years ago

The method that was previously in use has been deprecated since PostgreSQL 9.6 (although still supported as of 14). But we were also incorrectly storing our archived WAL segments inside the backups. I am not sure if it would be possible to recover from the old backups since initiating the backup usually causes the archival of one segment to fail as the current directory is rotated. The new archive script stores these in a separate wal_archive directory. The backup script now excludes a bunch of stuff that the documentation says to exclude.

Unfortunately, the backup script is now written in Python. I really didn't want to do this, but because the start and stop SQL have to be performed in the same database connection, and the backup must be performed in between, it meant I'd have to do a lot of process and filehandle control in shell that I wanted to do even less than writing this in Python. The only non-standard-lib dependency is psycopg2, which can be installed from the PGDG repos on EL and is part of the standard Galaxy Ansible setup anyway since you need it for the postgresql_objects role. The backup script is compatible with Python 2.7 (the default on EL 7) or 3.5+. As such I haven't provided any functionality to create a virtualenv or pip install psycopg2 or anything like that. This script should run using the system Python as long as you install the system psycopg2. If you need to use a different Python you can override postgresql_backup_command to do so.

I could find no documentation saying that you should make backups of the current active WAL segment, so that script and cron job have been dropped. I don't know where I got that idea from, but it's not in the documentation as far as I can tell.

You can now prune old backups - specify the number to keep with postgresql_backup_keep (default 30), there is no time-based option. Remove --keep from the command line by overriding postgresql_backup_command to disable.

The WAL archive will also be pruned so that only WAL segments newer than the oldest backup will be kept, but this functionality only works if the backup dir is a mounted filesystem. Because the rest of the backup functionality depends on rsync and the pruning is done with the standard pg_archivecleanup PostgreSQL utility, I didn't want to add separate options for specifying ssh connection params and then running the utility (which may not be installed on the remote) over SSH. If someone wants to add that feature it'd be welcome. You can also just copy backup.py to somewhere with the archive mounted and run with just --clean-archive (without --backup and --keep), but again, pg_archivecleanup must be available.

We should probably actually test restoring a backup to make sure this is all correct...

PITR Documentation for reference.

natefoo commented 2 years ago

No automated test but I did the following locally:

  1. Take a regular backup using backup.py --backup /tmp/pgbackup
  2. Stop postgres
  3. Nuke the data directory
  4. Restore the data directory with rsync -av /tmp/pgbackup/20211108T212616Z/ /var/lib/postgresql/13/main
  5. Set restore_command = 'cp "/tmp/pgbackup/wal_archive/%f" "%p"'
  6. touch /var/lib/postgresql/13/main/recovery.signal
  7. Start postgres

It recovered successfully and the data was intact.

hexylena commented 2 years ago

That's a pretty comprehensive list of steps, could probs do that in gh actions.

natefoo commented 2 years ago

True. It'd be nice to test generating a WAL during the backup, or at least generating one after the backup, and making sure that gets included.