EnterpriseDB / barman

Barman - Backup and Recovery Manager for PostgreSQL
https://www.pgbarman.org/
GNU General Public License v3.0

Modify retention policies so incremental backups are ignored #950

Closed · andremagui closed 3 months ago

andremagui commented 3 months ago

Starting with PostgreSQL 17, incremental backups (available when backup_method="postgres") form a continuous chain that begins with a full backup and extends through one or more incremental backups. Consequently, retention policies should target only full backups for deletion, as incremental backups rely on their parent chain of backups to remain valid and usable.

This commit:

  1. Adds a new method to the RetentionPolicy class: propagate_retention_status_to_children.
  2. Modifies the _backup_report method of RecoveryWindowRetentionPolicy and RedundancyRetentionPolicy.
  3. Adds properties to LocalBackupInfo in infofile.py: is_incremental.
  4. Adds a new parameter to walk_backups_tree in infofile.py.

References: BAR-166

Signed-off-by: Andre andre.marchesini@enterprisedb.com
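
For illustration, here is a minimal, self-contained sketch of the propagation idea from point 1. The class below is a stand-in modelled on the description (the real LocalBackupInfo and RetentionPolicy classes differ), and only the field names shown in backup.info (parent_backup_id, children_backup_ids) are taken from Barman itself:

```python
# Standalone sketch (not Barman's actual code): whatever retention status a
# full backup receives is pushed down to every incremental descending from it.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

VALID, OBSOLETE = "VALID", "OBSOLETE"


@dataclass
class BackupNode:  # stand-in for LocalBackupInfo
    backup_id: str
    parent_backup_id: Optional[str] = None
    children_backup_ids: List[str] = field(default_factory=list)

    @property
    def is_incremental(self) -> bool:
        # A block-level incremental backup always references a parent backup.
        return self.parent_backup_id is not None


def propagate_retention_status_to_children(
    backup: BackupNode,
    catalog: Dict[str, BackupNode],
    report: Dict[str, str],
) -> None:
    """Give every descendant of `backup` the same retention status as `backup`."""
    status = report[backup.backup_id]
    stack = list(backup.children_backup_ids)
    while stack:
        child_id = stack.pop()
        report[child_id] = status
        stack.extend(catalog[child_id].children_backup_ids)
```

Marking a full backup OBSOLETE then marks its whole incremental chain OBSOLETE as well, which is what keeps the chain consistent.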

andremagui commented 3 months ago

Since this ticket is about ignoring incremental backups, I thought about implementing this in retention_policies.py instead of in the cron_retention_policy() method in backup.py.

barthisrael commented 3 months ago

I tested this and the implementation is broken. I took the following backups:

$ egrep "children|parent" */backup.info
20240627T001815/backup.info:children_backup_ids=20240627T001833
20240627T001815/backup.info:parent_backup_id=None
20240627T001833/backup.info:children_backup_ids=20240627T001836
20240627T001833/backup.info:parent_backup_id=20240627T001815
20240627T001836/backup.info:children_backup_ids=None
20240627T001836/backup.info:parent_backup_id=20240627T001833
20240627T001840/backup.info:children_backup_ids=20240627T001846
20240627T001840/backup.info:parent_backup_id=None
20240627T001846/backup.info:children_backup_ids=20240627T001849
20240627T001846/backup.info:parent_backup_id=20240627T001840
20240627T001849/backup.info:children_backup_ids=None
20240627T001849/backup.info:parent_backup_id=20240627T001846
20240627T001852/backup.info:children_backup_ids=20240627T002214
20240627T001852/backup.info:parent_backup_id=None
20240627T002214/backup.info:children_backup_ids=20240627T002217
20240627T002214/backup.info:parent_backup_id=20240627T001852
20240627T002217/backup.info:children_backup_ids=None
20240627T002217/backup.info:parent_backup_id=20240627T002214

When I attempt to run barman list-backup to check their status, the command fails:

$ barman list-backup pg17
EXCEPTION: '20240627T002217'
See log file for more details.

The logs:

$ tail -10 /var/log/barman/barman.log
2024-06-27 00:27:41,768 [154975] barman.cli ERROR: '20240627T002217'
See log file for more details.
Traceback (most recent call last):
  File "/home/vagrant/Repositories/edb/barman/barman/cli.py", line 2416, in main
    args.func(args)
  File "/home/vagrant/Repositories/edb/barman/barman/cli.py", line 595, in list_backups
    server.list_backups()
  File "/home/vagrant/Repositories/edb/barman/barman/server.py", line 1770, in list_backups
    and retention_status[backup.backup_id] != BackupInfo.VALID
KeyError: '20240627T002217'

The problem is Server.list_backups is comparing two different lists of backups:

  • the backups variable contains them all;
  • the retention_status variable contains only the full backups.

I think we should change the implementation here.

Instead of using source = self.server.full_backups in barman.retention_policies.RetentionPolicy.report, I think we should still use all of the backups. However, when it comes to actually taking them into consideration for retention policies, we should only care about full backups.

Said in other words:

  • Ignore incremental backups when checking for redundancy or recovery window, and take only full backups into consideration for that purpose;
  • Whenever a full backup is considered obsolete by the retention policy, mark all of its children in the tree as obsolete too.

Thoughts @andremagui @gcalacoci ?
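
As a rough illustration of that proposal (placeholder structures only, not Barman's real classes or method signatures): the report covers every backup, only full backups are counted against REDUNDANCY, and an obsolete full drags its incremental descendants with it.

```python
# Illustrative sketch only: report over *all* backups, count only full backups
# against the redundancy limit, then propagate OBSOLETE to their children.
from typing import Dict

VALID, OBSOLETE = "VALID", "OBSOLETE"


def redundancy_report(tree: Dict[str, dict], redundancy: int) -> Dict[str, str]:
    """tree maps backup_id -> {"parent": backup_id or None, "children": [backup_id, ...]}."""
    report: Dict[str, str] = {}
    # Backup IDs are timestamps, so a reverse sort lists full backups newest first.
    fulls = sorted((b for b, info in tree.items() if info["parent"] is None), reverse=True)
    for position, full_id in enumerate(fulls):
        status = VALID if position < redundancy else OBSOLETE
        report[full_id] = status
        stack = list(tree[full_id]["children"])
        while stack:  # depth-first walk over the incremental descendants
            child = stack.pop()
            report[child] = status
            stack.extend(tree[child]["children"])
    return report
```

Because every backup id ends up with an entry in the report, commands that iterate over all backups (such as list-backup above) would no longer hit a missing key.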

gcalacoci commented 3 months ago

> I think we should change the implementation here.
>
> Instead of using source = self.server.full_backups in barman.retention_policies.RetentionPolicy.report, I think we should still use all of the backups. However, when it comes to actually taking them into consideration for retention policies, we should only care about full backups.
>
> Said in other words:
>
>   • Ignore incremental backups when checking for redundancy or recovery window, and take only full backups into consideration for that purpose;
>   • Whenever a full backup is considered obsolete by the retention policy, mark all of its children in the tree as obsolete too.

Yeah, totally makes sense and I agree.

andremagui commented 3 months ago

> I tested this and the implementation is broken. [...]
>
> I think we should change the implementation here. [...]
>
> Thoughts @andremagui @gcalacoci ?

Makes total sense and it is safer!

barthisrael commented 3 months ago

Something seems still broken here.

  • On master:
$ barman list-backups pg17
pg17 20240628T000532 'yet_another_full' - Thu Jun 27 22:05:34 2024 - Size: 251.7 MiB - WAL Size: 0 B - WAITING_FOR_WALS
pg17 20240628T000524 'another_incremental' - Thu Jun 27 22:05:26 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000511 'another_full' - Thu Jun 27 22:05:13 2024 - Size: 251.7 MiB - WAL Size: 32.0 MiB
  • On the dev branch:
$ barman list-backups pg17
pg17 20240628T000532 'yet_another_full' - Thu Jun 27 22:05:34 2024 - Size: 251.7 MiB - WAL Size: 0 B - WAITING_FOR_WALS
EXCEPTION: '20240628T000524'
See log file for more details.
$ tail -10 /var/log/barman/barman.log
2024-06-28 03:11:04,095 [28079] barman.cli ERROR: '20240628T000524'
See log file for more details.
Traceback (most recent call last):
  File "/home/vagrant/Repositories/edb/barman/barman/cli.py", line 2416, in main
    args.func(args)
  File "/home/vagrant/Repositories/edb/barman/barman/cli.py", line 595, in list_backups
    server.list_backups()
  File "/home/vagrant/Repositories/edb/barman/barman/server.py", line 1762, in list_backups
    and retention_status[backup.backup_id] != BackupInfo.VALID
KeyError: '20240628T000524'

barthisrael commented 3 months ago

The issue mentioned in the previous reply occurs as soon as I take the first incremental backup.

gcalacoci commented 3 months ago

I added some comments to the PR. Overall, I feel we should sort of "restart" from the drawing board. I share @barthisrael's sense that this is inefficient and buggy, and sometimes just starting over is the way to go.

andremagui commented 3 months ago

I have made the changes according to the reviews! I will rebase everything afterwards, once all is OK! @barthisrael @gcalacoci I will still look into some tests that might need to be added for this.

andremagui commented 3 months ago

> Something seems still broken here. [...]
>
> KeyError: '20240628T000524'

Hey, this bug is related to the fact that list_backups builds the retention report at the beginning and then checks the available backups against the ones that were reported. So we actually need to either add all incrementals to the report as well, or change the list_backups method of Server, or even create another retention status for incrementals that are not affected by the policy and add them to the report. What are your thoughts on this?
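
A toy reproduction of that mismatch, with hypothetical data, just to make the failure mode concrete (the real fix could be any of the options above):

```python
# Toy reproduction of the KeyError described above (hypothetical data).
# The listing loop walks over *all* backups, but the retention report was
# built from full backups only, so the first incremental id is missing.
VALID = "VALID"

all_backups = [
    "20240628T000511",  # another_full
    "20240628T000524",  # another_incremental (child of the full above)
    "20240628T000532",  # yet_another_full
]
retention_status = {"20240628T000511": VALID, "20240628T000532": VALID}  # fulls only

for backup_id in all_backups:
    # retention_status[backup_id] raises KeyError for '20240628T000524'.
    # Covering incrementals in the report (or using a safe lookup such as
    # retention_status.get(backup_id)) avoids the crash.
    print(backup_id, retention_status.get(backup_id, "<missing from report>"))
```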

barthisrael commented 3 months ago

Have you by chance applied changes locally and forgotten to commit them? I can see several conversations marked as resolved, but no actual changes to the code.

Besides that, I'd like to suggest a couple of things:

  1. If we send a suggestion through the PR, you can use the Add suggestion to batch button on each of them, followed by the Commit suggestions button, to apply them all, and then git pull to fetch the commit created by GitHub. That avoids having to write the same code again in your local repo and tends to be faster for you.
  2. Avoid force pushes in between reviews. When you force push, it becomes harder for us to re-review the code. Let's only use force pushes when rebasing the branch on top of another branch, or at the end of the PR, right before merging it.

barthisrael commented 3 months ago

The implementation itself seems to be working fine, at least with a redundancy retention policy.

For example, with 4 full backups, one of them with KEEP, and a retention policy of REDUNDANCY 2:

$ barman list-backup pg17
pg17 20240702T224908 'yet_another_incr_backup' - Tue Jul  2 20:49:10 2024 - Size: 35.3 MiB - WAL Size: 0 B
pg17 20240702T224830 'yet_another_full_backup' - Tue Jul  2 20:48:33 2024 - Size: 251.2 MiB - WAL Size: 32.0 MiB
pg17 20240702T224631 'another_incr_backup_1' - Tue Jul  2 20:46:32 2024 - Size: 35.3 MiB - WAL Size: 48.0 MiB
pg17 20240702T224353 'another_full_backup' - Tue Jul  2 20:43:55 2024 - Size: 251.2 MiB - WAL Size: 32.0 MiB
pg17 20240702T224345 'incr_backup_2' - Tue Jul  2 20:43:47 2024 - Size: 35.3 MiB - WAL Size: 32.0 MiB - OBSOLETE
pg17 20240702T224339 'incr_backup_1' - Tue Jul  2 20:43:41 2024 - Size: 35.3 MiB - WAL Size: 32.0 MiB - OBSOLETE
pg17 20240702T224322 'full_backup' - Tue Jul  2 20:43:24 2024 - Size: 251.2 MiB - WAL Size: 32.0 MiB - OBSOLETE
pg17 20240702T223430 'child2.1' - Tue Jul  2 20:34:32 2024 - Size: 35.3 MiB - WAL Size: 48.0 MiB
pg17 20240702T223422 'child2' - Tue Jul  2 20:34:23 2024 - Size: 35.3 MiB - WAL Size: 32.0 MiB
pg17 20240702T223414 'child1.1.2' - Tue Jul  2 20:34:15 2024 - Size: 35.3 MiB - WAL Size: 32.0 MiB
pg17 20240702T223408 'child1.1.1' - Tue Jul  2 20:34:10 2024 - Size: 35.3 MiB - WAL Size: 32.0 MiB
pg17 20240702T223359 'child1.1' - Tue Jul  2 20:34:00 2024 - Size: 35.3 MiB - WAL Size: 48.0 MiB
pg17 20240702T223325 'child1' - Tue Jul  2 20:33:26 2024 - Size: 35.3 MiB - WAL Size: 32.0 MiB
pg17 20240702T223314 'root' - Tue Jul  2 20:33:16 2024 - Size: 251.2 MiB - WAL Size: 32.0 MiB - KEEP:STANDALONE
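
Read against the policy, the OBSOLETE markers above line up with a simple count over the full backups. A rough check, under the assumption that the KEEP-annotated backup is simply exempt from the redundancy count:

```python
# Rough check of the listing above (illustrative; assumes KEEP backups are
# exempt from the policy and never marked obsolete).
redundancy = 2
fulls_newest_first = [
    ("20240702T224830", False),  # yet_another_full_backup
    ("20240702T224353", False),  # another_full_backup
    ("20240702T224322", False),  # full_backup
    ("20240702T223314", True),   # root, KEEP:STANDALONE
]

obsolete, kept = [], 0
for backup_id, has_keep in fulls_newest_first:
    if has_keep:
        continue  # protected by the keep annotation
    if kept < redundancy:
        kept += 1  # within the redundancy window, stays VALID
    else:
        obsolete.append(backup_id)  # and so do its incremental children

print(obsolete)  # ['20240702T224322'], plus incr_backup_1 and incr_backup_2 via propagation
```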

When barman cron runs, it deletes the OBSOLETE backups:

$ barman cron
Starting WAL archiving for server pg17
Enforcing retention policy: removing backup 20240702T224322 for server pg17
Deleting backup 20240702T224322 for server pg17
Deleted backup 20240702T224322 (start time: Tue Jul  2 23:02:19 2024, elapsed time: less than one second)
Enforcing retention policy: removing backup 20240702T224339 for server pg17
Deleting backup 20240702T224339 for server pg17
Deleted backup 20240702T224339 (start time: Tue Jul  2 23:02:19 2024, elapsed time: less than one second)
Enforcing retention policy: removing backup 20240702T224345 for server pg17
Deleting backup 20240702T224345 for server pg17
Deleted backup 20240702T224345 (start time: Tue Jul  2 23:02:19 2024, elapsed time: less than one second)

Tomorrow I'll take a look at the unit tests again. I'm fine with the code changes.