Closed andremagui closed 3 months ago
Since this ticket is about ignoring incremental backups, I thought about implementing this in retention_policies.py instead of doing it in the backup.py -> cron_retention_policy() method.
I tested this and the implementation is broken. I took the following backups:
$ egrep "children|parent" */backup.info
20240627T001815/backup.info:children_backup_ids=20240627T001833
20240627T001815/backup.info:parent_backup_id=None
20240627T001833/backup.info:children_backup_ids=20240627T001836
20240627T001833/backup.info:parent_backup_id=20240627T001815
20240627T001836/backup.info:children_backup_ids=None
20240627T001836/backup.info:parent_backup_id=20240627T001833
20240627T001840/backup.info:children_backup_ids=20240627T001846
20240627T001840/backup.info:parent_backup_id=None
20240627T001846/backup.info:children_backup_ids=20240627T001849
20240627T001846/backup.info:parent_backup_id=20240627T001840
20240627T001849/backup.info:children_backup_ids=None
20240627T001849/backup.info:parent_backup_id=20240627T001846
20240627T001852/backup.info:children_backup_ids=20240627T002214
20240627T001852/backup.info:parent_backup_id=None
20240627T002214/backup.info:children_backup_ids=20240627T002217
20240627T002214/backup.info:parent_backup_id=20240627T001852
20240627T002217/backup.info:children_backup_ids=None
20240627T002217/backup.info:parent_backup_id=20240627T002214
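For reference, the parent_backup_id fields above form three independent chains, each rooted at a full backup. A small sketch (hypothetical helper, not Barman code) that reconstructs the chains from that data:

```python
# Rebuild backup chains from the parent_backup_id fields shown above.
# backup_id -> parent_backup_id (None for full backups)
parents = {
    "20240627T001815": None,
    "20240627T001833": "20240627T001815",
    "20240627T001836": "20240627T001833",
    "20240627T001840": None,
    "20240627T001846": "20240627T001840",
    "20240627T001849": "20240627T001846",
    "20240627T001852": None,
    "20240627T002214": "20240627T001852",
    "20240627T002217": "20240627T002214",
}

def chain_of(backup_id):
    """Walk up to the root full backup, returning the whole chain."""
    chain = [backup_id]
    while parents[chain[-1]] is not None:
        chain.append(parents[chain[-1]])
    return list(reversed(chain))

full_backups = [b for b, p in parents.items() if p is None]
# Each incremental depends on every ancestor up to its root full backup,
# which is why retention cannot delete a full backup in isolation.
```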
When I attempt to run barman list-backup to check their status, the command fails:
$ barman list-backup pg17
EXCEPTION: '20240627T002217'
See log file for more details.
The logs:
$ tail -10 /var/log/barman/barman.log
2024-06-27 00:27:41,768 [154975] barman.cli ERROR: '20240627T002217'
See log file for more details.
Traceback (most recent call last):
File "/home/vagrant/Repositories/edb/barman/barman/cli.py", line 2416, in main
args.func(args)
File "/home/vagrant/Repositories/edb/barman/barman/cli.py", line 595, in list_backups
server.list_backups()
File "/home/vagrant/Repositories/edb/barman/barman/server.py", line 1770, in list_backups
and retention_status[backup.backup_id] != BackupInfo.VALID
KeyError: '20240627T002217'
The problem is that Server.list_backup is comparing two different lists of backups:

- the backups variable contains all of them;
- the retention_status variable contains only the full backups.

I think we should change the implementation here. Instead of using source = self.server.full_backups in barman.retention_policies.RetentionPolicy.report, I think we should still use all of the backups. However, when it comes to actually taking them into consideration for retention policies, we should only care about full backups.
Said in other words:
- Ignore incremental backups when checking for redundancy or recovery window, and take only full backups into consideration for that purpose;
- Whenever a full backup is considered obsolete by the retention policy, mark all of its children in the tree as obsolete too.
Thoughts @andremagui @gcalacoci ?
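For illustration, the idea in those two bullets could be sketched roughly like this (simplified stand-ins for Barman's retention report, not the actual implementation):

```python
# Hypothetical sketch: only full backups count toward redundancy, and
# an obsolete full backup propagates OBSOLETE to its whole subtree.
VALID, OBSOLETE = "VALID", "OBSOLETE"

def report(backups, redundancy):
    """backups: id -> {"parent": id or None, "children": [ids]}.

    Returns a status for EVERY backup, full or incremental, so that
    lookups by backup_id never miss an entry.
    """
    fulls = sorted(
        (b for b in backups if backups[b]["parent"] is None), reverse=True
    )
    status = {}
    for i, full in enumerate(fulls):
        state = VALID if i < redundancy else OBSOLETE
        stack = [full]  # mark the full backup and all of its descendants
        while stack:
            backup_id = stack.pop()
            status[backup_id] = state
            stack.extend(backups[backup_id]["children"])
    return status
```

Because the report then covers incrementals too, a lookup like retention_status[backup.backup_id] in Server.list_backups would no longer hit a missing key.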
Yeah totally makes sense and I agree
Makes total sense and it is safer!
Something still seems broken here.
- On master:
$ barman list-backups pg17
pg17 20240628T000532 'yet_another_full' - Thu Jun 27 22:05:34 2024 - Size: 251.7 MiB - WAL Size: 0 B - WAITING_FOR_WALS
pg17 20240628T000524 'another_incremental' - Thu Jun 27 22:05:26 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000511 'another_full' - Thu Jun 27 22:05:13 2024 - Size: 251.7 MiB - WAL Size: 32.0 MiB
- On the dev branch:
$ barman list-backups pg17
pg17 20240628T000532 'yet_another_full' - Thu Jun 27 22:05:34 2024 - Size: 251.7 MiB - WAL Size: 0 B - WAITING_FOR_WALS
EXCEPTION: '20240628T000524'
See log file for more details.
$ tail -10 /var/log/barman/barman.log
2024-06-28 03:11:04,095 [28079] barman.cli ERROR: '20240628T000524'
See log file for more details.
Traceback (most recent call last):
File "/home/vagrant/Repositories/edb/barman/barman/cli.py", line 2416, in main
args.func(args)
File "/home/vagrant/Repositories/edb/barman/barman/cli.py", line 595, in list_backups
server.list_backups()
File "/home/vagrant/Repositories/edb/barman/barman/server.py", line 1762, in list_backups
and retention_status[backup.backup_id] != BackupInfo.VALID
KeyError: '20240628T000524'
The issue mentioned in the previous reply occurs as soon as I have the first incremental backup.
I added some comments to the PR. Overall I do feel that we should sort of "restart" from the drawing board. I share @barthisrael's sensation about this being inefficient and buggy, and sometimes just starting over is the way to go.
I have made the changes according to the reviews! I will rebase everything afterwards when all is OK! @barthisrael @gcalacoci I will still look into some tests that might need to be added to this.
Hey, this bug is related to the fact that list_backups uses the report in the beginning and then checks the available backups against the ones that were reported. So we actually need to either add all incrementals to the report as well, change the list_backups method of Server, or even create another retention status for incrementals that are not touched and add them to the report. What are your thoughts on this?
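To make the mismatch concrete, here is a toy reproduction (illustrative names, not Barman's actual code):

```python
# The report only covers full backups, while list_backups iterates over
# every backup, so the dict lookup fails on the first incremental.
all_backups = ["full_1", "incr_1", "full_2"]
retention_status = {"full_1": "OBSOLETE", "full_2": "VALID"}  # fulls only

def list_backups_broken():
    # Indexing the report for every backup raises KeyError for
    # incrementals that were never added to the report.
    return [(b, retention_status[b]) for b in all_backups]

# One direction discussed above: give every incremental an explicit
# status in the report, e.g. by inheriting it from its parent backup.
parent_of = {"incr_1": "full_1"}  # hypothetical parent mapping
full_report = dict(retention_status)
for incr, parent in parent_of.items():
    full_report[incr] = full_report[parent]
```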
Have you by chance applied changes locally and forgotten to commit? I can see several conversations marked as resolved, but no actual changes to the code.
Besides that, I'd like to suggest a couple of things: when a review contains suggestions, you can use the "Add suggestion to batch" button of each of them, followed by the "Commit suggestions" button to apply them all, and then git pull to fetch the commit created by GitHub. That avoids having to write the same code again on your local repo and tends to be faster for you.
The implementation itself seems to be working fine, at least with a redundancy retention policy. For example, with 4 full backups, one of them with KEEP, and a retention policy of REDUNDANCY 2:
$ barman list-backup pg17
pg17 20240702T224908 'yet_another_incr_backup' - Tue Jul 2 20:49:10 2024 - Size: 35.3 MiB - WAL Size: 0 B
pg17 20240702T224830 'yet_another_full_backup' - Tue Jul 2 20:48:33 2024 - Size: 251.2 MiB - WAL Size: 32.0 MiB
pg17 20240702T224631 'another_incr_backup_1' - Tue Jul 2 20:46:32 2024 - Size: 35.3 MiB - WAL Size: 48.0 MiB
pg17 20240702T224353 'another_full_backup' - Tue Jul 2 20:43:55 2024 - Size: 251.2 MiB - WAL Size: 32.0 MiB
pg17 20240702T224345 'incr_backup_2' - Tue Jul 2 20:43:47 2024 - Size: 35.3 MiB - WAL Size: 32.0 MiB - OBSOLETE
pg17 20240702T224339 'incr_backup_1' - Tue Jul 2 20:43:41 2024 - Size: 35.3 MiB - WAL Size: 32.0 MiB - OBSOLETE
pg17 20240702T224322 'full_backup' - Tue Jul 2 20:43:24 2024 - Size: 251.2 MiB - WAL Size: 32.0 MiB - OBSOLETE
pg17 20240702T223430 'child2.1' - Tue Jul 2 20:34:32 2024 - Size: 35.3 MiB - WAL Size: 48.0 MiB
pg17 20240702T223422 'child2' - Tue Jul 2 20:34:23 2024 - Size: 35.3 MiB - WAL Size: 32.0 MiB
pg17 20240702T223414 'child1.1.2' - Tue Jul 2 20:34:15 2024 - Size: 35.3 MiB - WAL Size: 32.0 MiB
pg17 20240702T223408 'child1.1.1' - Tue Jul 2 20:34:10 2024 - Size: 35.3 MiB - WAL Size: 32.0 MiB
pg17 20240702T223359 'child1.1' - Tue Jul 2 20:34:00 2024 - Size: 35.3 MiB - WAL Size: 48.0 MiB
pg17 20240702T223325 'child1' - Tue Jul 2 20:33:26 2024 - Size: 35.3 MiB - WAL Size: 32.0 MiB
pg17 20240702T223314 'root' - Tue Jul 2 20:33:16 2024 - Size: 251.2 MiB - WAL Size: 32.0 MiB - KEEP:STANDALONE
When barman cron runs, it deletes the OBSOLETE backups:
$ barman cron
Starting WAL archiving for server pg17
Enforcing retention policy: removing backup 20240702T224322 for server pg17
Deleting backup 20240702T224322 for server pg17
Deleted backup 20240702T224322 (start time: Tue Jul 2 23:02:19 2024, elapsed time: less than one second)
Enforcing retention policy: removing backup 20240702T224339 for server pg17
Deleting backup 20240702T224339 for server pg17
Deleted backup 20240702T224339 (start time: Tue Jul 2 23:02:19 2024, elapsed time: less than one second)
Enforcing retention policy: removing backup 20240702T224345 for server pg17
Deleting backup 20240702T224345 for server pg17
Deleted backup 20240702T224345 (start time: Tue Jul 2 23:02:19 2024, elapsed time: less than one second)
Tomorrow I'll take a look at unit tests again. I'm fine about the code changes.
Starting from Postgres version 17, the introduction of incremental backups ensures a continuous chain of backups beginning with a full backup and extending through several incremental backups, specifically when backup_method="postgres". Consequently, any retention policy should exclusively target full backups for deletion, as incremental backups rely on their parent chain of backups to remain valid and usable.
This commit:
3.1 is_incremental
This property was created to verify if the BackupInfo is an incremental backup.
3.2 has_children
This property was created to verify if the BackupInfo has any children (incremental backups).
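A minimal sketch of what such properties might look like, based on the backup.info fields shown earlier in this thread (parent_backup_id, children_backup_ids); the real implementation lives in Barman's BackupInfo class and may differ:

```python
# Simplified stand-in for Barman's BackupInfo (illustration only).
class BackupInfoSketch:
    def __init__(self, parent_backup_id=None, children_backup_ids=None):
        self.parent_backup_id = parent_backup_id
        self.children_backup_ids = children_backup_ids

    @property
    def is_incremental(self):
        # An incremental backup always references a parent backup.
        return self.parent_backup_id is not None

    @property
    def has_children(self):
        # True when at least one incremental was taken on top of this backup.
        return bool(self.children_backup_ids)
```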
References: BAR-166
Signed-off-by: Andre andre.marchesini@enterprisedb.com