EnterpriseDB / barman

Barman - Backup and Recovery Manager for PostgreSQL
https://www.pgbarman.org/
GNU General Public License v3.0
1.93k stars 185 forks source link

Barman backup failed | unable to create backup_label #607

Open r-chauhan16 opened 2 years ago

r-chauhan16 commented 2 years ago

Hi I am trying to take a backup on Azure blob storage mounted on Linux RHEL7. Wals are shipping correctly. but backup fails with error - unable to create backup_label.

BARMAN - 2.19 RHEL -7.9 Postgresql - 14.4

barman check server-powa Server server-powa: PostgreSQL: OK superuser or standard user with backup privileges: OK wal_level: OK directories: OK retention policy settings: OK backup maximum age: OK (no last_backup_maximum_age provided) backup minimum size: OK (0 B) wal maximum age: FAILED (No WAL files archived for last backup) wal size: OK (0 B) compression settings: OK failed backups: FAILED (there are 2 failed backups) minimum redundancy requirements: FAILED (have 0 backups, expected at least 2) ssh: OK (PostgreSQL server) systemid coherence: OK archive_mode: OK archive_command: OK continuous archiving: OK archiver errors: OK

barman backup server-powa

Starting backup using rsync-concurrent method for server server-powa in /barman-backup/server-powa/base/20220620T172302 Backup start at LSN: 11/250000D8 (000000010000001100000025, 000000D8) This is the first backup for server server-powa ERROR: The backup has failed starting backup Asking PostgreSQL server to finalize the backup. ERROR: Backup failed writing backup label. DETAILS: [Errno 2] No such file or directory: '/barman-backup/server-powa/base/20220620T172302/data/backup_label' Processing xlog segments from file archival for server-powa 000000010000001100000024 000000010000001100000025 000000010000001100000025.000000D8.backup EXCEPTION: [Errno 2] No such file or directory See log file for more details.

Log file errors -----

2022-06-20 17:23:04,074 [1882] barman.wal_archiver DEBUG: Archived 1 out of 1 xlog segments from file archival for server-powa 2022-06-20 17:23:05,131 [1844] barman.backup_executor DEBUG: Writing backup label: /barman-backup/server-powa/base/20220620T172302/data/backup_label 2022-06-20 17:23:05,136 [1844] barman.backup ERROR: Backup failed writing backup label. DETAILS: [Errno 2] No such file or directory: '/barman-backup/server-powa/base/20220620T172302/data/backup_label' 2022-06-20 17:23:05,248 [1844] barman.server DEBUG: Starting archive-wal for server server-powa 2022-06-20 17:23:05,256 [1844] barman.wal_archiver INFO: Found 3 xlog segments from file archival for server-powa. Archive all segments in one run.

Configuration information - barman_home = /barman-backup log_file = /home/barman/logs/barman.log last_wal_maximum_age = 2 MONTHS minimum_redundancy = 2 retention_policy = RECOVERY WINDOW OF 4 WEEKS

Regards

mikewallace1979 commented 2 years ago

I suspect the failure to write the backup label is related to the first error: ERROR: The backup has failed starting backup. Could you post the full log from the backup? It would be useful to see all the DEBUG logging to figure out which step of the starting backup action is failing.

r-chauhan16 commented 2 years ago

All logs from when I started the backup. barman.log

I believe the below warnings are ignorable, as I added the server. Then as part of troubleshooting, I removed all the configuration from the barman and added the server back. Warnings: Ignoring deletion of WAL file

Regards

mikewallace1979 commented 2 years ago

Unfortunately I think the original exception is getting buried because of the exception which occurs when attempting to write the backup label in the finally block - this causes a new exception to be raised which is then caught by the calling code, meaning the original exception which caused the backup to be terminated in the first place is never re-raised.

That said, we can work out where Barman failed based on the debug log statements which are not present - we see the debug line logged at L791 but we know current_action does not get updated on L797. We see the This is the first backup for server in _purge_unused_wal_files, then the various WARNING lines as it tries to delete WALs which it can't find, and then the The backup has failed line. So chances are something went wrong during _purge_unused_wal_files.

Interestingly, during the post-backup code which happens after the handling of backup exceptions, there as an additional removal of wal files which uses the same remove_wal_before_backup function that _purge_unused_wal files uses. Here we have a full stack trace for the error:

2022-06-20 18:17:43,340 [14735] barman.cli ERROR: [Errno 2] No such file or directory
See log file for more details.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/barman/cli.py", line 1910, in main
    args.func(args)
  File "/usr/lib/python2.7/site-packages/barman/cli.py", line 451, in backup
    server.backup(wait=args.wait, wait_timeout=args.wait_timeout)
  File "/usr/lib/python2.7/site-packages/barman/server.py", line 1437, in backup
    self.backup_manager.remove_wal_before_backup(backup_info)
  File "/usr/lib/python2.7/site-packages/barman/backup.py", line 1219, in remove_wal_before_backup
    fxlogdb_new.flush()
IOError: [Errno 2] No such file or directory

It's possible the original exception happened in the same way (though we can't prove this to be the case, it is however the most likely explanation based on the information we have). This error is interesting because there aren't really any interactions outside of this function - fxlogdb_new is a temporary file created in the same directory as the xlog.db file which Barman uses to track the WAL metadata and it's created near the beginning of the function. Based on the timestamps of the last batch of WARNING: Ignoring deletion of WAL file messages there's about 250ms between the opening of the temporary file and the IOError: [Errno 2] No such file or directory error.

What filesystem are you using for the Barman directories? Are there any other processes in your environment which could be affecting the barman backup process' ability to write to the filesystem?

r-chauhan16 commented 2 years ago

Okay.

So the filesystem is Azure Blob Storage mount on RHEL7 as local mount point with local caching as ramdisk with the ruse of BLOBFUSE project backed by Microsoft.

blobfuse 126G 2.1G 118G 2% /barman-backup

Now, the barman is able to write wals under the incoming and wals directory. Barman is able to take backup of archives, able to write xlog.db files, so I believe there is no problem with the mount point and any process stopping the barman from doing so.

I can try to backup any other server. How can remove the old meta for 1 server from BARMAN.

ERROR: The backup has failed starting backup Asking PostgreSQL server to finalize the backup. ERROR: Backup failed writing backup label. DETAILS: [Errno 2] No such file or directory: '/barman-backup/server-powa/base/20220621T085405/data/backup_label' Processing xlog segments from file archival for server-powa 0000000100000012000000C0 0000000100000012000000C1 0000000100000012000000C1.000000D8.backup

mikewallace1979 commented 2 years ago

Blobfuse is not something which Barman has been tested with - officially we support XFS and Ext4 however some barman users do run other file systems with some success. Typically we see issues when lock files reside on a network-backed filesystem however I wonder if there could also be an issue with temporary file creation.

I will try and find some time to test Barman with Blobfuse and see if it reproduces the same issue you're seeing here.

The best way to clear out all the metadata for a server is to remove (or move) the server directory in the barman home dir.

r-chauhan16 commented 2 years ago

Hi Mike,

So I moved out the barman home directory from the Blobfuse directory to the local mount point and the backup was completed successfully.

So yes there is a support issue with Blobfuse mount point.

Can you share some light on what exactly happens in python code when the "barman backup server-name" command is issued?

Regards, Ravi

mikewallace1979 commented 2 years ago

Can you share some light on what exactly happens in python code when the "barman backup server-name" command is issued?

At a high level what happens is this:

  1. Check that the server can be backed up by running the same code as barman check.
  2. If it's the first backup for this server, create the necessary directories and write the server identity to a file.
  3. Acquire a backup lock for this server (implemented using flock in barman_lock_directory).
  4. Create the backup metadata to be written to the (backup.info) file.
  5. Run any pre-backup hook scripts.
  6. Perform the backup using the specified backup method (rsync, local-rsync or postgres).
  7. Create a restore point using the PostgreSQL pg_create_restore_point() function.
  8. Run any post-backup hook scripts.
  9. Release the backup lock for this server.
  10. Archive any WALs in the incoming or streaming directories.
  11. If this is the first backup, delete all WALs before the begin_wal of this backup.

Step 6 is where the details of the backup take place and the implementation differs depending on the backup method being used. The _purge_unused_wal_files call which seems to be affected in this issue happens both before and after the copy of the backup data from the PostgreSQL server to the Barman server.