EnterpriseDB / barman

Barman - Backup and Recovery Manager for PostgreSQL
https://www.pgbarman.org/
GNU General Public License v3.0
2.07k stars 191 forks source link

Checks checksum consistency on backup chain #963

Closed gustabowill closed 2 months ago

gustabowill commented 2 months ago

Postgres incremental backups have the risk of the user switching the data_checkums configuration on the server between backups, which could potentially cause problems during recovery. An attempt of reducing these cases is implemented in this PR, where we will now store the state of data_checksums in the backup.info file of each backup, which will allow Barman to check the consistency of the chain during backup and recovery execution and emit warnings to the user so they can take action about it.

barthisrael commented 2 months ago

Just rebased on top of master.

barthisrael commented 2 months ago

This works as expected for backups:

$ barman list-backup all
pg17 20240726T231411 'incremental_2' - Fri Jul 26 21:14:13 2024 - Size: 35.4 MiB - WAL Size: 0 B
pg17 20240726T231359 'incremental_1' - Fri Jul 26 21:14:01 2024 - Size: 35.4 MiB - WAL Size: 32.0 MiB
pg17 20240726T231348 'full' - Fri Jul 26 21:13:51 2024 - Size: 251.3 MiB - WAL Size: 48.0 MiB
$ find ~/barman_home/ -name backup.info -print -exec bash -c "grep -i checksum {}; echo" \;
/home/vagrant/barman_home/pg17/base/20240726T231348/backup.info
data_checksums=on

/home/vagrant/barman_home/pg17/base/20240726T231359/backup.info
data_checksums=off

/home/vagrant/barman_home/pg17/base/20240726T231411/backup.info
data_checksums=on
$ barman backup pg17 --name incremental_3 --incremental incremental_2
Starting backup using postgres method for server pg17 in /home/vagrant/barman_home/pg17/base/20240726T231540
Backup start at LSN: 6/4000060 (000000010000000600000004, 00000060)
Starting backup copy via pg_basebackup for 20240726T231540
Copy done (time: 1 second)
Finalising the backup.
Backup size: 3.4 MiB
Backup end at LSN: 6/6000000 (000000010000000600000006, 00000000)
Backup completed (start time: 2024-07-26 23:15:40.967057, elapsed time: 1 second)
Processing xlog segments from streaming for pg17
    000000010000000600000004
    000000010000000600000005
WARNING: This is an incremental backup taken with `data_checksums = on` whereas some previous backups in the chain were taken with `data_checksums = off`. This can lead to potential recovery issues. Consider taking a new full backup to avoid having inconsistent backup chains.
WARNING: IMPORTANT: this backup is classified as WAITING_FOR_WALS, meaning that Barman has not received yet all the required WAL files for the backup consistency.
This is a common behaviour in concurrent backup scenarios, and Barman automatically set the backup as DONE once all the required WAL files have been archived.
Hint: execute the backup command with '--wait'
$ tail -4 /var/log/barman/barman.log
2024-07-26 23:15:42,905 [33686] barman.server WARNING: This is an incremental backup taken with `data_checksums = on` whereas some previous backups in the chain were taken with `data_checksums = off`. This can lead to potential recovery issues. Consider taking a new full backup to avoid having inconsistent backup chains.
2024-07-26 23:15:42,906 [33686] barman.server WARNING: IMPORTANT: this backup is classified as WAITING_FOR_WALS, meaning that Barman has not received yet all the required WAL files for the backup consistency.
This is a common behaviour in concurrent backup scenarios, and Barman automatically set the backup as DONE once all the required WAL files have been archived.
Hint: execute the backup command with '--wait'
barthisrael commented 2 months ago

This also works as expected when recovering -- a WARNING is issued:

$ barman list-backup all
pg17 20240726T231540 'incremental_3' - Fri Jul 26 21:15:42 2024 - Size: 35.4 MiB - WAL Size: 0 B
pg17 20240726T231411 'incremental_2' - Fri Jul 26 21:14:13 2024 - Size: 35.4 MiB - WAL Size: 48.0 MiB
pg17 20240726T231359 'incremental_1' - Fri Jul 26 21:14:01 2024 - Size: 35.4 MiB - WAL Size: 32.0 MiB
pg17 20240726T231348 'full' - Fri Jul 26 21:13:51 2024 - Size: 251.3 MiB - WAL Size: 48.0 MiB
$ find ~/barman_home/ -name backup.info -print -exec bash -c "grep -i checksum {}; echo" \;
/home/vagrant/barman_home/pg17/base/20240726T231348/backup.info
data_checksums=on

/home/vagrant/barman_home/pg17/base/20240726T231359/backup.info
data_checksums=off

/home/vagrant/barman_home/pg17/base/20240726T231411/backup.info
data_checksums=on

/home/vagrant/barman_home/pg17/base/20240726T231540/backup.info
data_checksums=on
$ barman recover pg17 incremental_3 ~/data_restore --recovery-staging-path /tmp
Start combining backup via pg_combinebackup for backup 20240726T231540 on /tmp
End combining backup via pg_combinebackup for backup 20240726T231540
Starting local restore for server pg17 using backup 20240726T231540
Destination directory: /home/vagrant/data_restore
Copying the base backup.
Copying required WAL segments.
Generating archive status files
Identify dangerous settings in destination directory.
WARNING: You recovered from an incremental backup where checksums were enabled on that backup, but not all backups in the chain. It is advised to disable, and optionally re-enable, checksums on the destination directory to avoid failures.

Recovery completed (start time: 2024-07-26 23:21:14.038575+02:00, elapsed time: less than one second)
Your PostgreSQL server has been successfully prepared for recovery!
$ tail -1 /var/log/barman/barman.log
2024-07-26 23:21:14,581 [34700] barman.recovery_executor WARNING: You recovered from an incremental backup where checksums were enabled on that backup, but not all backups in the chain. It is advised to disable, and optionally re-enable, checksums on the destination directory to avoid failures.
barthisrael commented 2 months ago

The previous tests were about cases where a WARNING was expected for backup and recovery commands, i.e. when the last backup has checksums enabled, but some backup in the chain does not.

I also tested the other cases, where there are divergences in the chain (mixes of on and off, but the latest backup is off), thus it's not expected that we get a WARNING. It went fine:

# Get list of backups
$ barman list-backup all
pg17 20240726T231540 'incremental_3' - Fri Jul 26 21:15:42 2024 - Size: 35.4 MiB - WAL Size: 0 B
pg17 20240726T231411 'incremental_2' - Fri Jul 26 21:14:13 2024 - Size: 35.4 MiB - WAL Size: 48.0 MiB
pg17 20240726T231359 'incremental_1' - Fri Jul 26 21:14:01 2024 - Size: 35.4 MiB - WAL Size: 32.0 MiB
pg17 20240726T231348 'full' - Fri Jul 26 21:13:51 2024 - Size: 251.3 MiB - WAL Size: 48.0 MiB

# Check their checksum state
$ find ~/barman_home/ -name backup.info -print -exec bash -c "grep -i checksum {}; echo" \;
/home/vagrant/barman_home/pg17/base/20240726T231348/backup.info
data_checksums=on

/home/vagrant/barman_home/pg17/base/20240726T231359/backup.info
data_checksums=off

/home/vagrant/barman_home/pg17/base/20240726T231411/backup.info
data_checksums=on

/home/vagrant/barman_home/pg17/base/20240726T231540/backup.info
data_checksums=on

# Data checkums are disabled
$ psql -h localhost -U postgres -c "SHOW data_checksums"
 data_checksums
----------------
 off
(1 row)

# Take a new backup -- no WARNING about checkums, as expected
$ barman backup pg17 --name incremental_4 --incremental incremental_3
Starting backup using postgres method for server pg17 in /home/vagrant/barman_home/pg17/base/20240726T233256
Backup start at LSN: 6/7000110 (000000010000000600000007, 00000110)
Starting backup copy via pg_basebackup for 20240726T233256
Copy done (time: 1 second)
Finalising the backup.
Backup size: 3.4 MiB
Backup end at LSN: 6/9000000 (000000010000000600000009, 00000000)
Backup completed (start time: 2024-07-26 23:32:56.187060, elapsed time: 1 second)
Processing xlog segments from streaming for pg17
    000000010000000600000007
    000000010000000600000008
WARNING: IMPORTANT: this backup is classified as WAITING_FOR_WALS, meaning that Barman has not received yet all the required WAL files for the backup consistency.
This is a common behaviour in concurrent backup scenarios, and Barman automatically set the backup as DONE once all the required WAL files have been archived.
Hint: execute the backup command with '--wait'
# List of backups
$ barman list-backup all
pg17 20240726T233256 'incremental_4' - Fri Jul 26 21:32:57 2024 - Size: 35.4 MiB - WAL Size: 0 B
pg17 20240726T231540 'incremental_3' - Fri Jul 26 21:15:42 2024 - Size: 35.4 MiB - WAL Size: 48.0 MiB
pg17 20240726T231411 'incremental_2' - Fri Jul 26 21:14:13 2024 - Size: 35.4 MiB - WAL Size: 48.0 MiB
pg17 20240726T231359 'incremental_1' - Fri Jul 26 21:14:01 2024 - Size: 35.4 MiB - WAL Size: 32.0 MiB
pg17 20240726T231348 'full' - Fri Jul 26 21:13:51 2024 - Size: 251.3 MiB - WAL Size: 48.0 MiB

# Their checksum state
$ find ~/barman_home/ -name backup.info -print -exec bash -c "grep -i checksum {}; echo" \;
/home/vagrant/barman_home/pg17/base/20240726T231348/backup.info
data_checksums=on

/home/vagrant/barman_home/pg17/base/20240726T231359/backup.info
data_checksums=off

/home/vagrant/barman_home/pg17/base/20240726T231411/backup.info
data_checksums=on

/home/vagrant/barman_home/pg17/base/20240726T231540/backup.info
data_checksums=on

/home/vagrant/barman_home/pg17/base/20240726T233256/backup.info
data_checksums=off

# Attempt to recover -- no WARNING, as expected
$ barman recover pg17 incremental_4 ~/data_restore --recovery-staging-path /tmp
Start combining backup via pg_combinebackup for backup 20240726T233256 on /tmp
End combining backup via pg_combinebackup for backup 20240726T233256
Starting local restore for server pg17 using backup 20240726T233256
Destination directory: /home/vagrant/data_restore
Copying the base backup.
Copying required WAL segments.
Generating archive status files
Identify dangerous settings in destination directory.

Recovery completed (start time: 2024-07-26 23:35:05.869706+02:00, elapsed time: less than one second)
Your PostgreSQL server has been successfully prepared for recovery!
gustabowill commented 2 months ago

@barthisrael I just applied your suggestions. I also left a thought in one of your replies about the output of show-backup command.

barthisrael commented 2 months ago

barman show-backup is also working, both in console and json writers:

# Console writer
$ barman -f console show-backup pg17 incremental_4
Backup 20240726T233256:
  Backup Name            : incremental_4
  Server Name            : pg17
  System Id              : 7382227563826631291
  Status                 : DONE
  PostgreSQL Version     : 170000
  PGDATA directory       : /var/lib/pgsql/17/data
  Checksums              : off

  Tablespaces:
    custom_ts_1          : /var/lib/pgsql/17/tablespaces/custom_ts_1 (oid: 24678)
    custom_ts_2          : /var/lib/pgsql/17/tablespaces/custom_ts_2 (oid: 24686)

  Base backup information:
    Disk usage           : 3.4 MiB (35.4 MiB with WALs)
    Incremental size     : 3.4 MiB (-0.00%)
    Timeline             : 1
    Begin WAL            : 000000010000000600000008
    End WAL              : 000000010000000600000009
    WAL number           : 2
    Begin time           : 2024-07-26 21:32:56.183376+00:00
    End time             : 2024-07-26 21:32:57.553398+00:00
    Copy time            : 1 second
    Estimated throughput : 2.5 MiB/s
    Begin Offset         : 40
    End Offset           : 0
    Begin LSN            : 6/8000028
    End LSN              : 6/9000000

  WAL information:
    No of files          : 0
    Disk usage           : 0 B
    WAL rate             : 195.18/hour
    Last available       : 000000010000000600000009

  Catalog information:
    Retention Policy     : VALID
    Previous Backup      : 20240726T231540
    Next Backup          : - (this is the latest base backup)

# json writer
$ barman -f json show-backup pg17 incremental_4 | jq
{
  "pg17": {
    "backup_id": "20240726T233256",
    "backup_name": "incremental_4",
    "base_backup_information": {
      "analysis_time": "less than one second",
      "analysis_time_seconds": 0,
      "begin_lsn": "6/8000028",
      "begin_offset": 40,
      "begin_time": "2024-07-26 21:32:56.183376+00:00",
      "begin_time_timestamp": "1722029576",
      "begin_wal": "000000010000000600000008",
      "copy_time": "1 second",
      "copy_time_seconds": 1.350093,
      "disk_usage": "3.4 MiB",
      "disk_usage_bytes": 3528561,
      "disk_usage_with_wals": "35.4 MiB",
      "disk_usage_with_wals_bytes": 37082993,
      "end_lsn": "6/9000000",
      "end_offset": 0,
      "end_time": "2024-07-26 21:32:57.553398+00:00",
      "end_time_timestamp": "1722029577",
      "end_wal": "000000010000000600000009",
      "incremental_size": "3.4 MiB",
      "incremental_size_bytes": 3528561,
      "incremental_size_ratio": "-0.00%",
      "number_of_workers": 1,
      "throughput": "2.5 MiB/s",
      "throughput_bytes": 2613568.8430352574,
      "timeline": 1
    },
    "catalog_information": {
      "next_backup": "- (this is the latest base backup)",
      "previous_backup": "20240726T231540",
      "retention_policy": "VALID"
    },
    "data_checksums": "off",
    "pgdata_directory": "/var/lib/pgsql/17/data",
    "postgresql_version": 170000,
    "status": "DONE",
    "tablespaces": [
      {
        "location": "/var/lib/pgsql/17/tablespaces/custom_ts_1",
        "name": "custom_ts_1",
        "oid": 24678
      },
      {
        "location": "/var/lib/pgsql/17/tablespaces/custom_ts_2",
        "name": "custom_ts_2",
        "oid": 24686
      }
    ],
    "wal_information": {
      "compression_ratio": 0,
      "disk_usage": "0 B",
      "disk_usage_bytes": 0,
      "last_available": "000000010000000600000009",
      "no_of_files": 0,
      "timelines": [],
      "wal_rate": "195.18/hour",
      "wal_rate_per_second": 0.05421688016258301
    }
  }
}
barthisrael commented 2 months ago

@barthisrael I just applied your suggestions. I also left a thought in one of your replies about the output of show-backup command.

I like your suggestion! Let's bring that up to the team and get their thoughts.