fly-apps / postgres-flex

Postgres HA setup using repmgr

WAL-Archiving + PITR #231

Closed davissp14 closed 4 days ago

davissp14 commented 2 weeks ago

This PR introduces a WAL Archiving solution that integrates with any S3 compatible object storage. It also provides various remote recovery options for restoring into a new app. While in-place recovery options are possible with the provided functionality, the commands must be issued manually. The current automated recovery process will fail if there's an existing postgresql directory.

Docs for in-place restores will be coming soon.

Enabling WAL Archiving

To enable, set the following secret:

S3_ARCHIVE_CONFIG=https://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@<AWS_ENDPOINT_URL_S3>/<BUCKET_NAME>/<PG_APP_NAME>

The most straightforward way to get started is to create a new Tigris bucket via:

flyctl ext tigris create

This returns a set of credentials, for example:

AWS_ACCESS_KEY_ID: tid_key_id
AWS_ENDPOINT_URL_S3: https://fly.storage.tigris.dev
AWS_SECRET_ACCESS_KEY: tsec_secret_key
BUCKET_NAME: fragrant-glade-2283

Then set the secret with fly secrets set S3_ARCHIVE_CONFIG=... --app <pg-app-name>
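Putting the pieces together with the example Tigris credentials above, the secret value can be assembled like this (a sketch; the app name is a placeholder, and the endpoint host is embedded without its scheme since the config URL already carries https):

```shell
# Assemble S3_ARCHIVE_CONFIG from the example Tigris credentials above.
# All values are illustrative placeholders.
AWS_ACCESS_KEY_ID="tid_key_id"
AWS_SECRET_ACCESS_KEY="tsec_secret_key"
BUCKET_NAME="fragrant-glade-2283"
PG_APP_NAME="my-pg-app"   # hypothetical Postgres app name

# Scheme stays https; the endpoint host follows the credentials.
S3_ARCHIVE_CONFIG="https://${AWS_ACCESS_KEY_ID}:${AWS_SECRET_ACCESS_KEY}@fly.storage.tigris.dev/${BUCKET_NAME}/${PG_APP_NAME}"
echo "$S3_ARCHIVE_CONFIG"

# Then set it on the app:
# fly secrets set S3_ARCHIVE_CONFIG="$S3_ARCHIVE_CONFIG" --app "$PG_APP_NAME"
```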

Viewing/Updating WAL Archiving Configuration

Definitions

recovery_window (default: 7d) Acts as a retention policy: backups older than the specified value are pruned. The shortest allowed recovery window is 1d.

Units available:

minimum_redundancy (default: 3) The minimum number of backups that should always be available. Must be >= 0.

full_backup_frequency (default: 24h) The frequency at which full backups are taken. Must be >= 1h.

archive_timeout (default: 60s) A limit on how old unarchived data can be. Archiving normally happens once a WAL segment is full, but this can be a problem for less active databases: if the segment doesn't fill within the specified timeout, a WAL switch is forced, triggering an archive event.

WARNING - Setting the archive timeout too low can have performance implications and bloat object storage.
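To make the warning concrete: each forced switch can archive a whole WAL segment (16 MB by default), so the worst-case archive volume on a mostly idle database scales inversely with the timeout. A rough back-of-the-envelope:

```shell
# Worst-case daily archive volume from forced WAL switches,
# assuming the default 16 MB WAL segment size.
segment_mb=16
timeout_s=60                              # archive_timeout in seconds
switches_per_day=$(( 86400 / timeout_s )) # seconds per day / timeout
mb_per_day=$(( switches_per_day * segment_mb ))
echo "${switches_per_day} forced switches/day -> up to ${mb_per_day} MB/day"
# -> 1440 forced switches/day -> up to 23040 MB/day (~22.5 GB)
```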

View your current configuration

First, you can view the existing settings by running the following curl command:

curl http://<app-name>.internal:5500/commands/admin/settings/view/barman -s | python3 -m json.tool
{
    "result": {
        "archive_timeout": "60s",
        "full_backup_frequency": "24h",
        "minimum_redundancy": "3",
        "recovery_window": "7d"
    }
}

Update your configuration

To update a configuration option, you can run:

curl http://shaun-barman-00.internal:5500/commands/admin/settings/update/barman -d '{"recovery_window": "1d"}'
{"result":{"message":"Updated","restart_required":true}}

Then when we re-run the view command we can see the recovery window has been updated:

curl http://shaun-barman-00.internal:5500/commands/admin/settings/view/barman -s | python3 -m json.tool
{
    "result": {
        "archive_timeout": "60s",
        "full_backup_frequency": "24h",
        "minimum_redundancy": "3",
        "recovery_window": "1d"
    }
}
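Since the update response reports restart_required, a small wrapper can gate the restart on it. This is a sketch: the response is inlined rather than fetched, and the restart command in the comment is an assumption, not something this PR prescribes.

```shell
# Sketch: inspect the settings/update response and restart only when required.
# The response is hard-coded here; in practice it would come from the
# settings/update curl call shown above.
response='{"result":{"message":"Updated","restart_required":true}}'

if echo "$response" | grep -q '"restart_required":true'; then
  echo "restart required"
  # Hypothetical restart step, e.g.:
  # fly postgres restart --app <pg-app-name>
fi
```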

Note - This is a manual process, but the idea is that it makes it straightforward to surface within a client and/or UI. These settings currently require a cluster restart, but in the future it should be possible to apply them at runtime.

Remote Restore

Provision a new Postgres with the following secret:

S3_ARCHIVE_REMOTE_RESTORE_CONFIG=https://<access_key_id>:<access_secret>@<s3-endpoint>/<source-bucket-name>/<source-bucket-directory>

Note: If you don't specify a target it will default to restoring to the most recent point-in-time.

Target options

target Specifies that recovery should end as soon as a consistent state is reached. The only allowed value here is "immediate".

targetTime This parameter specifies the time stamp up to which recovery will proceed. The timestamp should follow RFC3339 formatting.

targetName This parameter specifies the named restore point.

targetLSN (Not implemented) This parameter specifies the LSN of the write-ahead log location up to which recovery will proceed.

targetXID (Not implemented) This parameter specifies the transaction ID up to which recovery will proceed.

Additional target options

targetInclusive (default: true) Specifies whether to stop just after the specified recovery target (true), or just before the recovery target (false).

targetAction (default: promote) Specifies what action the server should take once the recovery target is reached.

targetTimeline (default: current-timeline) Specifies recovering into a particular timeline. The default is to recover along the same timeline that was current when the base backup was taken.

Example usage

S3_ARCHIVE_REMOTE_RESTORE_CONFIG=https://<access_key_id>:<access_secret>@<s3-endpoint>/<source-bucket-name>/<source-bucket-directory>?targetTime=2024-07-03T20:23:46Z

For more information on these settings, see: https://www.postgresql.org/docs/current/runtime-config-wal.html#RUNTIME-CONFIG-WAL-RECOVERY-TARGET
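Because the restore config carries query parameters, quoting matters when setting the secret. A sketch combining targetTime with targetInclusive (host, bucket, and values are illustrative; this assumes standard query-string joining with &, which the shell would otherwise interpret):

```shell
# Build a remote-restore config with multiple target options.
# Credentials, endpoint, and bucket are placeholders.
base="https://<access_key_id>:<access_secret>@fly.storage.tigris.dev/source-bucket/source-dir"
config="${base}?targetTime=2024-07-03T20:23:46Z&targetInclusive=false"
echo "$config"

# Quote the value so ? and & survive the shell:
# fly secrets set S3_ARCHIVE_REMOTE_RESTORE_CONFIG="$config" --app <new-pg-app>
```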

benwaffle commented 1 week ago

can we change the env vars to match the default fly tigris create ones?

benwaffle commented 1 week ago

I'm seeing an error listing backups:

2024-07-03T06:07:43Z app[6833e41a77d908] ewr [info]panic: failed to restore base backup: failed to resolve backup target by time: failed to parse backup end time: parsing time "" as "Mon Jan 2 15:04:05 2006": cannot parse "" as "Mon"
2024-07-03T06:07:43Z app[6833e41a77d908] ewr [info]goroutine 1 [running]:
2024-07-03T06:07:43Z app[6833e41a77d908] ewr [info]main.panicHandler({0x9ba280?, 0xc000106340})
2024-07-03T06:07:43Z app[6833e41a77d908] ewr [info] /go/src/github.com/fly-apps/fly-postgres/cmd/start/main.go:198 +0x55
2024-07-03T06:07:43Z app[6833e41a77d908] ewr [info]main.main()
2024-07-03T06:07:43Z app[6833e41a77d908] ewr [info] /go/src/github.com/fly-apps/fly-postgres/cmd/start/main.go:75 +0xf50

here's the json:

postgres@6833e41a77d908:~$ barman-cloud-backup-list --cloud-provider aws-s3 --endpoint-url https://fly.storage.tigris.dev --profile restore --format json s3://dbdbdbbucket dbdbdb | python3 -m json.tool
{
    "backups_list": [
        {
            "backup_label": null,
            "begin_offset": 216,
            "begin_time": "Tue Jul  2 21:05:44 2024",
            "begin_wal": "00000005000000000000002A",
            "begin_xlog": "0/2A0000D8",
            "compression": null,
            "config_file": "/data/postgresql/postgresql.conf",
            "copy_stats": null,
            "deduplicated_size": null,
            "end_offset": null,
            "end_time": null,
            "end_wal": null,
            "end_xlog": null,
            "error": "failure uploading data (connection already closed)",
            "hba_file": "/data/postgresql/pg_hba.conf",
            "ident_file": "/data/postgresql/pg_ident.conf",
            "included_files": [
                "/data/postgresql/postgresql.auto.conf",
                "/data/postgresql/postgresql.internal.conf"
            ],
            "mode": null,
            "pgdata": "/data/postgresql",
            "server_name": "cloud",
            "size": null,
            "status": "FAILED",
            "systemid": "7332222271544570189",
            "tablespaces": null,
            "timeline": 5,
            "version": 150006,
            "xlog_segment_size": 16777216,
            "backup_id": "20240702T210544"
        },
        {
            "backup_label": "'START WAL LOCATION: 0/2E000028 (file 00000005000000000000002E)\\nCHECKPOINT LOCATION: 0/2E000060\\nBACKUP METHOD: streamed\\nBACKUP FROM: primary\\nSTART TIME: 2024-07-02 21:07:03 UTC\\nLABEL: Barman backup cloud 20240702T210703\\nSTART TIMELINE: 5\\n'",
            "begin_offset": 40,
            "begin_time": "Tue Jul  2 21:07:03 2024",
            "begin_wal": "00000005000000000000002E",
            "begin_xlog": "0/2E000028",
            "compression": null,
            "config_file": "/data/postgresql/postgresql.conf",
            "copy_stats": {
                "total_time": 5.192484,
                "number_of_workers": 2,
                "analysis_time": 0,
                "analysis_time_per_item": {
                    "data": 0
                },
                "copy_time_per_item": {
                    "data": 4.423057
                },
                "serialized_copy_time_per_item": {
                    "data": 2.484593
                },
                "copy_time": 4.423057,
                "serialized_copy_time": 2.484593
            },
            "deduplicated_size": null,
            "end_offset": 80,
            "end_time": "Tue Jul  2 21:07:05 2024",
            "end_wal": "00000005000000000000002F",
            "end_xlog": "0/2F000050",
            "error": null,
            "hba_file": "/data/postgresql/pg_hba.conf",
            "ident_file": "/data/postgresql/pg_ident.conf",
            "included_files": [
                "/data/postgresql/postgresql.auto.conf",
                "/data/postgresql/postgresql.internal.conf"
            ],
            "mode": null,
            "pgdata": "/data/postgresql",
            "server_name": "cloud",
            "size": null,
            "status": "DONE",
            "systemid": "7332222271544570189",
            "tablespaces": null,
            "timeline": 5,
            "version": 150006,
            "xlog_segment_size": 16777216,
            "backup_id": "20240702T210703"
        }
    ]
}
davissp14 commented 6 days ago

What do the query params look like for the S3_ARCHIVE_REMOTE_RESTORE_CONFIG secret?

EDIT Oh, I see. The backup failed, so it doesn't have an end_time. Good find.

UPDATE Issue should be fixed.