Vonng / pigsty

Battery-Included PostgreSQL Distro as a Free RDS Alternative
https://pigsty.io
GNU Affero General Public License v3.0

Postgres replica reinit failed after pgBackRest PITR performed with the commands guided by pg-pitr #417

Closed: Zmccll closed this issue 1 month ago

Zmccll commented 1 month ago

pgBackRest backup info (pb info):

postgres@pg-meta-1:~$ pb info
    stanza: pg-meta
    status: ok
    cipher: none

    db (current)
        wal archive min/max (16): 000000020000000000000022/000000020000000000000026

        full backup: 20240517-061002F
            timestamp start/stop: 2024-05-17 06:10:02+00 / 2024-05-17 06:10:04+00
            wal start/stop: 000000020000000000000022 / 000000020000000000000022
            database size: 25.1MB, database backup size: 25.1MB
            repo1: backup set size: 5.2MB, backup size: 5.2MB

        full backup: 20240517-061101F
            timestamp start/stop: 2024-05-17 06:11:01+00 / 2024-05-17 06:11:05+00
            wal start/stop: 000000020000000000000024 / 000000020000000000000024
            database size: 25.1MB, database backup size: 25.1MB
            repo1: backup set size: 5.2MB, backup size: 5.2MB

        diff backup: 20240517-061101F_20240517-061510D
            timestamp start/stop: 2024-05-17 06:15:10+00 / 2024-05-17 06:15:12+00
            wal start/stop: 000000020000000000000026 / 000000020000000000000026
            database size: 25.1MB, database backup size: 8.3KB
            repo1: backup set size: 5.2MB, backup size: 467B
            backup reference list: 20240517-061101F

The commands guided by pg-pitr:

postgres@pg-meta-1:~$ pg-pitr -t "2024-05-17 06:15:00+00"
pgbackrest --stanza=pg-meta --type=time --target='2024-05-17 06:15:00+00' restore
Perform time PITR on pg-meta
[1. Stop PostgreSQL] ===========================================
   1.1 Pause Patroni (if there are any replicas)
       $ pg pause <cls>  # pause patroni auto failover
   1.2 Shutdown Patroni
       $ pt-stop         # sudo systemctl stop patroni
   1.3 Shutdown Postgres
       $ pg-stop         # pg_ctl -D /pg/data stop -m fast

[2. Perform PITR] ===========================================
   2.1 Restore Backup
       $ pgbackrest --stanza=pg-meta --type=time --target='2024-05-17 06:15:00+00' restore
   2.2 Start PG to Replay WAL
       $ pg-start        # pg_ctl -D /pg/data start
   2.3 Validate and Promote
     - If database content is ok, promote it to finish recovery, otherwise goto 2.1
       $ pg-promote      # pg_ctl -D /pg/data promote

[3. Restart Patroni] ===========================================
   3.1 Start Patroni
       $ pt-start;        # sudo systemctl start patroni
   3.2 Enable Archive Again
       $ psql -c 'ALTER SYSTEM SET archive_mode = on; SELECT pg_reload_conf();'
   3.3 Restart Patroni
       $ pt-restart      # sudo systemctl start patroni

[4. Restore Cluster] ===========================================
   3.1 Re-Init All Replicas (if any replicas)
       $ pg reinit <cls> <ins>
   3.2 Resume Patroni
       $ pg resume <cls> # resume patroni auto failover
   3.2 Make Full Backup (optional)
       $ pg-backup full  # pgbackrest --stanza=pg-meta backup --type=full

After performing the recovery as guided, the status of the database cluster is as follows:

postgres@pg-meta-1:~$ pg list pg-meta
+ Cluster: pg-meta (7369538344279343350) -----+----+-----------+-----------------+
| Member    | Host        | Role    | State   | TL | Lag in MB | Tags            |
+-----------+-------------+---------+---------+----+-----------+-----------------+
| pg-meta-1 | 10.60.10.10 | Leader  | running |  3 |           | clonefrom: true |
|           |             |         |         |    |           | conf: oltp.yml  |
|           |             |         |         |    |           | spec: 4C.8G.48G |
|           |             |         |         |    |           | version: '16'   |
+-----------+-------------+---------+---------+----+-----------+-----------------+
| pg-meta-2 | 10.60.10.9  | Replica | running |  1 |       448 | clonefrom: true |
|           |             |         |         |    |           | conf: oltp.yml  |
|           |             |         |         |    |           | spec: 4C.8G.48G |
|           |             |         |         |    |           | version: '16'   |
+-----------+-------------+---------+---------+----+-----------+-----------------+
| pg-meta-3 | 10.60.10.8  | Replica | running |  1 |       448 | clonefrom: true |
|           |             |         |         |    |           | conf: oltp.yml  |
|           |             |         |         |    |           | spec: 4C.8G.48G |
|           |             |         |         |    |           | version: '16'   |
+-----------+-------------+---------+---------+----+-----------+-----------------+

The timelines (TL) are inconsistent between the leader and the replicas. Deleting a replica node and re-adding it did not resolve the issue. Error log from pg-meta-2:

2024-05-17 06:29:50.777 UTC,"replicator","",48472,"10.60.10.9:33014",6646f95e.bd58,3,"TIMELINE_HISTORY",2024-05-17 06:29:50 UTC,7/0,0,ERROR,58P01,"could not open file ""pg_wal/00000002.history"": No such file or directory",,,,,,"TIMELINE_HISTORY 2",,,"pg-meta-2","walsender",,0
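
One way to check whether the missing timeline history file still exists in the pgBackRest repository, and to pull it back into pg_wal, is sketched below. This is not from the thread: the archive-id path component (16-1) and the destination path are assumptions based on the defaults shown above, so adjust them to your environment.

# Sketch: list the archive for timeline history files (the 16-1 archive-id is assumed)
pgbackrest --stanza=pg-meta repo-ls archive/pg-meta/16-1 | grep history
# If 00000002.history is present, fetch it into the local pg_wal so the walsender can serve TIMELINE_HISTORY 2
pgbackrest --stanza=pg-meta archive-get 00000002.history /pg/data/pg_wal/00000002.history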

Zmccll commented 1 month ago

Pigsty version: 2.6.0
Postgres version:

 dbuser_dba@pg-meta-1:5432/postgres=# SELECT version();
                                                              version
-----------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 16.3 (Ubuntu 16.3-1.pgdg22.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit
Zmccll commented 1 month ago

The entire recovery process:

postgres@pg-meta-1:~$  pg pause
Success: cluster management is paused
postgres@pg-meta-1:~$ pt-stop
postgres@pg-meta-1:~$ pg-stop
waiting for server to shut down.... done
server stopped
postgres@pg-meta-1:~$ pgbackrest --stanza=pg-meta --type=time --target='2024-05-17 07:05:46+00' restore
2024-05-17 07:10:02.776 P00   INFO: restore command begin 2.51: --archive-mode=off --delta --exec-id=69015-4650a479 --link-all --log-level-console=info --log-level-file=detail --log-path=/pg/log/pgbackrest --pg1-path=/pg/data --process-max=4 --repo1-path=/pg/backup --spool-path=/pg/tmp --stanza=pg-meta --target="2024-05-17 07:05:46+00" --type=time
2024-05-17 07:10:02.783 P00   INFO: repo1: restore backup set 20240517-070332F, recovery will start at 2024-05-17 07:03:32
2024-05-17 07:10:02.785 P00   INFO: remove invalid files/links/paths from '/pg/data'
2024-05-17 07:10:03.437 P00   INFO: write updated /pg/data/postgresql.auto.conf
2024-05-17 07:10:03.451 P00   INFO: restore global/pg_control (performed last to ensure aborted restores cannot be started)
2024-05-17 07:10:03.452 P00   INFO: restore size = 25MB, file total = 981
2024-05-17 07:10:03.453 P00   INFO: restore command end: completed successfully (680ms)
postgres@pg-meta-1:~$ pg-start
waiting for server to start....2024-05-17 07:10:06.776 UTC [69023] LOG:  redirecting log output to logging collector process
2024-05-17 07:10:06.776 UTC [69023] HINT:  Future log output will appear in directory "/pg/log/postgres".
 done
server started
postgres@pg-meta-1:~$ pg-promote
waiting for server to promote.... done
server promoted
postgres@pg-meta-1:~$ pt-start
postgres@pg-meta-1:~$ psql -c 'ALTER SYSTEM SET archive_mode = on;'
ALTER SYSTEM
Time: 2.251 ms
postgres@pg-meta-1:~$ psql -c 'SHOW archive_mode;'
 archive_mode
--------------
 off
(1 row)

Time: 0.107 ms
postgres@pg-meta-1:~$ pg-restart
waiting for server to shut down.... done
server stopped
waiting for server to start....2024-05-17 07:10:36.326 UTC [69197] LOG:  redirecting log output to logging collector process
2024-05-17 07:10:36.326 UTC [69197] HINT:  Future log output will appear in directory "/pg/log/postgres".
 done
server started
postgres@pg-meta-1:~$ psql -c 'SHOW archive_mode;'
 archive_mode
--------------
 on
(1 row)

Time: 0.253 ms
postgres@pg-meta-1:~$ pt-restart
postgres@pg-meta-1:~$  pg reinit pg-meta
+ Cluster: pg-meta (7369858807812509474) -----+----+-----------+-----------------+
| Member    | Host        | Role    | State   | TL | Lag in MB | Tags            |
+-----------+-------------+---------+---------+----+-----------+-----------------+
| pg-meta-1 | 10.60.10.10 | Leader  | running |  2 |           | clonefrom: true |
|           |             |         |         |    |           | conf: oltp.yml  |
|           |             |         |         |    |           | spec: 4C.8G.48G |
|           |             |         |         |    |           | version: '16'   |
+-----------+-------------+---------+---------+----+-----------+-----------------+
| pg-meta-2 | 10.60.10.9  | Replica | running |  1 |         0 | clonefrom: true |
|           |             |         |         |    |           | conf: oltp.yml  |
|           |             |         |         |    |           | spec: 4C.8G.48G |
|           |             |         |         |    |           | version: '16'   |
+-----------+-------------+---------+---------+----+-----------+-----------------+
| pg-meta-3 | 10.60.10.8  | Replica | running |  1 |         0 | clonefrom: true |
|           |             |         |         |    |           | conf: oltp.yml  |
|           |             |         |         |    |           | spec: 4C.8G.48G |
|           |             |         |         |    |           | version: '16'   |
+-----------+-------------+---------+---------+----+-----------+-----------------+
 Maintenance mode: on
Which member do you want to reinitialize [pg-meta-3, pg-meta-2]? []: pg-meta-3
Are you sure you want to reinitialize members pg-meta-3? [y/N]: y
Success: reinitialize for member pg-meta-3
postgres@pg-meta-1:~$ pg list
+ Cluster: pg-meta (7369858807812509474) -----+----+-----------+-----------------+
| Member    | Host        | Role    | State   | TL | Lag in MB | Tags            |
+-----------+-------------+---------+---------+----+-----------+-----------------+
| pg-meta-1 | 10.60.10.10 | Leader  | running |  2 |           | clonefrom: true |
|           |             |         |         |    |           | conf: oltp.yml  |
|           |             |         |         |    |           | spec: 4C.8G.48G |
|           |             |         |         |    |           | version: '16'   |
+-----------+-------------+---------+---------+----+-----------+-----------------+
| pg-meta-2 | 10.60.10.9  | Replica | running |  1 |         0 | clonefrom: true |
|           |             |         |         |    |           | conf: oltp.yml  |
|           |             |         |         |    |           | spec: 4C.8G.48G |
|           |             |         |         |    |           | version: '16'   |
+-----------+-------------+---------+---------+----+-----------+-----------------+
| pg-meta-3 | 10.60.10.8  | Replica | running |  1 |         0 | clonefrom: true |
|           |             |         |         |    |           | conf: oltp.yml  |
|           |             |         |         |    |           | spec: 4C.8G.48G |
|           |             |         |         |    |           | version: '16'   |
+-----------+-------------+---------+---------+----+-----------+-----------------+
 Maintenance mode: on
postgres@pg-meta-1:~$
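
A note on step 3.2 of the guide: archive_mode is a postmaster-level parameter, so ALTER SYSTEM plus pg_reload_conf() is not enough, which is why SHOW archive_mode still reported off until pg-restart above. A minimal check (not part of the original transcript):

# context = 'postmaster' means the setting only takes effect after a full restart
psql -c "SELECT name, setting, context, pending_restart FROM pg_settings WHERE name = 'archive_mode';"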
Vonng commented 1 month ago

You can choose one of the following ways to rebuild the other replicas (rough sketches follow the list):

  1. nuke /pg/data/* and restart patroni on the replicas, one by one
  2. run the same pgbackrest restore command on each replica (fastest, but requires a central backup repo)
  3. pg reinit, which may fail if the LSN of the PITR-ed primary is lower than that of the current replica
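
Rough sketches of these options, using the aliases defined in the pg-pitr guide above (pt-stop / pt-start wrap systemctl stop/start patroni) and the host IPs from the cluster listing; these are illustrative assumptions, not commands taken from the thread:

# Option 1 (run on one replica at a time): wipe the data dir and let Patroni
# re-bootstrap the replica from the PITR-ed leader
pt-stop                    # sudo systemctl stop patroni
rm -rf /pg/data/*          # nuke the stale data directory
pt-start                   # sudo systemctl start patroni

# Option 2 (run on each replica): restore the same backup set locally, then
# hand the instance back to Patroni; requires a centralized backup repo
pt-stop
pgbackrest --stanza=pg-meta --type=time --target='2024-05-17 07:05:46+00' restore
pt-start

# Option 3: compare LSNs first to see whether pg reinit can succeed
psql -h 10.60.10.10 -c 'SELECT pg_current_wal_lsn();'       # PITR-ed primary
psql -h 10.60.10.9  -c 'SELECT pg_last_wal_replay_lsn();'   # current replica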