CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0
3.95k stars 593 forks source link

PGUpgrade failing #3999

Closed todeb closed 3 weeks ago

todeb commented 1 month ago

Please ensure you do the following when reporting a bug:

Note that some logs needed to troubleshoot may be found in the /pgdata/<CLUSTERNAME>/pg_log directory on your Postgres instance.

An incomplete bug report can lead to delays in resolving the issue or the closing of a ticket, so please be as detailed as possible.

If you are looking for general support, please view the support page for where you can ask questions.

Thanks for reporting the issue, we're looking forward to helping you!

Overview

Pgupgrade fail when upgrading from postgres 13 to 16. Images used: crunchy-postgres:ubi8-13.8-1 crunchy-postgres:ubi8-16.3-1

Environment

Please provide the following details:

Steps to Reproduce

  1. create a cluster with: crunchy-postgres:ubi8-13.8-1
  2. follow doc for upgrading: https://access.crunchydata.com/documentation/postgres-operator/latest/guides/major-postgres-version-upgrade
  3. pgupgrade failed log:
    
    Performing PostgreSQL upgrade from version "13" to "16" ...

Step 1: Making new pgdata directory...

Step 2: Initializing new pgdata directory...

The files belonging to this database system will be owned by user "postgres". This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf-8". The default database encoding has accordingly been set to "UTF8". The default text search configuration will be set to "english".

Data page checksums are enabled.

fixing permissions on existing directory /pgdata/pg16 ... ok creating subdirectories ... ok selecting dynamic shared memory implementation ... posix sh: line 1: 29 Bus error (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=100 -c shared_buffers=1000 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 31 Bus error (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=50 -c shared_buffers=500 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 33 Bus error (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=40 -c shared_buffers=400 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 35 Bus error (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=30 -c shared_buffers=300 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 37 Bus error (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=200 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 selecting default max_connections ... 20 sh: line 1: 39 Bus error (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=16384 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 41 Bus error (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=8192 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 43 Bus error (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=4096 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 45 Bus error (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=3584 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 47 Bus error (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=3072 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 49 Bus error (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=2560 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 51 Bus error (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=2048 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 53 Bus error (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=1536 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 55 Bus error (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=1000 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 57 Bus error (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=900 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 59 Bus error (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=800 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 61 Bus error (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=700 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 63 Bus error (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=600 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 65 Bus error (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=500 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 67 Bus error (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=400 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 69 Bus error (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=300 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 71 Bus error (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=200 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 73 Bus error (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=100 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 75 Bus error (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=50 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 selecting default shared_buffers ... 400kB selecting default time zone ... UTC creating configuration files ... ok running bootstrap script ... ok performing post-bootstrap initialization ... ok syncing data to disk ... ok

Success. You can now start the database server using:

/usr/pgsql-16/bin/pg_ctl -D /pgdata/pg16 -l logfile start

initdb: warning: enabling "trust" authentication for local connections initdb: hint: You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb.

Step 3: Setting the expected permissions on the old pgdata directory...

Step 4: Copying shared_preload_libraries setting to new postgresql.conf file...

upgrade: line 18: /usr/pgsql-13/bin/postgres: No such file or directory Step 5: Running pg_upgrade check...

check for "/usr/pgsql-13/bin" failed: No such file or directory Failure, exiting

real 0m0.039s user 0m0.004s sys 0m0.008s


New postgres 16 cannot be started log:

k logs pgsql-ngt-test-pgha1-54bb-0 -n database-services Defaulted container "database" out of: database, replication-cert-copy, postgres-startup (init), nss-wrapper-init (init) 2024-09-19 10:52:08,477 INFO: No PostgreSQL configuration items changed, nothing to reload. 2024-09-19 10:52:08,629 WARNING: Postgresql is not running. 2024-09-19 10:52:08,629 INFO: Lock owner: None; I am pgsql-ngt-test-pgha1-54bb-0 2024-09-19 10:52:08,637 INFO: pg_controldata: pg_control version number: 1300 Catalog version number: 202307071 Database system identifier: 7416304783840694348 Database cluster state: shut down pg_control last modified: Thu Sep 19 10:50:25 2024 Latest checkpoint location: 0/186DFB0 Latest checkpoint's REDO location: 0/186DFB0 Latest checkpoint's REDO WAL file: 000000010000000000000001 Latest checkpoint's TimeLineID: 1 Latest checkpoint's PrevTimeLineID: 1 Latest checkpoint's full_page_writes: on Latest checkpoint's NextXID: 0:730 Latest checkpoint's NextOID: 13545 Latest checkpoint's NextMultiXactId: 1 Latest checkpoint's NextMultiOffset: 0 Latest checkpoint's oldestXID: 722 Latest checkpoint's oldestXID's DB: 1 Latest checkpoint's oldestActiveXID: 0 Latest checkpoint's oldestMultiXid: 1 Latest checkpoint's oldestMulti's DB: 1 Latest checkpoint's oldestCommitTsXid: 0 Latest checkpoint's newestCommitTsXid: 0 Time of latest checkpoint: Thu Sep 19 10:50:25 2024 Fake LSN counter for unlogged rels: 0/3E8 Minimum recovery ending location: 0/0 Min recovery ending loc's timeline: 0 Backup start location: 0/0 Backup end location: 0/0 End-of-backup record required: no wal_level setting: replica wal_log_hints setting: off max_connections setting: 20 max_worker_processes setting: 8 max_wal_senders setting: 10 max_prepared_xacts setting: 0 max_locks_per_xact setting: 64 track_commit_timestamp setting: off Maximum data alignment: 8 Database block size: 8192 Blocks per segment of large relation: 131072 WAL block size: 8192 Bytes per WAL segment: 16777216 Maximum length of identifiers: 64 Maximum columns in an index: 32 Maximum size of a TOAST chunk: 1996 Size of a large-object chunk: 2048 Date/time type storage: 64-bit integers Float8 argument passing: by value Data page checksum version: 1 Mock authentication nonce: 2a28ce363859f2cc621fe10e6b20e7038c6578ba4329e42f3d081523c1b6cd12

2024-09-19 10:52:08,640 INFO: Lock owner: None; I am pgsql-ngt-test-pgha1-54bb-0 2024-09-19 10:52:08,803 INFO: starting as a secondary 2024-09-19 10:52:09,889 INFO: postmaster pid=105 /tmp/postgres:5432 - no response 2024-09-19 10:52:09.920 UTC [105] LOG: redirecting log output to logging collector process 2024-09-19 10:52:09.920 UTC [105] HINT: Future log output will appear in directory "log". 2024-09-19 10:52:10,906 ERROR: postmaster is not running 2024-09-19 10:52:18,514 INFO: Lock owner: None; I am pgsql-ngt-test-pgha1-54bb-0 2024-09-19 10:52:18,641 INFO: failed to start postgres 2024-09-19 10:52:28,515 WARNING: Postgresql is not running. 2024-09-19 10:52:28,515 INFO: Lock owner: None; I am pgsql-ngt-test-pgha1-54bb-0 2024-09-19 10:52:28,525 INFO: pg_controldata: pg_control version number: 1300 Catalog version number: 202307071 Database system identifier: 7416304783840694348 Database cluster state: shut down pg_control last modified: Thu Sep 19 10:50:25 2024 Latest checkpoint location: 0/186DFB0 Latest checkpoint's REDO location: 0/186DFB0 Latest checkpoint's REDO WAL file: 000000010000000000000001 Latest checkpoint's TimeLineID: 1 Latest checkpoint's PrevTimeLineID: 1 Latest checkpoint's full_page_writes: on Latest checkpoint's NextXID: 0:730 Latest checkpoint's NextOID: 13545 Latest checkpoint's NextMultiXactId: 1 Latest checkpoint's NextMultiOffset: 0 Latest checkpoint's oldestXID: 722 Latest checkpoint's oldestXID's DB: 1 Latest checkpoint's oldestActiveXID: 0 Latest checkpoint's oldestMultiXid: 1 Latest checkpoint's oldestMulti's DB: 1 Latest checkpoint's oldestCommitTsXid: 0 Latest checkpoint's newestCommitTsXid: 0 Time of latest checkpoint: Thu Sep 19 10:50:25 2024 Fake LSN counter for unlogged rels: 0/3E8 Minimum recovery ending location: 0/0 Min recovery ending loc's timeline: 0 Backup start location: 0/0 Backup end location: 0/0 End-of-backup record required: no wal_level setting: replica wal_log_hints setting: off max_connections setting: 20 max_worker_processes setting: 8 max_wal_senders setting: 10 max_prepared_xacts setting: 0 max_locks_per_xact setting: 64 track_commit_timestamp setting: off Maximum data alignment: 8 Database block size: 8192 Blocks per segment of large relation: 131072 WAL block size: 8192 Bytes per WAL segment: 16777216 Maximum length of identifiers: 64 Maximum columns in an index: 32 Maximum size of a TOAST chunk: 1996 Size of a large-object chunk: 2048 Date/time type storage: 64-bit integers Float8 argument passing: by value Data page checksum version: 1 Mock authentication nonce: 2a28ce363859f2cc621fe10e6b20e7038c6578ba4329e42f3d081523c1b6cd12

2024-09-19 10:52:28,526 INFO: Lock owner: None; I am pgsql-ngt-test-pgha1-54bb-0 2024-09-19 10:52:28,527 INFO: starting as a secondary 2024-09-19 10:52:29,565 INFO: postmaster pid=176 2024-09-19 10:52:29.573 UTC [176] LOG: redirecting log output to logging collector process 2024-09-19 10:52:29.573 UTC [176] HINT: Future log output will appear in directory "log". /tmp/postgres:5432 - no response 2024-09-19 10:52:30,585 ERROR: postmaster is not running 2024-09-19 10:52:38,514 INFO: Lock owner: None; I am pgsql-ngt-test-pgha1-54bb-0 2024-09-19 10:52:38,515 INFO: failed to start postgres

When running with PG13 image it also refuses now to start, but was healthy before upgrade:

Defaulted container "database" out of: database, replication-cert-copy, postgres-startup (init), nss-wrapper-init (init) 2024-09-19 10:54:23,103 INFO: No PostgreSQL configuration items changed, nothing to reload. 2024-09-19 10:54:23,118 CRITICAL: system ID mismatch, node pgsql-ngt-test-pgha1-54bb-0 belongs to a different cluster: 7416304783840694348 != 7174435655098040412



### REPRO

above

### EXPECTED

pgupgrade complete without issues

### ACTUAL

pgupgrade failed, and postgres cluster not starting at all with new version and with the old one.

## Logs

Above

## Additional Information

na
todeb commented 1 month ago

for some additional info database with postgresql13 was initialized on host without hugepages, now it is running on host with hugepages enabled. When i scheduled the pgupgrade on host without hugepages there are no bus errors anymore although still upgrade fail.

Performing PostgreSQL upgrade from version "13" to "16" ...

Step 1: Making new pgdata directory...

Step 2: Initializing new pgdata directory...

The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are enabled.

fixing permissions on existing directory /pgdata/pg16 ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
initdb: warning: enabling "trust" authentication for local connections
initdb: hint: You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb.
syncing data to disk ... ok

Success. You can now start the database server using:

    /usr/pgsql-16/bin/pg_ctl -D /pgdata/pg16 -l logfile start

Step 3: Setting the expected permissions on the old pgdata directory...

Step 4: Copying shared_preload_libraries setting to new postgresql.conf file...

upgrade: line 18: /usr/pgsql-13/bin/postgres: No such file or directory
Step 5: Running pg_upgrade check...

check for "/usr/pgsql-13/bin" failed: No such file or directory
Failure, exiting

real    0m0.070s
user    0m0.000s
sys     0m0.009s
todeb commented 1 month ago

even running clean cluster on host without hugepages and trying pgupgrade on it also running pgupgrade on non hugapes host. It just fail. Same log as above.

todeb commented 1 month ago

i was able to perform uprade using crunchy-upgrade:ubi8-5.5.0-0 but on 5.6.0 and 5.6.1 upgrade fails.

So I think it is still something that should be addressed. Also these bus errors when running upgrade on host with hugepages.

MSandro commented 1 month ago

did you tried a step by step upgrade? 13 > 14 > 15 > 16

I just tried to upgrade 15 to 16, this worked fine with PGO 5.6.1

fara-tode commented 1 month ago

I tried to 15 and it failed. Pgsql 14 is not avaiable any more in repo. As i do not have the image for it i cannot test upgrade. 15->16 for sure will work. So that is not really helpful answer for that case.

tjmoore4 commented 4 weeks ago

@todeb Sorry to hear you hit this issue. From the logs you provided, I think the reason you are having this issue is due to the 5.6.0 upgrade image not having the required Postgres 13 binary whereas the 5.5.0 image does includes it. You can read more about this in our developer program container versions documentation. For your current scenario, I would therefore recommend using the 5.5.0 image for this upgrade, then moving to the newer images in the future. I hope this helps!

todeb commented 3 weeks ago

It is what i did. So if that is not issue but expected behavior that you cannot upgrade from older postgresql version having newer pgo and you need downgrade pgo first you can then close it.