fly-apps / postgres-flex

Postgres HA setup using repmgr

Address Collation Mismatches #230

Open davissp14 opened 3 months ago

davissp14 commented 3 months ago

This should address: https://github.com/fly-apps/postgres-flex/issues/208

Problem

A previous release resulted in a collation version change. Users running the old version will run into collation mismatch issues when upgrading to the latest release. A change in collation can corrupt indexes and cause other problems, because the database system relies on stored objects having a certain sort order.

How we are addressing it

Collation is managed per-database, so when the primary boots we will establish a local connection to each database and refresh the associated collations.
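A minimal sketch of the per-database refresh loop, not the code in this PR. It assumes PostgreSQL 15 or newer, where `ALTER DATABASE ... REFRESH COLLATION VERSION` is available. `execFunc` and `refreshAll` are hypothetical names, and `execFunc` stands in for "open a local connection to that database and run the statement".

```go
package main

import (
	"fmt"
	"strings"
)

// execFunc is a hypothetical hook: it should run stmt over a local
// connection to dbName. It is not part of any real driver API.
type execFunc func(dbName, stmt string) error

// quoteIdent quotes a PostgreSQL identifier, doubling embedded double quotes.
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// refreshAll issues one collation version refresh per database and stops
// at the first failure so the caller can retry on the next boot.
func refreshAll(dbNames []string, exec execFunc) error {
	for _, db := range dbNames {
		stmt := "ALTER DATABASE " + quoteIdent(db) + " REFRESH COLLATION VERSION"
		if err := exec(db, stmt); err != nil {
			return fmt.Errorf("refresh collation for %q: %w", db, err)
		}
	}
	return nil
}

func main() {
	// Dry run: print the statements instead of connecting to Postgres.
	printOnly := func(dbName, stmt string) error {
		fmt.Printf("[%s] %s;\n", dbName, stmt)
		return nil
	}
	if err := refreshAll([]string{"postgres", "my-app"}, printOnly); err != nil {
		fmt.Println("error:", err)
	}
}
```

Passing the executor in as a function keeps the loop testable without a running database. A real executor would also need to skip databases that do not accept connections.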

The refresh operations are pretty lightweight; however, they do require us to establish a connection per database, which is something we don't want to do on every boot. To mitigate this, we take a hash of the locale version and persist it to disk once we have confirmed that no collation issues are present. Then, on every subsequent boot, we simply compare the OS locale version with the version on disk and short-circuit if they match.
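The short-circuit described above could look roughly like this. It is a sketch under assumptions: the OS locale version is already available as a string, SHA-256 is used as the hash, and the file name and the function names (`hashVersion`, `needsRefresh`, `recordVersion`) are made up for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// hashVersion returns the hex-encoded SHA-256 of a locale version string.
func hashVersion(v string) string {
	sum := sha256.Sum256([]byte(v))
	return hex.EncodeToString(sum[:])
}

// needsRefresh reports whether the hash stored at path differs from the
// hash of osVersion. A missing file counts as "refresh needed".
func needsRefresh(path, osVersion string) (bool, error) {
	stored, err := os.ReadFile(path)
	if errors.Is(err, fs.ErrNotExist) {
		return true, nil
	}
	if err != nil {
		return false, err
	}
	return strings.TrimSpace(string(stored)) != hashVersion(osVersion), nil
}

// recordVersion persists the hash. Call it only after the refresh has
// completed without collation issues.
func recordVersion(path, osVersion string) error {
	return os.WriteFile(path, []byte(hashVersion(osVersion)+"\n"), 0o600)
}

func main() {
	dir, err := os.MkdirTemp("", "collation")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "locale_version.sha256") // hypothetical file name

	// Simulate three boots: first boot, same OS version, upgraded OS version.
	for _, v := range []string{"2.31", "2.31", "2.36"} {
		need, err := needsRefresh(path, v)
		if err != nil {
			panic(err)
		}
		fmt.Printf("OS locale version %s: refresh needed = %t\n", v, need)
		if need {
			// The per-database refresh would run here before recording.
			if err := recordVersion(path, v); err != nil {
				panic(err)
			}
		}
	}
}
```

Writing the hash only after a successful refresh means an interrupted boot simply retries the refresh next time.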

Important notes

Refreshing the collation will update the version to match the OS locale version; however, there could be some cases where certain objects need to be rebuilt...

If you are running a Flex version < v0.0.43, you may see warnings like the following while you upgrade:

ord [info]postgres | 2024-06-23 23:18:12.866 UTC [386] WARNING:  database "postgres" has a collation version mismatch
ord [info]postgres | 2024-06-23 23:18:12.866 UTC [386] DETAIL:  The database was created using collation version 2.31, but the operating system provides version 2.36.

These warnings will continue until your primary is upgraded.

Reference

https://www.postgresql.org/docs/current/sql-altercollation.html

davissp14 commented 3 months ago

So it looks like "technically" indexes and impacted objects should actually be rebuilt before the versions are refreshed. 🤔 The version refresh will clear the warning, but that doesn't necessarily mean the indexes won't be corrupted. We could potentially rebuild the objects manually, but this starts to push us pretty deep into the weeds...
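For illustration only, the "rebuild first, then refresh" ordering might be expressed as below. This is a sketch, not a proposed implementation: `rebuildThenRefresh` is a hypothetical helper, it assumes PostgreSQL 15 or newer, and `REINDEX DATABASE` only rebuilds indexes, so other collation-dependent objects would still need separate handling.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteIdent quotes a PostgreSQL identifier, doubling embedded double quotes.
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// rebuildThenRefresh returns the statements in the order they would need
// to run while connected to dbName: rebuild indexes first, and only then
// record the new collation version.
func rebuildThenRefresh(dbName string) []string {
	q := quoteIdent(dbName)
	return []string{
		"REINDEX DATABASE " + q,
		"ALTER DATABASE " + q + " REFRESH COLLATION VERSION",
	}
}

func main() {
	for _, stmt := range rebuildThenRefresh("postgres") {
		fmt.Println(stmt + ";")
	}
}
```

`REINDEX DATABASE` must be run from a connection to the database it names and cannot run inside a transaction block, which is part of why this gets expensive on large databases.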

My current thought is that we should block the fly image update upgrade path from < v0.0.43 to >= v0.0.43 and see if we can come up with a pg_dump/pg_restore based solution, as it would allow us to side-step this problem.