I have a database cluster that crashed in a way I don’t understand.
Some details about the setup:
The database that crashed is running PostgreSQL 14.1
This database has three physical standbys using repmgr
The database allows another database in the same location to subscribe to it via logical replication
The $PGDATA directory has 1.8TB total storage and was at 50% usage before this issue occurred
The pg_wal directory is symlinked to another partition that has 1.4TB of available space
Recently, during a deployment, a single transaction converted many large tables from inheritance-based partitioning to declarative partitioning. The data moved in this transaction is estimated at about 150GB. It should also be noted that all of these tables are part of a replication set for logical replication.
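For context, the tables were put into the replication set with pglogical's standard calls, roughly like this (set and table names here are placeholders, not the real ones):

  -- On the provider: add a table to the replication set and synchronize its data
  SELECT pglogical.replication_set_add_table('default', 'public.orders', true);

  -- Current membership can be inspected with:
  SELECT * FROM pglogical.replication_set_table;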
The SQL in this transaction is structured as follows (a condensed sketch is shown after the list):
BEGIN
In a DO block, create unlogged temporary tables AS SELECT * FROM the inheritance-partitioned tables
In a DO block, drop the old inheritance-partitioned tables
Multiple CREATE TABLE statements to recreate these tables using declarative partitioning
DO blocks to create the individual partitions and indexes
DO block to set proper permissions
DO block that inserts the data from the unlogged temporary tables into the base tables and drops the temporary tables
END
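A condensed sketch of that transaction (table, partition and column names are placeholders, not the real schema):

  BEGIN;

  -- Stage the existing data (done inside a DO block in the real script)
  CREATE TEMPORARY TABLE tmp_orders AS SELECT * FROM orders;

  -- Drop the old inheritance-based parent and its children
  DROP TABLE orders CASCADE;

  -- Recreate the table with declarative partitioning
  CREATE TABLE orders (
      id         bigint      NOT NULL,
      created_at timestamptz NOT NULL,
      payload    text
  ) PARTITION BY RANGE (created_at);

  -- Partitions, indexes and permissions are created in further DO blocks
  CREATE TABLE orders_2021 PARTITION OF orders
      FOR VALUES FROM ('2021-01-01') TO ('2022-01-01');

  -- Reload the data from the staging copy, then drop it
  INSERT INTO orders SELECT * FROM tmp_orders;
  DROP TABLE tmp_orders;

  COMMIT;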
All of this SQL is run as multiple PL/pgSQL DO blocks within a single transaction. In the original failure, the tables being replaced with declarative partitions were part of the replication set with an active subscriber.
I thought the issue might be that I shouldn't have active subscriptions replicating tables that are dropped and recreated within a transaction. So I tried something different: with the same active subscriber, I first removed the tables from the replication set and then executed the transaction that replaces the inheritance tables with the declarative versions. Even with the tables already removed from the replication set, the $PGDATA/base directory keeps growing until the disk is full.
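The removal was done with pglogical's standard function, something like this (set and table names are placeholders):

  -- On the provider, before running the migration transaction
  SELECT pglogical.replication_set_remove_table('default', 'public.orders');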
The growth in the Postgres data directory only stops when I drop the pglogical replication sets and replication slots (in other words, when I just drop the provider node). I have confirmed that it is not pg_wal that is holding up the space.
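Roughly how I checked where the space is going, and what stops the growth (the node name is a placeholder):

  -- Compare WAL usage with the size of the data files while the disk is filling
  SELECT pg_size_pretty(sum(size)) AS wal_size FROM pg_ls_waldir();
  SELECT pg_size_pretty(pg_database_size(current_database())) AS db_size;

  -- Dropping the provider node (which removes its replication sets and slot)
  -- is the only thing that stops $PGDATA/base from growing
  SELECT pglogical.drop_node('provider_node', true);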
We did a similar migration last year (using identical SQL with different table names) when our PG was on 10.12 and didn't see this issue. Hence, I am confused as to why an active replication slot whose replication set no longer contains the target tables causes Postgres to fill up the base disk space. I have also compared the schemas and table sizes of the target tables from before and after my SQL run, and there doesn't seem to be any difference.
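The before/after size comparison was done with something along these lines (the schema name is a placeholder):

  -- Total size (including indexes and TOAST) of every table in the schema
  SELECT c.relname,
         pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
  WHERE n.nspname = 'public'
    AND c.relkind IN ('r', 'p')
  ORDER BY pg_total_relation_size(c.oid) DESC;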
Postgres version on master database: 14.1
Postgres version on subscriber database cluster: 14.1
pglogical version on master: pglogical 2.4.0-1.rhel7
pglogical version on subscriber: pglogical 2.4.0-1.rhel7
I also tested this after upgrading pglogical to 2.4.1 and saw the same issue.