2ndQuadrant / pglogical

Logical Replication extension for PostgreSQL 15, 14, 13, 12, 11, 10, 9.6, 9.5, 9.4 (Postgres), providing much faster replication than Slony, Bucardo or Londiste, as well as cross-version upgrades.
http://2ndquadrant.com/en/resources/pglogical/
Other
987 stars 153 forks source link

Running CREATE INDEX CONCURRENTLY on subscriber can break subscription (subscriber: postgres 15.6 + pglogical 2.4.4) - error message in logs: could not read block 0 in file ""base/[DB_OID]/[IDX_RELATION_FILENODE]"": read only 0 of 8192 bytes #469

Open aristocrates opened 5 months ago

aristocrates commented 5 months ago

I don't have a minimal repro for this right now, but the symptom I have observed on a subscriber that I am syncing for an upgrade is:

2024-04-02 15:16:17.297 PDT,,"$DBNAME",4712,,6608afb3.1268,3738,"",2024-03-30 17:34:59 PDT,7/184779565,0,ERROR,XX001,"could not read block 0 in file ""base/16400/105657687"": read only 0 of 8192 bytes",,,,,"apply INSERT from remote relation $TABLE_NAME in commit before 45415/83927478, xid 197646281 committed at 1999-12-31 16:12:45.411377-08 (action #2) from node replorigin 1",,,,"pglogical apply 16400:988484226","pglogical apply 16400:988484226",,0

In this particular example, 105657687 is indeed the result of calling pg_relation_filenode($INDEX_NAME) on the currently-INVALID $INDEX_NAME that is being created concurrently at the time the subscription goes down:

$DBNAME=# select pg_relation_filenode($INDEX_NAME); -- select oid from pg_database where datname = $DBNAME; yields 16400
 pg_relation_filenode
----------------------
            105657687
(1 row)

postgres@[host]:/path/to/data/base/16400$ ls -alh 105657687
-rw------- 1 postgres postgres 0 Apr  2 15:16 105657687

The subscriber is on postgres 15.6 and pglogical 2.4.4

A workaround that seems to work fine is to just create the index non-concurrently, in which case pglogical apply on the subscriber blocks on obtaining a RowExclusiveLock until the create index finishes, and the replication slot stays put on the provider.

Based on https://github.com/2ndQuadrant/pglogical/issues/344#issuecomment-927420890 (and other guides around using pglogical on DBs with large tables) the expected behavior is for CREATE INDEX CONCURRENTLY to work fine on the subscriber:

You can always create indexes after the initial load. If you are concerned about index build that locks out writes, you can always use CONCURRENTLY.

aristocrates commented 5 months ago

I have not noticed this on all indexes, just on a handful. I would be surprised if this happened every time an update, insert, or delete replicated over while the concurrent index creation was happening, since I do not recall running into this on other DB upgrades between the same postgres versions, but the behavior replicated when I retried index creation on these particular indexes, even after re-initializing the subscriber from scratch on a new machine.