LISTEN channel stops working due to postgres bug

steve-chavez commented 10 months ago

Problem

This just happened on a production instance. NOTIFY pgrst wasn't triggering a schema cache reload while according to pg_stat_activity the LISTEN channel was up for several days:

select * from pg_stat_activity where application_name ilike '%postgrest%' and query ilike '%listen%';
-[ RECORD 1 ]----+------------------------------
datid            | 12974
datname          | postgres
pid              | 1368652
leader_pid       | 
usesysid         | 16665
usename          | authenticator
application_name | postgrest
client_addr      | 127.0.0.1
client_hostname  | 
client_port      | 53768
backend_start    | 2023-12-28 11:23:50.967734+00
xact_start       | 
query_start      | 2023-12-28 11:23:50.985951+00
state_change     | 2023-12-28 11:23:50.986375+00
wait_event_type  | Client
wait_event       | ClientRead
state            | idle
backend_xid      | 
backend_xmin     | 
query_id         | 7372465010149019010
query            | LISTEN "pgrst"
backend_type     | client backend

Further investigation lead to finding this PostgreSQL bug:

postgres=# listen pgrst;
ERROR:  could not access status of transaction 1044524795
DETAIL:  Could not open file "pg_xact/03E4": No such file or directory.

postgres=# listen pgrst;
ERROR:  could not access status of transaction 1044524795
DETAIL:  Could not open file "pg_xact/03E4": No such file or directory.

postgres=# listen xx;
ERROR:  could not access status of transaction 1044524795
DETAIL:  Could not open file "pg_xact/03E4": No such file or directory.

postgres=# listen yy;
ERROR:  could not access status of transaction 1044524795
DETAIL:  Could not open file "pg_xact/03E4": No such file or directory.

More details:

So seems the LISTEN channel is broken. I can't reproduce this locally though.

The only way to repair this error is by restarting the PostgreSQL instance AFAICT.

Workaround

Killing the LISTEN channel will force PostgREST to reload.

select pg_terminate_backend(pid) from pg_stat_activity 
where application_name ilike '%postgrest%' and query ilike '%listen%';

Further measures

Do we have any way to prevent this? Would a shorter lifetime in the LISTEN channel help?

steve-chavez commented 4 months ago

Just had this again. It seems the LISTEN on primary, conn pool on replicas thing makes it more likely to happen:

Jun 20 19:09:38 postgrest[2384]: 20/Jun/2024:19:09:38 +0000: {"code":"PGRST000","details":"connection to server at \"localhost\" (::1), port 5432 failed: FATAL:  the database system is not yet accepting connections\nDETAIL:  Consistent recovery state has not been yet reached.\nconnection to server at \"[redacted]\", port 5432 failed: session is not read-only\n","hint":null,"message":"Database connection error. Retrying the connection."}
Jun 20 19:09:38 postgrest[2384]: 20/Jun/2024:19:09:38 +0000: Attempting to reconnect to the database in 0 seconds...
Jun 20 19:09:38 postgrest[2384]: 20/Jun/2024:19:09:38 +0000: Failed listening for notifications on the "pgrst" channel. ERROR:  could not access status of transaction 2251772 DETAIL:  Could not open file "pg_xact/0002": No such file or directory.

The Listener just kept retrying while slowing down requests since the connection pool was normal, only the Listener failed.

Apparently, just running:

select pg_terminate_backend(pid)
from pg_stat_activity
where application_name = 'postgrest';

On the primary made things work again, without the need to restart the pg instance.

steve-chavez commented 4 months ago

Since it's too hard to reproduce, we need to stop retrying the schema cache whenever the LISTENER fails. That would help mitigate the damage. Also related to https://github.com/PostgREST/postgrest/discussions/3313#discussioncomment-9825419.

steve-chavez commented 4 months ago

To mitigate this, I think the easiest way would be to revert 4beac10d3d0116e0fa79324782b1baba62bb1368

PostgREST / postgrest