Open steve-chavez opened 10 months ago
Just had this again. It seems the setup with LISTEN on the primary and the connection pool on replicas makes it more likely to happen:
Jun 20 19:09:38 postgrest[2384]: 20/Jun/2024:19:09:38 +0000: {"code":"PGRST000","details":"connection to server at \"localhost\" (::1), port 5432 failed: FATAL: the database system is not yet accepting connections\nDETAIL: Consistent recovery state has not been yet reached.\nconnection to server at \"[redacted]\", port 5432 failed: session is not read-only\n","hint":null,"message":"Database connection error. Retrying the connection."}
Jun 20 19:09:38 postgrest[2384]: 20/Jun/2024:19:09:38 +0000: Attempting to reconnect to the database in 0 seconds...
Jun 20 19:09:38 postgrest[2384]: 20/Jun/2024:19:09:38 +0000: Failed listening for notifications on the "pgrst" channel. ERROR: could not access status of transaction 2251772 DETAIL: Could not open file "pg_xact/0002": No such file or directory.
The Listener just kept retrying, which slowed down requests even though the connection pool was healthy; only the Listener had failed.
Apparently, just running:
select pg_terminate_backend(pid)
from pg_stat_activity
where application_name = 'postgrest';
on the primary made things work again, without the need to restart the PostgreSQL instance.
Since this is too hard to reproduce, we need to stop retrying the schema cache load whenever the Listener fails. That would help mitigate the damage. Also related to https://github.com/PostgREST/postgrest/discussions/3313#discussioncomment-9825419.
To mitigate this, I think the easiest way would be to revert commit 4beac10d3d0116e0fa79324782b1baba62bb1368.
Problem
This just happened on a production instance.
NOTIFY pgrst wasn't triggering a schema cache reload, while according to pg_stat_activity the LISTEN channel was up for several days:
Further investigation led to finding this PostgreSQL bug:
More details:
So it seems the LISTEN channel is broken. I can't reproduce this locally though.
The only way to repair this error is by restarting the PostgreSQL instance AFAICT.
Workaround
Killing the LISTEN channel will force PostgREST to reload.
Further measures
Do we have any way to prevent this? Would a shorter lifetime for the LISTEN channel help?
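To make the "shorter lifetime" idea concrete, here is a minimal sketch of the bookkeeping it would need. It is not PostgREST code: the class `ListenerLifetime`, its methods, and the `max_age_seconds` parameter are hypothetical names used only for illustration, and whether recycling the connection actually avoids the PostgreSQL bug is still the open question above.

```python
import time
from typing import Callable, Optional


class ListenerLifetime:
    """Hypothetical helper: tracks how long a LISTEN connection has been open."""

    def __init__(
        self,
        max_age_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if max_age_seconds <= 0:
            raise ValueError("max_age_seconds must be positive")
        self.max_age_seconds = max_age_seconds
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._opened_at: Optional[float] = None

    def mark_opened(self) -> None:
        """Record the moment a fresh LISTEN connection was established."""
        self._opened_at = self._clock()

    def should_recycle(self) -> bool:
        """True if there is no connection yet or the current one is too old."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self.max_age_seconds
```

In a listener loop, a `True` result from `should_recycle()` would mean: close the current connection, open a new one, issue LISTEN again, and call `mark_opened()`. A schema cache reload right after reconnecting would cover any NOTIFY missed in between, at the cost of one extra reload per recycle.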