Open steve-chavez opened 3 years ago
For the admin port health check(https://github.com/PostgREST/postgrest/pull/2092), we'd also need to test:
/health must reply with 503
/health must reply with 200.. then a disconnection happens.. /health must reply with 503.. connection up again.. /health must reply with 200 again.
With db-channel-enabled as True and False for both cases.
Dropping an idea for my own future reference here:
To allow breaking / unbreaking the pgrst <-> pg connection, we can create an individual symlink to the pg socket for each test-case - and then rename that accordingly. If renamed to something else this will break the connection. This will allow us to keep the PG server up - should be a lot faster than starting and stopping all the time. And it will not prevent us from running the io tests in parallel down the road.
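The symlink idea above can be sketched roughly as follows. This is only an illustration under assumptions: the helper names (`linkSocket`, `breakConnection`, `restoreConnection`), the `.broken` suffix and the directory layout are hypothetical, and a regular file stands in for the real pg socket. Note that a later comment in this thread reports that moving the socket file does not affect connections that are already established, so this would only break new connection attempts.

```haskell
import System.Directory
  ( createDirectoryIfMissing, createFileLink, doesPathExist
  , getTemporaryDirectory, removePathForcibly, renamePath )
import System.FilePath ((</>))

-- | Create a per-test-case symlink pointing at the real socket file.
linkSocket :: FilePath -> FilePath -> IO ()
linkSocket realSocket link = createFileLink realSocket link

-- | "Break" the connection path by renaming the symlink away (hypothetical suffix).
breakConnection :: FilePath -> IO ()
breakConnection link = renamePath link (link ++ ".broken")

-- | Undo 'breakConnection' by renaming the symlink back.
restoreConnection :: FilePath -> IO ()
restoreConnection link = renamePath (link ++ ".broken") link

-- | Demo with a regular file standing in for the socket.
-- Returns whether the link path resolves before, during and after the break.
demoSymlink :: IO (Bool, Bool, Bool)
demoSymlink = do
  tmp <- getTemporaryDirectory
  let dir    = tmp </> "pgrst-symlink-sketch"
      target = dir </> "real-socket"
      link   = dir </> "test-case-1" </> ".s.PGSQL.5432"
  removePathForcibly dir
  createDirectoryIfMissing True (dir </> "test-case-1")
  writeFile target ""
  linkSocket target link
  up <- doesPathExist link
  breakConnection link
  down <- doesPathExist link
  restoreConnection link
  upAgain <- doesPathExist link
  removePathForcibly dir
  pure (up, down, upAgain)
```

Each test case would get its own link directory, so the pg server itself stays up and other test cases are unaffected.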
At startup, when the schema cache load fails (could be because of a statement_timeout=1 or REVOKEd privileges on pg_catalog) we try to reload it on an endless loop.
This is done here:
It's not simple to remove the autoreload of the schema cache because the connection can be lost in the connection worker (for example because of a DNS error, a "too many clients" error from pg, etc).
If we remove the autoreload we might have a regression like https://github.com/PostgREST/postgrest/pull/1685.
Somehow joining the schema cache loading process on the connection worker and retrying it with exponential backoff(instead of endless loop) could be a solution.
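A minimal sketch of what "retry with exponential backoff instead of an endless loop" could look like, using only base. Everything here is an assumption for illustration: `retryWithBackoff`, `backoffDelay` and the injected sleep function are hypothetical names, and the action stands in for the schema cache load rather than reproducing PostgREST's actual code.

```haskell
import Control.Concurrent (threadDelay)
import Data.IORef (modifyIORef', newIORef, readIORef)

-- | Delay in seconds before the next retry: 1, 2, 4, ... up to the cap.
backoffDelay :: Int -> Int -> Int
backoffDelay cap attempt = min cap (2 ^ min attempt 30)

-- | Real sleep, in seconds; tests can inject a recording function instead.
sleepSeconds :: Int -> IO ()
sleepSeconds s = threadDelay (s * 1000000)

-- | Run the action until it succeeds or the attempts run out.
retryWithBackoff
  :: (Int -> IO ())   -- ^ sleep function, taking seconds
  -> Int              -- ^ maximum number of attempts
  -> Int              -- ^ cap on the delay, in seconds
  -> IO (Either e a)  -- ^ action, e.g. a (hypothetical) schema cache load
  -> IO (Either e a)
retryWithBackoff sleep maxAttempts cap action = go 0
  where
    go attempt = do
      res <- action
      case res of
        Right _ -> pure res
        Left _
          | attempt + 1 >= maxAttempts -> pure res  -- give up, return last error
          | otherwise -> do
              sleep (backoffDelay cap attempt)
              go (attempt + 1)

-- | Demo: the action fails twice, then succeeds; sleeps are recorded, not performed.
demoBackoff :: IO (Either String Int, [Int])
demoBackoff = do
  calls  <- newIORef (0 :: Int)
  sleeps <- newIORef []
  let action = do
        n <- readIORef calls
        modifyIORef' calls (+ 1)
        pure (if n < 2 then Left "schema cache load failed" else Right n)
      recordSleep s = modifyIORef' sleeps (s :)
  res <- retryWithBackoff recordSleep 5 32 action
  recorded <- readIORef sleeps
  pure (res, reverse recorded)
```

In production the first argument would be `sleepSeconds`; passing the sleep function in keeps the retry logic testable without real delays.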
Optional: A recovery test for EMFILE could be added as well: https://github.com/PostgREST/postgrest/pull/2158#issuecomment-1034074693
Metrics(https://github.com/PostgREST/postgrest/pull/2129) is required to test https://github.com/PostgREST/postgrest/issues/1094 and https://github.com/PostgREST/postgrest/issues/2364 (pool protection).
Related https://github.com/PostgREST/postgrest-docs/issues/557
Edit: this is already done on the io tests
Just realized that we can test the "pool protection" stuff by locking a pool of size 1 with an rpc/sleep and then testing the subsequent requests finish quickly.
We also need a postgres 15 to test https://github.com/PostgREST/postgrest/pull/2413
We also need tests for when we abort the recovery procedure. Namely checkIsFatal:
The recovery logic can now be totally inside AppState, this would make it more understandable/manageable.
So right now we pass connectionWorker to the main app:
And then we activate connectionWorker based on a 503 response (Response.isServiceUnavailable checks this):
Instead, this logic could be inside usePool (without the 503 indirection):
We'd just have to catch SQL exceptions that we map to 503 there. Like:
I hesitate to refactor this right now, since we don't have tests.
Looking at the libpq haskell lib, it has a reset function that says:
This might be useful for error recovery if a working connection is lost.
That makes me wonder if we could have the connection recovery off-core. Say in a hasql-pool-retry library.
The ideal interface for us would expose a useRetry wrapper like:
useRetry :: Pool -> IO () -> IO () -> Session a -> IO (Either WrappedUsageError a)
useRetry pool retryAction successAction sess = Hasql.use pool -- ...
On retryAction we could log the failed retries like we do now. hasql-pool-retry would internally do a reset here (or even a simple SELECT 1 could be good for starters).
On successAction we could reload our schema cache + config like we do now. This means that hasql-pool-retry acquired the connection.
WrappedUsageError is a wrapper over UsageError, which would contain an additional InRecoveryError x where x is the time until the next retry. With this we could inform clients that we're retrying the connection like we do now.
This seems generally useful outside of PostgREST to me. @robx WDYT?
The checkIsFatal we do could also be represented by a FailedRecoveryError, we could use this for killing the main thread as we do now.
Having a wait time (which we could equate to db-pool-acquisition-timeout) for getting a connection after loss would be great too. This way our requests would be resilient to a pg_terminate_backend (which is needed for read replicas https://github.com/PostgREST/postgrest/issues/2781).
This would be very interesting bc it's somewhat similar to pgbouncer pause/resume. Later on it could be used as a way to scale to multiple databases(https://github.com/PostgREST/postgrest/issues/2798).
Sorry, have been offline for a bit and missed this. Catching up these days
Regarding the concrete question about reset (PQreset in libpq):
I don't think that's going to be particularly useful for us. All it does is close the underlying connection and open a new one. This would only save allocating a new connection object (and the pool management overhead); but that should be insignificant compared to actually establishing a new connection to the postgres server.
I think I like the idea of a generic useRetry, although at the time of writing I'm a bit fuzzy on how the retrying would work. Are there errors where it makes sense for PostgREST to retry internally (with some backoff strategy), and others where we defer this to the client? I'm not really clear on what useRetry would do, particularly when successAction is run. I'm imagining something roughly like:
useRetry retryState retryAction session = do
  -- pool and successAction are assumed to be in scope
  res <- use pool session
  case checkRetriable retryState res of
    Retry retryState' -> do
      retryAction retryState'
      backoffSleep retryState'
      useRetry retryState' retryAction session
    RanOutOfRetries -> do
      return $ RetryFailed res -- maybe augmented with some retry details
    NonRetriableError -> do
      return $ Error res
    Success -> do
      successAction
      return $ Success res
But then why not leave running successAction to the caller?
Are there errors where it makes sense for PostgREST to retry internally (with some backoff strategy), and others where we defer this to the client?
Yes, for example when the password changes upstream (retrying is no use) - then the user would have to edit the database connection string anyway. We also have some extra conditions on checkIsFatal for stopping retrying. Hm, maybe useRetry could also accept a list of ResultError for knowing when to stop?
But then why not leave running successAction to the caller?
Hm, yeah. I think that could work too. The interface was just an idea.
In case it helps, I've documented the recovery process here.
@robx Thinking more about it, we can have a much simpler interface. Just:
useRetry :: Pool -> Session a -> IO (Either WrappedUsageError a)
useRetry pool sess = Hasql.use pool -- ...
On retryAction we could log the failed retries like we do now.
No retryAction. We don't really need to do this, it would be enough to log to stderr like we do for the acquisition timeout:
The resulting WrappedUsageError should be enough for this.
On successAction we could reload our schema cache + config like we do now. This means that hasql-pool-retry acquired the connection.
No successAction. We can change the logic of the current "connection worker" to do the schema cache reloading by using useRetry.
Hm, maybe useRetry could also accept a list of ResultError for knowing when to stop?
That would also be unnecessary since we can also act on the WrappedUsageError when it happens.
So really the main goal is to have useRetry wait like we do for the acquisition timeout now. With the difference that it would wait for the db to be reachable. It might even fit in hasql-pool itself (but no problem if we do it on hasql-pool-retry).
Also, I was thinking we should have this timeout be equal to the acquisition timeout but maybe it can be another configurable timeout. I see HikariCP having an initializationFailTimeout, which is similar to what we want to do here.
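The "wait until the db is reachable, bounded by a timeout" behaviour could be sketched like this with base only. This is a hedged illustration, not hasql-pool's implementation: `waitUntilReachable` and the probe action are hypothetical, and the probe stands in for whatever check (for example a SELECT 1) would be used against the database.

```haskell
import Control.Concurrent (threadDelay)
import Data.IORef (modifyIORef', newIORef, readIORef)
import Data.Maybe (isJust)
import System.Timeout (timeout)

-- | Poll the probe until it reports success or the timeout elapses.
-- Both durations are in microseconds. Returns False on timeout.
waitUntilReachable :: Int -> Int -> IO Bool -> IO Bool
waitUntilReachable timeoutMicros pollMicros probe =
  isJust <$> timeout timeoutMicros loop
  where
    loop = do
      ok <- probe
      if ok then pure () else threadDelay pollMicros >> loop

-- | Demo: one probe becomes reachable on its third call, another never does.
demoWait :: IO (Bool, Bool)
demoWait = do
  calls <- newIORef (0 :: Int)
  let comesBack = do
        n <- readIORef calls
        modifyIORef' calls (+ 1)
        pure (n >= 2)
  recovered  <- waitUntilReachable 1000000 1000 comesBack
  neverReady <- waitUntilReachable 20000 1000 (pure False)
  pure (recovered, neverReady)
```

Whether the timeout reuses the acquisition timeout or gets its own setting only changes the first argument.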
The simplest initial test case I think would be having a useRetry be resilient to a pg_terminate_backend (https://github.com/PostgREST/postgrest/issues/2781#issue-1706822853). After initializing the pool, use just fails. useRetry would wait and succeed.
Then we would cover other cases like a socket error as Wolfgang mentioned above.
useRetry :: Pool -> Session a -> IO (Either WrappedUsageError a)
useRetry pool sess = Hasql.use pool -- ...
No successAction. We can change the logic of the current "connection worker" to do the schema cache reloading by using useRetry.
Hm, forgot about one thing. So say we lose the connection and at this time the user runs migrations on the db, event trigger notifications won't fire for us. useRetry then recovers the connection and we can serve requests again. However our schema cache is stale. This is why it's important to know when the pool reestablished a connection.
So maybe:
useRetry :: Pool -> Session a -> IO (Either WrappedUsageError (a, Bool))
The Bool would indicate that the connection was recovered, with that we can reload the schema cache. Maybe that's preferable to a successAction.
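To make the `(a, Bool)` idea concrete, here is a self-contained sketch with mocked types. All of it is assumed for illustration: `UsageError` and `WrappedUsageError` below are simplified stand-ins (not the hasql-pool types, and without the proposed InRecoveryError), the session is a plain IO action instead of a Session run on a Pool, retries happen immediately without waiting, and `handleRequest` and the reload counter are hypothetical.

```haskell
import Control.Monad (when)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- Simplified stand-ins for the real error types.
data UsageError = ConnectionLost | OtherError String deriving (Eq, Show)
data WrappedUsageError = Wrapped UsageError deriving (Eq, Show)

-- | Retry on a lost connection; the Bool reports whether a retry was needed,
-- i.e. whether the connection was recovered along the way.
useRetry :: Int -> IO (Either UsageError a) -> IO (Either WrappedUsageError (a, Bool))
useRetry maxRetries runSession = go (0 :: Int)
  where
    go retries = do
      res <- runSession
      case res of
        Right a -> pure (Right (a, retries > 0))
        Left ConnectionLost
          | retries < maxRetries -> go (retries + 1)
        Left err -> pure (Left (Wrapped err))

-- | Caller side: reload the (here: counted) schema cache only after a recovery.
handleRequest :: IORef Int -> IO (Either UsageError a) -> IO (Either WrappedUsageError a)
handleRequest reloads runSession = do
  res <- useRetry 3 runSession
  case res of
    Right (a, recovered) -> do
      when recovered (modifyIORef' reloads (+ 1))  -- stands in for the schema cache reload
      pure (Right a)
    Left err -> pure (Left err)

-- | Demo: the session loses the connection once, then succeeds.
demoRecovery :: IO (Either WrappedUsageError String, Int)
demoRecovery = do
  calls   <- newIORef (0 :: Int)
  reloads <- newIORef 0
  let session = do
        n <- readIORef calls
        modifyIORef' calls (+ 1)
        pure (if n == 0 then Left ConnectionLost else Right "ok")
  res   <- handleRequest reloads session
  count <- readIORef reloads
  pure (res, count)
```

Compared to a successAction callback, the caller decides what "recovered" should trigger, which keeps the retry wrapper free of PostgREST-specific logic.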
To allow breaking / unbreaking the pgrst <-> pg connection, we can create an individual symlink to the pg socket for each test-case - and then rename that accordingly.
https://github.com/PostgREST/postgrest/issues/1766#issuecomment-1004391470
Related to the above, I just tried moving the socket file:
/run/user/1000/postgrest/postgrest-with-postgresql-16-FRk/socket$ mv .s.PGSQL.5432 ..
/run/user/1000/postgrest/postgrest-with-postgresql-16-FRk/socket$ mv ../.s.PGSQL.5432 .
And it does not break the connection if it's already established, doing curl localhost:3000/items keeps working. But if the pool max idletime/lifetime is reached, then the new pool connection creation will fail.
The listener doesn't fail either.
So it looks like if we want to add io tests for this we also need to wait for the pool lifetime (looks prone to CI errors though) or else find another way to immediately break connections.
We need a recovery test for only breaking a LISTEN connection too. Related to https://github.com/PostgREST/postgrest/pull/3572.
Currently recovery tests are done manually, it'd be great to have them as automated tests.
These are the main scenarios:
(the connection recovery worker is referred to as just "worker")
1. postgrest started with a pg connection, then pg becomes unavailable
{"details":"no connection to the server\n","message":"Database client error. Retrying the connection."}
ALTER ROLE postgrest_test_authenticator SET pgrst.db_schemas = 'public'; and try a GET /public_consumers which should give a 404 if the in-db config isn't re-read.
2. unavailable pg, postgrest started
503 {"message":"Database connection lost. Retrying the connection."}
Connection refused. This must be because of the mvarConnectionStatus MVar, it doesn't happen on 1 though.
3. SIGUSR1 - NOTIFY reload schema
refIsWorkerOn, this can be confirmed by doing several SIGUSR1 and just noting one "Attempting to reconnect to the database in 1 seconds..." message. If refIsWorkerOn is removed, there will be several "Attempting to reconnect to the database in 1 seconds..." messages.
listener recovers, e.g. doing a NOTIFY 'reload cache/load config' should work after recovery.