PostgREST / postgrest

REST API for any Postgres database
https://postgrest.org
MIT License
23.27k stars 1.02k forks source link

fix: listener retries with exponential backoff #3536

Closed steve-chavez closed 4 months ago

steve-chavez commented 4 months ago

Continues https://github.com/PostgREST/postgrest/pull/3533

Now it works like:

$ postgrest-with-postgresql-16 --replica -f test/spec/fixtures/load.sql postgrest-run

19/May/2024:20:24:28 -0500: Retrying listening for notifications in 1 seconds...
19/May/2024:20:24:28 -0500: Failed listening for notifications on the "pgrst" channel. ERROR:  cannot execute LISTEN during recovery
19/May/2024:20:24:30 -0500: Retrying listening for notifications in 2 seconds...
19/May/2024:20:24:30 -0500: Failed listening for notifications on the "pgrst" channel. ERROR:  cannot execute LISTEN during recovery
19/May/2024:20:24:34 -0500: Retrying listening for notifications in 4 seconds...
19/May/2024:20:24:34 -0500: Failed listening for notifications on the "pgrst" channel. ERROR:  cannot execute LISTEN during recovery

Limitation

Once the listen channel is recovered, the retry status is not reset. So if the last backoff was 4 seconds, the next time recovery kicks in the backoff will be 8 seconds. If a recovery reaches 32 seconds, the backoff will stay there.

This is a problem with the interaction between hasql-notifications and retry. The Hasql.Notifications.waitForNotifications uses a forever loop that only finishes when it throws an exception, retry recovers on an exception and succeeds (here it restarts the retry status) when the function finishes .

I've left this as a TODO for now, it's still better than retrying without backoff.

steve-chavez commented 4 months ago

Really puzzled by this error that only happens on CI, not locally:

def set_statement_timeout(postgrest, role, milliseconds):
        """Set the statement timeout for the given role.
        For this to work reliably with low previous timeout settings,
        use a postgrest instance that doesn't use the affected role."""

        response = postgrest.session.post(
            "/rpc/set_statement_timeout", data={"role": role, "milliseconds": milliseconds}
        )
>       assert response.text == ""
E       assert '{"code":"57P01","details":null,"hint":null,"message":"terminating connection due to administrator command"}' == ''
E         
E         + {"code":"57P01","details":null,"hint":null,"message":"terminating connection due to administrator command"}

Whenever io tests fail, they're really hard to debug.

steve-chavez commented 4 months ago

It's like something is calling pg_terminate_backend. AFAICT there's no relationship between statement_timeout and LISTEN.

The only place that's done is here:

https://github.com/PostgREST/postgrest/blob/1ad6dc6b60d61651c56e800bdcb551145877af0d/test/io/test_io.py#L1220-L1222

Managed to reproduce the error locally, the only related db logs:

2024-05-19 21:16:48.888 UTC [908423] LOG:  statement: SELECT terminate_pgrst()
2024-05-19 21:16:48.889 UTC [907910] FATAL:  terminating connection due to administrator command
2024-05-19 21:16:48.889 UTC [908419] FATAL:  terminating connection due to administrator command
2024-05-19 21:16:48.889 UTC [908420] FATAL:  terminating connection due to administrator command
steve-chavez commented 4 months ago

It's like something is calling pg_terminate_backend

That was it, I just skipped this test:

https://github.com/PostgREST/postgrest/blob/44160502f423190978efe678c941357ac761b782/test/io/test_io.py#L1212-L1226

And the tests no longer fail.

The thing that has changed in this PR is that the listener no longer waits for the connectionWorker (since the listener now has its own backoff), so looks like the pg_terminate_backend was affecting the LISTEN connection too.


I think this can be fixed by isolating terminate_pgrst with a particular application name.

Fixed on https://github.com/PostgREST/postgrest/pull/3537