Closed laurenceisla closed 3 weeks ago
Local binary:
PGRST_DB_URI=postgres://postgres@localhost:5432/postgres \
PGRST_DB_ANON_ROLE=postgres \
PGRST_DB_LOG_LEVEL=info \
postgrest-run
There seems to be a mistake above, postgrest-run
doesn't work but I can reproduce with the binary. Like so:
PGRST_DB_URI=postgres://postgres@localhost:5432/postgres \
PGRST_DB_ANON_ROLE=postgres \
PGRST_DB_LOG_LEVEL=info \
./postgrest
03/Jun/2024:13:40:53 -0500: Starting PostgREST 12.1...
03/Jun/2024:13:40:53 -0500: Listening on port 3000
03/Jun/2024:13:40:53 -0500: Attempting to connect to the database...
Assertion failed: *(&once->n) != 4 (threads.c: k5_once: 564)
Aborted (core dumped)
Edit: Corrected the issue description.
So, essentially you are saying that the static build fails to run, right?
Can't reproduce, yet:
PGRST_DB_LOG_LEVEL=info postgrest-with-postgresql-15 /nix/store/7c9y713ayzvj1812bhiac38f0wkvvcfv-postgrest-static-x86_64-unknown-linux-musl-12.1/bin/postgrest
postgrest-with-postgresql-15: You can connect with: psql 'postgres:///postgres?host=/run/user/1000/postgrest/postgrest-with-postgresql-15-2tR/socket' -U postgres
postgrest-with-postgresql-15: You can tail the logs with: tail -f /run/user/1000/postgrest/postgrest-with-postgresql-15-2tR/db.log
03/Jun/2024:20:39:36 +0200: Starting PostgREST 12.1...
03/Jun/2024:20:39:36 +0200: Attempting to connect to the database...
03/Jun/2024:20:39:36 +0200: Listening on port 3000
03/Jun/2024:20:39:36 +0200: Listening for notifications on the "pgrst" channel
03/Jun/2024:20:39:36 +0200: Successfully connected to PostgreSQL 15.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 13.2.0, 64-bit
03/Jun/2024:20:39:36 +0200: Config reloaded
03/Jun/2024:20:39:36 +0200: Schema cache queried in 1.9 milliseconds
03/Jun/2024:20:39:36 +0200: Schema cache loaded 0 Relations, 0 Relationships, 0 Functions, 0 Domain Representations, 4 Media Type Handlers
Can't reproduce by downloading the binary either.
Can reproduce with the downloaded binary and a local PG (non nix tooled)
PGRST_DB_LOG_LEVEL=info PGRST_DB_URI=postgres://postgres@localhost:5432/postgres PGRST_DB_ANON_ROLE=postgres ./postgrest
03/Jun/2024:20:46:46 +0200: Starting PostgREST 12.1...
03/Jun/2024:20:46:46 +0200: Listening on port 3000
03/Jun/2024:20:46:46 +0200: Attempting to connect to the database...
Assertion failed: *(&once->n) != 4 (threads.c: k5_once: 564)
[1] 1424861 IOT instruction (core dumped) PGRST_DB_LOG_LEVEL=info PGRST_DB_URI= PGRST_DB_ANON_ROLE=postgres ./postgrest
And with the fresh build of nix-build -A postgrestStatic
as well:
PGRST_DB_LOG_LEVEL=info PGRST_DB_URI=postgres://postgres@localhost:5432/postgres PGRST_DB_ANON_ROLE=postgres /nix/store/7c9y713ayzvj1812bhiac38f0wkvvcfv-postgrest-static-x86_64-unknown-linux-musl-12.1/bin/postgrest
03/Jun/2024:20:47:42 +0200: Starting PostgREST 12.1...
03/Jun/2024:20:47:42 +0200: Listening on port 3000
03/Jun/2024:20:47:42 +0200: Attempting to connect to the database...
Assertion failed: *(&once->n) != 4 (threads.c: k5_once: 564)
[1] 1427036 IOT instruction (core dumped) PGRST_DB_LOG_LEVEL=info PGRST_DB_URI= PGRST_DB_ANON_ROLE=postgres
This should allow bisecting.
Can reproduce with the downloaded binary and a local PG (non nix tooled)
Same for me, it doesn't happen when using the nix postgrest-with-postgresql-*
pg, only when using my distro postgres. What's strange is that it doesn't even log the database version, it just dies.
Error comes from kerberos lib: https://github.com/krb5/krb5//blob/5f0023d5f05e95021a7caa1193f76f86871222ce/src/util/support/threads.c#L558-L565
Maybe the error starts from https://github.com/PostgREST/postgrest/pull/3169? (GSSAPI)
Interesting: The error doesn't seem to happen every time. It fails sometimes and sometimes doesn't - on the same commit.
git bisect
points to 3cf56561 as the first bad commit.
Ah damn, it likely is the exception throwing and catching I did https://github.com/PostgREST/postgrest/blob/a46bea16e239d120804d7d125f1f737e4bfb3b8d/src/PostgREST/Listener.hs#L45-L53
Somehow it must be affecting some state on the C lib. I will try a different approach.
Why this doesn't happen on the local nix postgres though? Because it connects through unix sockets?
Why this doesn't happen on the local nix postgres though? Because it connects through unix sockets?
I don't think so. Using postgrest-run
in the nix environment but connecting to my local database over TCP works OK (it's my last example).
I think it needs the combination of static executable + TCP to break.
So it wasn't the exception catching/throwing, I just removed that and it fails the same.
Another change on https://github.com/PostgREST/postgrest/commit/3cf565614d6529b4f691699b254fd3a680dfcda0 is that now the listener and the connection worker start in parallel.
Before the listener waited for the connection worker, here https://github.com/PostgREST/postgrest/commit/3cf565614d6529b4f691699b254fd3a680dfcda0#diff-f6074642dc32a94e9e1997b21b0df631b2d3f098e0a81673edc22b89a86f49edL534-L538. But that was only done to reuse the exponential backoff retrying from the connection worker, I didn't know it would have this effect. I guess the kerberos lib doesn't work with parallel connections?
Hm, not sure whether this is about kerberos (but yes, could be). Could also be about libpq itself:
https://www.postgresql.org/docs/current/libpq-threading.html
One thread restriction is that no two threads attempt to manipulate the same PGconn object at the same time. In particular, you cannot issue concurrent commands from different threads through the same connection object.
Maybe something like that is happening accidentally?
Oh, on the same page later on:
If you are using Kerberos inside your application (in addition to inside libpq), you will need to do locking around Kerberos calls because Kerberos functions are not thread-safe. See function PQregisterThreadLock in the libpq source code for a way to do cooperative locking between libpq and your application.
Ah, that doesn't seem relevant, though, because we are not using kerberos additionally. But it could explain why libpq throws up in that area when maybe the thing from above happens?
I tested the current main branch with disabled gssapi / krb5. This makes the error go away, too.
The question is whether we can find a way / configuration where we can have your changes and gssapi-support?
For now, I pushed a revert of the GSSAPI commit. I will try this again, once we get a truly static libpq
build via pkgsStatic
- this could be related.
Oh wait.. I have a fix ready now that brings back waiting between the listener and connection worker..
Happy to have that in, if it improves the code. But I would still not want to release the GSSAPI change right now. I'm not 100% convinced that this can not cause problems elsewhere, maybe when two pool connections start up at the same time or something like that.
I actually kind of like the fact that we have a reproducer for this case, so I'm fine with keeping it as is.
Yeah, I think it's worth adding it back. If anything it reduces wastefully retrying both listener and the connection worker when postgres is completely down. Doing it on https://github.com/PostgREST/postgrest/pull/3573.
maybe when two pool connections start up at the same time or something like that.
Right, that could still be trouble.
Environment
Description of issue
Executing the binary or docker release for the
devel
branch , returns an error when trying to connect to the database.Local binary:
Docker:
Running the
main
branch from a Nix environment does not return this error: