PostgREST / postgrest

REST API for any Postgres database
https://postgrest.org
MIT License
22.65k stars 1k forks source link

"threads.c" error when running binaries or docker image for amd64 devel version #3569

Closed laurenceisla closed 3 weeks ago

laurenceisla commented 3 weeks ago

Environment

Description of issue

Executing the binary or docker release for the devel branch , returns an error when trying to connect to the database.

Local binary:

PGRST_DB_URI=postgres://postgres@localhost:5432/postgres \
PGRST_DB_ANON_ROLE=postgres \
PGRST_DB_LOG_LEVEL=info \
./postgrest
03/Jun/2024:13:19:01 -0500: Starting PostgREST 12.1...
03/Jun/2024:13:19:01 -0500: Listening on port 3000
03/Jun/2024:13:19:01 -0500: Attempting to connect to the database...
Assertion failed: *(&once->n) != 4 (threads.c: k5_once: 564)
fish: Job 1, 'PGRST_DB_URI=postgres://postgre…' terminated by signal PGRST_DB_ANON_ROLE=postgres \ (PGRST_DB_LOG_LEVEL=info \)
fish: Job ./postgrest, 'SIGABRT' terminated by signal Abort ()

Docker:

sudo docker run --net=host --rm --name test-db-conn -p 3000:3000 \
            -e PGRST_DB_URI="postgres://postgres@localhost:5432/postgres" \
            -e PGRST_DB_ANON_ROLE="postgres" \
            -e PGRST_LOG_LEVEL="info" \
            postgrest/postgrest:devel
03/Jun/2024:18:21:03 +0000: Starting PostgREST 12.1...
03/Jun/2024:18:21:03 +0000: Listening on port 3000
03/Jun/2024:18:21:03 +0000: Attempting to connect to the database...
Assertion failed: *(&once->n) != 4 (threads.c: k5_once: 564)

Running the main branch from a Nix environment does not return this error:

PGRST_DB_URI=postgres://postgres@localhost:5432/postgres \
PGRST_DB_ANON_ROLE=postgres \
PGRST_DB_LOG_LEVEL=info \
postgrest-run
03/Jun/2024:13:18:45 -0500: Starting PostgREST 12.1 (b6c6f2b)...
03/Jun/2024:13:18:45 -0500: Attempting to connect to the database...
03/Jun/2024:13:18:45 -0500: Listening on port 3000
03/Jun/2024:13:18:45 -0500: Successfully connected to PostgreSQL 15.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 12.3.0, 64-bit
...
steve-chavez commented 3 weeks ago

Local binary:

PGRST_DB_URI=postgres://postgres@localhost:5432/postgres \
PGRST_DB_ANON_ROLE=postgres \
PGRST_DB_LOG_LEVEL=info \
postgrest-run

There seems to be a mistake above, postgrest-run doesn't work but I can reproduce with the binary. Like so:

PGRST_DB_URI=postgres://postgres@localhost:5432/postgres \
PGRST_DB_ANON_ROLE=postgres \
PGRST_DB_LOG_LEVEL=info \
./postgrest

03/Jun/2024:13:40:53 -0500: Starting PostgREST 12.1...
03/Jun/2024:13:40:53 -0500: Listening on port 3000
03/Jun/2024:13:40:53 -0500: Attempting to connect to the database...
Assertion failed: *(&once->n) != 4 (threads.c: k5_once: 564)
Aborted (core dumped)

Edit: Corrected the issue description.

wolfgangwalther commented 3 weeks ago

So, essentially you are saying that the static build fails to run, right?

Can't reproduce, yet:

PGRST_DB_LOG_LEVEL=info postgrest-with-postgresql-15 /nix/store/7c9y713ayzvj1812bhiac38f0wkvvcfv-postgrest-static-x86_64-unknown-linux-musl-12.1/bin/postgrest
postgrest-with-postgresql-15: You can connect with: psql 'postgres:///postgres?host=/run/user/1000/postgrest/postgrest-with-postgresql-15-2tR/socket' -U postgres
postgrest-with-postgresql-15: You can tail the logs with: tail -f /run/user/1000/postgrest/postgrest-with-postgresql-15-2tR/db.log
03/Jun/2024:20:39:36 +0200: Starting PostgREST 12.1...
03/Jun/2024:20:39:36 +0200: Attempting to connect to the database...
03/Jun/2024:20:39:36 +0200: Listening on port 3000
03/Jun/2024:20:39:36 +0200: Listening for notifications on the "pgrst" channel
03/Jun/2024:20:39:36 +0200: Successfully connected to PostgreSQL 15.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 13.2.0, 64-bit
03/Jun/2024:20:39:36 +0200: Config reloaded
03/Jun/2024:20:39:36 +0200: Schema cache queried in 1.9 milliseconds
03/Jun/2024:20:39:36 +0200: Schema cache loaded 0 Relations, 0 Relationships, 0 Functions, 0 Domain Representations, 4 Media Type Handlers

Can't reproduce by downloading the binary either.

wolfgangwalther commented 3 weeks ago

Can reproduce with the downloaded binary and a local PG (non nix tooled)

PGRST_DB_LOG_LEVEL=info PGRST_DB_URI=postgres://postgres@localhost:5432/postgres PGRST_DB_ANON_ROLE=postgres ./postgrest
03/Jun/2024:20:46:46 +0200: Starting PostgREST 12.1...
03/Jun/2024:20:46:46 +0200: Listening on port 3000
03/Jun/2024:20:46:46 +0200: Attempting to connect to the database...
Assertion failed: *(&once->n) != 4 (threads.c: k5_once: 564)
[1]    1424861 IOT instruction (core dumped)  PGRST_DB_LOG_LEVEL=info PGRST_DB_URI= PGRST_DB_ANON_ROLE=postgres ./postgrest

And with the fresh build of nix-build -A postgrestStatic as well:

PGRST_DB_LOG_LEVEL=info PGRST_DB_URI=postgres://postgres@localhost:5432/postgres PGRST_DB_ANON_ROLE=postgres /nix/store/7c9y713ayzvj1812bhiac38f0wkvvcfv-postgrest-static-x86_64-unknown-linux-musl-12.1/bin/postgrest
03/Jun/2024:20:47:42 +0200: Starting PostgREST 12.1...
03/Jun/2024:20:47:42 +0200: Listening on port 3000
03/Jun/2024:20:47:42 +0200: Attempting to connect to the database...
Assertion failed: *(&once->n) != 4 (threads.c: k5_once: 564)
[1]    1427036 IOT instruction (core dumped)  PGRST_DB_LOG_LEVEL=info PGRST_DB_URI= PGRST_DB_ANON_ROLE=postgres

This should allow bisecting.

steve-chavez commented 3 weeks ago

Can reproduce with the downloaded binary and a local PG (non nix tooled)

Same for me, it doesn't happen when using the nix postgrest-with-postgresql-* pg, only when using my distro postgres. What's strange is that it doesn't even log the database version, it just dies.

steve-chavez commented 3 weeks ago

Error comes from kerberos lib: https://github.com/krb5/krb5//blob/5f0023d5f05e95021a7caa1193f76f86871222ce/src/util/support/threads.c#L558-L565

Maybe the error starts from https://github.com/PostgREST/postgrest/pull/3169? (GSSAPI)

wolfgangwalther commented 3 weeks ago

Interesting: The error doesn't seem to happen every time. It fails sometimes and sometimes doesn't - on the same commit.

wolfgangwalther commented 3 weeks ago

git bisect points to 3cf56561 as the first bad commit.

steve-chavez commented 3 weeks ago

Ah damn, it likely is the exception throwing and catching I did https://github.com/PostgREST/postgrest/blob/a46bea16e239d120804d7d125f1f737e4bfb3b8d/src/PostgREST/Listener.hs#L45-L53

Somehow it must be affecting some state on the C lib. I will try a different approach.

Why this doesn't happen on the local nix postgres though? Because it connects through unix sockets?

laurenceisla commented 3 weeks ago

Why this doesn't happen on the local nix postgres though? Because it connects through unix sockets?

I don't think so. Using postgrest-run in the nix environment but connecting to my local database over TCP works OK (it's my last example).

wolfgangwalther commented 3 weeks ago

I think it needs the combination of static executable + TCP to break.

steve-chavez commented 3 weeks ago

So it wasn't the exception catching/throwing, I just removed that and it fails the same.

Another change on https://github.com/PostgREST/postgrest/commit/3cf565614d6529b4f691699b254fd3a680dfcda0 is that now the listener and the connection worker start in parallel.

Before the listener waited for the connection worker, here https://github.com/PostgREST/postgrest/commit/3cf565614d6529b4f691699b254fd3a680dfcda0#diff-f6074642dc32a94e9e1997b21b0df631b2d3f098e0a81673edc22b89a86f49edL534-L538. But that was only done to reuse the exponential backoff retrying from the connection worker, I didn't know it would have this effect. I guess the kerberos lib doesn't work with parallel connections?

wolfgangwalther commented 3 weeks ago

Hm, not sure whether this is about kerberos (but yes, could be). Could also be about libpq itself:

https://www.postgresql.org/docs/current/libpq-threading.html

One thread restriction is that no two threads attempt to manipulate the same PGconn object at the same time. In particular, you cannot issue concurrent commands from different threads through the same connection object.

Maybe something like that is happening accidentally?

wolfgangwalther commented 3 weeks ago

Oh, on the same page later on:

If you are using Kerberos inside your application (in addition to inside libpq), you will need to do locking around Kerberos calls because Kerberos functions are not thread-safe. See function PQregisterThreadLock in the libpq source code for a way to do cooperative locking between libpq and your application.

Ah, that doesn't seem relevant, though, because we are not using kerberos additionally. But it could explain why libpq throws up in that area when maybe the thing from above happens?

wolfgangwalther commented 3 weeks ago

I tested the current main branch with disabled gssapi / krb5. This makes the error go away, too.

The question is whether we can find a way / configuration where we can have your changes and gssapi-support?

wolfgangwalther commented 3 weeks ago

For now, I pushed a revert of the GSSAPI commit. I will try this again, once we get a truly static libpq build via pkgsStatic - this could be related.

steve-chavez commented 3 weeks ago

Oh wait.. I have a fix ready now that brings back waiting between the listener and connection worker..

wolfgangwalther commented 3 weeks ago

Happy to have that in, if it improves the code. But I would still not want to release the GSSAPI change right now. I'm not 100% convinced that this can not cause problems elsewhere, maybe when two pool connections start up at the same time or something like that.

I actually kind of like the fact that we have a reproducer for this case, so I'm fine with keeping it as is.

steve-chavez commented 2 weeks ago

Yeah, I think it's worth adding it back. If anything it reduces wastefully retrying both listener and the connection worker when postgres is completely down. Doing it on https://github.com/PostgREST/postgrest/pull/3573.

maybe when two pool connections start up at the same time or something like that.

Right, that could still be trouble.