boustrophedon / pgtemp

Rust library and daemon for easily starting postgres databases per-test without Docker
MIT License

Failure to set up database due to race between Postgres startup and `createdb` #7

Open reivilibre opened 2 months ago

reivilibre commented 2 months ago

Hi, thanks for writing this! I was thinking of such a thing months ago, and it's always nice to see that someone else wrote what I wanted :D

The only problem I'm having is that occasionally the tests fail to start up. This is much more reproducible when my backup software is running in the background, putting a load on the system (and disk?).

The output I get most often is as follows:

```
createdb failed! stdout:

stderr: createdb: error: connection to server at "localhost" (127.0.0.1), port 41499 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?

thread 'tests::test_oidc_auth_flow::test_userinfo_bad_auth' panicked at /home/rei/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pgtemp-0.3.0/src/lib.rs:378:14:
failed to start pgtemp server: JoinError::Panic(Id(2), ...)
```

Occasionally I get the following instead:

```
stderr: createdb: error: connection to server at "localhost" (127.0.0.1), port 35153 failed: FATAL:  the database system is starting up
```

I think these errors mean that Postgres hasn't finished starting up yet, but pgtemp is already trying to run createdb.

I'm not sure if there's a perfect way of waiting for startup to happen (I guess you could parse the logs?), so perhaps just trying createdb a few times in a loop with a small delay is the way to go.

I'm using pgtemp v0.3.0.

boustrophedon commented 2 months ago

Hi, thanks for reporting this! This is indeed a known issue: there's currently a 100ms sleep and a TODO in the code.

There are a few ways to fix this: read the logs as you mentioned, use (or reimplement) pg_isready, make the startup delay configurable, or retry createdb some number of times before actually failing (as you also mentioned).

Reading the logs is probably the best option, but it entails something like spawning a new thread to consume the subprocess's piped output so the pipe doesn't block, and then the rest of the control flow gets messy.

It might be easiest to add a loop with a timeout/failure counter that calls pg_isready. It seems that all it really does is call PQpingParams (pg docs for PQpingParams), so another option would be to replace the createdb call with actually connecting to the db and running CREATE DATABASE. Depending on how complicated pg's wire protocol is, we might be able to get away with not taking an extra dependency and just crafting the packets ourselves.
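
For illustration, a minimal sketch of the pg_isready polling idea (the function name, retry count, and delay here are made up, not pgtemp's actual values):

```rust
use std::process::Command;
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper: poll pg_isready until the server accepts
/// connections or we give up.
fn wait_until_ready(port: u16) -> Result<(), String> {
    for _ in 0..50 {
        let status = Command::new("pg_isready")
            .arg("--host").arg("localhost")
            .arg("--port").arg(port.to_string())
            .status()
            .map_err(|e| format!("failed to spawn pg_isready: {e}"))?;
        // pg_isready exits 0 when the server is accepting connections,
        // 1 while it's rejecting them (e.g. still starting up), 2 on no response.
        if status.success() {
            return Ok(());
        }
        sleep(Duration::from_millis(100));
    }
    Err("postgres did not become ready in time".into())
}
```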

If you'd like to make a PR that just retries createdb 5 times with the existing 100ms delay and prints out stderr/stdout from the last failure, I'd be glad to accept it. Maybe add a TODO with a link to this issue as well.
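
Something like this minimal sketch, perhaps (create_db_with_retries is a hypothetical name; the flags mirror plain createdb usage):

```rust
use std::process::Command;
use std::thread::sleep;
use std::time::Duration;

/// Sketch of the retry approach: run createdb up to 5 times with the
/// existing 100ms delay, surfacing stdout/stderr from the last failure.
/// TODO: replace with real readiness detection, see issue #7.
fn create_db_with_retries(port: u16, dbname: &str) -> Result<(), String> {
    let mut last = None;
    for _ in 0..5 {
        sleep(Duration::from_millis(100));
        let output = Command::new("createdb")
            .arg("--host").arg("localhost")
            .arg("--port").arg(port.to_string())
            .arg(dbname)
            .output()
            .map_err(|e| format!("failed to spawn createdb: {e}"))?;
        if output.status.success() {
            return Ok(());
        }
        last = Some(output);
    }
    let output = last.expect("loop ran at least once");
    Err(format!(
        "createdb failed! stdout: {}\nstderr: {}",
        String::from_utf8_lossy(&output.stdout),
        String::from_utf8_lossy(&output.stderr),
    ))
}
```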

kaaveland commented 2 months ago

I have something that works here: https://github.com/kaaveland/eugene/blob/344a5915722bce810bfe44b8955bbde21c53ab3d/eugene/src/tempserver.rs#L71

It's based on the log idea and indeed does spawn a second thread to watch it, sending logs back using a channel. I've been meaning to replace this file with calls to pgtemp eventually, so I could take a stab at a PR that implements the same kind of solution here later this summer?

kaaveland commented 2 months ago

Or if the code from that file works for pgtemp, just take it - it's MIT licensed.

boustrophedon commented 2 months ago

Oh, this was the project on HN. Thanks again for the mention in your blog post :)

If you'd like to make the PR for the log watching code at some point that would be great. If other people mention this is an issue for them I might make a quick PR to just do the createdb retry loop.

The log-watching code looks good; the only thing I'd want in pgtemp is a way to keep the logs rather than discarding them. That might mean wrapping a String in an Arc<Mutex<_>>, or just collecting the logs in a separate thread and then sending them on a channel when the process exits. So we'd have one channel for "we saw the ready-to-accept message" and one channel for "here's the full logfile".
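
A rough sketch of that two-channel shape (watch_logs is a hypothetical helper; it assumes the server's log pipe was captured with Stdio::piped(), and that the startup line still reads "database system is ready to accept connections"):

```rust
use std::io::{BufRead, BufReader, Read};
use std::sync::mpsc::Sender;

/// Consume the server's piped log output on a dedicated thread: signal on
/// `ready_tx` when the startup message appears, and send the accumulated
/// log on `log_tx` once the pipe closes (i.e. the process exited). Note
/// that postgres writes its logs to stderr by default, so `pipe` may be
/// the child's stderr rather than stdout.
fn watch_logs(pipe: impl Read, ready_tx: Sender<()>, log_tx: Sender<String>) {
    let mut full_log = String::new();
    let mut signaled = false;
    for line in BufReader::new(pipe).lines() {
        let Ok(line) = line else { break };
        if !signaled && line.contains("database system is ready to accept connections") {
            // The receiver may already be gone; that's fine, keep collecting.
            let _ = ready_tx.send(());
            signaled = true;
        }
        full_log.push_str(&line);
        full_log.push('\n');
    }
    let _ = log_tx.send(full_log);
}
```

The watcher itself would then just be std::thread::spawn(move || watch_logs(pipe, ready_tx, log_tx)), with the startup path blocking on the ready channel (with a timeout) instead of the fixed sleep.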

The only concern might be spawning too many threads, but I suspect you'd run into problems with spawning too many databases before the extra threads become an issue.