FusionAuth / fusionauth-issues

FusionAuth issue submission project
https://fusionauth.io

[Bug]: Race condition on startup #2776

Closed jhughes2112 closed 2 weeks ago

jhughes2112 commented 3 weeks ago

What happened?

I was trying to get this configuration running today, in a batch file (commands split out to separate lines for legibility):

docker run -d --rm --name db 
    -e PGDATA=/var/lib/postgresql/data/pgdata 
    -e POSTGRES_USER=pgroot 
    -e POSTGRES_PASSWORD=pgrootpass 
    postgres:16.3-alpine

docker run -d --rm --name fusionauth
    -e DATABASE_URL="jdbc:postgresql://db:5432/fusionauth" 
    -e DATABASE_ROOT_USERNAME=pgroot 
    -e DATABASE_ROOT_PASSWORD=pgrootpass 
    -e FUSIONAUTH_APP_URL=http://localhost:9011/ 
    -e FUSIONAUTH_APP_MEMORY=512M 
    -e FUSIONAUTH_APP_RUNTIME_MODE=development 
    -e FUSIONAUTH_SEARCH_ENGINE_TYPE=database 
    -e ES_JAVA_OPTS="-Xms512m -Xmx512m" 
    fusionauth/fusionauth-app:1.51.0

This worked one time, presumably because the time spent pulling the docker images gave postgres a head start. It then failed another 20 times in a row with a bunch of maintenance mode failures. I made sure there was no cached data between runs (the above is slightly simplified and does not include the mounted volumes).

What I discovered is that if you create a fresh DB in a batch file just before running FusionAuth, you get a "maintenance mode" failure and a bunch of errors about logging in as the "fusionauth" user on postgres. If you then do a docker restart fusionauth, it works and creates the DB as expected.

Adding a timeout /t 10 between the two commands, delaying execution by 10 seconds, gives postgres enough time to finish its initialization and accept connections properly.
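For reference, a minimal sketch of that workaround as a Windows batch file, assuming the two docker run commands above; the 10-second value and the optional pg_isready polling loop are illustrative only, not something from the FusionAuth docs:

rem ... start the db container exactly as above ...

rem Fixed delay: give postgres time to finish its first-run initialization.
timeout /t 10 /nobreak >nul

rem Alternative to the fixed delay (assumes pg_isready is present in the image,
rem as it is in postgres:16.3-alpine): poll until the server accepts connections.
:wait_for_db
docker exec db pg_isready -U pgroot >nul 2>&1
if errorlevel 1 (
    timeout /t 2 /nobreak >nul
    goto wait_for_db
)

rem ... start the fusionauth container exactly as above ...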

Suggestion

It would be better if the part of startup that communicates with the DB were clever enough to realize it cannot log in with the credentials it expects AFTER they were created, go back through the initial connection phase, and try to create them again. Just guessing at how postgres works, but it's likely the DB came up, accepted connections and a few commands, then shut back down before any of them were executed, and started back up with a blank slate.

Try running these two commands in a batch file. You'll probably find the same issue is very easy to reproduce.

Thanks!

Version

1.51.0

Affects Versions

All back to the dawn of time

robotdan commented 3 weeks ago

Thanks for using FusionAuth!

We would normally recommend using something like Helm or Docker Compose to orchestrate dependent docker services, since they handle this better.

Have you tried using the Docker Compose example? This example has health checks so that each service can wait until the dependent service is available.
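For illustration only, a health-checked compose file along those lines could look like the sketch below; this is not the official FusionAuth example, and the credentials, intervals, and ports are placeholders taken from the commands earlier in this issue:

services:
  db:
    image: postgres:16.3-alpine
    environment:
      POSTGRES_USER: pgroot
      POSTGRES_PASSWORD: pgrootpass
    healthcheck:
      # pg_isready exits 0 once the server is accepting connections
      test: ["CMD-SHELL", "pg_isready -U pgroot"]
      interval: 5s
      timeout: 5s
      retries: 10

  fusionauth:
    image: fusionauth/fusionauth-app:1.51.0
    depends_on:
      db:
        # compose waits for the db healthcheck to pass before starting this service
        condition: service_healthy
    environment:
      DATABASE_URL: jdbc:postgresql://db:5432/fusionauth
      DATABASE_ROOT_USERNAME: pgroot
      DATABASE_ROOT_PASSWORD: pgrootpass
      FUSIONAUTH_APP_RUNTIME_MODE: development
      FUSIONAUTH_SEARCH_ENGINE_TYPE: database
    ports:
      - "9011:9011"

The depends_on condition is what makes compose hold back the fusionauth container until the db healthcheck reports healthy.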

That said, perhaps we could improve this. The general assumption is that the db is always available when we start up, but we could consider allowing for some wait when not in a production runtime mode, so that the user wouldn't have to account for this delay during development.

jhughes2112 commented 3 weeks ago

Thanks for the fast response. Our workflow spans five different environments, and docker compose is not one of them. Our usual workflow allows starting or stopping services and optionally deleting all the data volumes as well. Tearing everything down happens in most of our environments, even kubernetes. We have seen the same behavior from postgres and FusionAuth in all of them.

From what I have seen, postgres is actually AVAILABLE momentarily while initializing itself for the first time, which means it passes a typical health check. But by the time FusionAuth tries to use the DB and user it created, it fails, because postgres threw away those commands while it was restarting. That's guesswork on my part, but it matches the behavior I can observe.

To be fair, this looks like it's not a FusionAuth issue at all, but since postgres is your default, and this problem has been reported for years without being resolved, I thought you might want to understand the circumstances.

robotdan commented 2 weeks ago

Thanks for the context @jhughes2112.

The tricky part of trying to handle this in the product is that it means we have to account for this failure, wait, and retry. In general we avoid this tactic and expect all dependent services to be available on startup, so we can start fast and fail fast.

I don't know that there is much we can do in practice, since there are off-the-shelf infrastructure solutions to this such as Helm and Docker Compose, among others. Unfortunately it sounds like those solutions aren't super helpful for you. 😔