Netflix / lemur-docker

Docker files for the Lemur certificate orchestration tool

Upgrade to 0.4.0 #10

Closed: filipposc5 closed this 7 years ago

filipposc5 commented 7 years ago

I had to upgrade to 0.4.0 to try it out, so I figured I'd send it through.

Changes

Regarding the first point: it will not exit on failure. Maybe you want the retry timeout increased and an exit with a more graceful message? I didn't want to change the behaviour too much.

If you want me to split this into separate PRs, I can do that too.

kevgliss commented 7 years ago

Do you happen to know why/when the database fails to initialize? I wonder if we can fix the root issue instead of adding re-try logic. I haven't experienced this myself.

filipposc5 commented 7 years ago

I had a fairly high load on a MacBook Pro with an encrypted disk, and this is what happens sometimes (I did not re-run the test; this is copied from an earlier run in my history):

Successfully built e18ae565c773
Creating lemurdocker_postgres_1
Creating lemurdocker_lemur-web_1
Creating lemurdocker_lemur-nginx_1
Attaching to lemurdocker_postgres_1, lemurdocker_lemur-web_1, lemurdocker_lemur-nginx_1
postgres_1     | The files belonging to this database system will be owned by user "postgres".
postgres_1     | This user must also own the server process.
postgres_1     |
postgres_1     | The database cluster will be initialized with locale "en_US.utf8".
postgres_1     | The default database encoding has accordingly been set to "UTF8".
postgres_1     | The default text search configuration will be set to "english".
postgres_1     |
postgres_1     | Data page checksums are disabled.
postgres_1     |
postgres_1     | fixing permissions on existing directory /var/lib/postgresql/data ... ok
postgres_1     | creating subdirectories ... ok
postgres_1     | selecting default max_connections ... 100
lemur-web_1    | Waiting for db to become available
postgres_1     | selecting default shared_buffers ... 128MB
postgres_1     | selecting dynamic shared memory implementation ... posix
lemur-web_1    | psql: could not connect to server: Connection refused
lemur-web_1    |    Is the server running on host "postgres" (172.17.0.2) and accepting
lemur-web_1    |    TCP/IP connections on port 5432?
lemur-web_1    | Attempt to connect to db: $
postgres_1     | creating configuration files ... ok
lemur-web_1    | psql: could not connect to server: Connection refused
lemur-web_1    |    Is the server running on host "postgres" (172.17.0.2) and accepting
lemur-web_1    |    TCP/IP connections on port 5432?
postgres_1     | running bootstrap script ... ok
lemur-web_1    | Attempt to connect to db: $
postgres_1     | performing post-bootstrap initialization ... ok
lemur-web_1    | psql: could not connect to server: Connection refused
lemur-web_1    |    Is the server running on host "postgres" (172.17.0.2) and accepting
lemur-web_1    |    TCP/IP connections on port 5432?
lemur-web_1    | Attempt to connect to db: $
postgres_1     | syncing data to disk ... ok
postgres_1     |
postgres_1     | Success. You can now start the database server using:
postgres_1     |
kevgliss commented 7 years ago

I see, in that case I would say we could leave the re-try logic but then exit if unsuccessful with an error. There's not really much point continuing with the application bootup if there is no database.

filipposc5 commented 7 years ago

Just to clarify, it happens all the time under high load. This is the full unmodified output (with the retry commit removed): https://gist.github.com/filipposc5/e34b150c628e02054602ede9ad8746e4

EDIT: from the user's side, the furthest you get is trying to log in and receiving a 401, if I recall correctly.

filipposc5 commented 7 years ago

Actually, I just realised something: I had to nuke my VirtualBox VM (its disk was full), so maybe there is additional overhead because it is allocating disk for the first time (VirtualBox VM disk space is traditionally allocated lazily).

kevgliss commented 7 years ago

By "all the time", do you mean even after the containers are up? Your gist seems to indicate that under high load docker compose up can fail at times because postgres does not get created correctly. I'm not super familiar with Docker, but is it possible to block the web container until we know for certain that postgres is up and healthy?

filipposc5 commented 7 years ago

Apologies for the confusion. I just meant that I was able to reliably reproduce the API starting faster than the DB under heavy load ("all the time under heavy load").

You got the point right: the postgres DB will eventually get created/initialised successfully, it is just that the API startup script (the psql utility) connects before the DB is up and listening for connections. Docker doesn't know that. I think it just starts the pg container, considers it started once the networking is set up, and moves on to starting the API container, which "boots" faster than pg. psql, on the other hand, can wait up to 2 minutes if the host is not available or the connect() does not complete, but if the host resolves and the connection is refused (our case), each command fails pretty fast. So, depending on timing, the create db / create role commands can fail (sometimes only whichever comes first fails).
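The race described above is usually handled with a bounded retry loop in the startup script. The following is a minimal sketch of that idea, not the script from this PR: the `retry` function name, the attempt count, the delay, and the psql connection flags in the commented usage are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not the script from this PR): run a command until it
# succeeds, trying at most max_attempts times and sleeping delay seconds
# between failed attempts. Returns 0 on success, 1 if every attempt failed.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "Attempt ${attempt}/${max_attempts} failed: $*" >&2
        # Only sleep if another attempt will follow.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Example use in an entrypoint (host, user, and limits are assumptions):
# retry 30 2 psql -h postgres -U postgres -c 'SELECT 1' >/dev/null || {
#     echo "Database not reachable, giving up" >&2
#     exit 1
# }
```

Because a refused connection makes psql fail quickly, the delay between attempts is what actually gives postgres time to finish initialising; the attempt limit keeps the container from hanging forever if the database never comes up.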

In general this is not very rare with docker.

I will adjust that script a bit for final review.

filipposc5 commented 7 years ago

I've updated the output a bit and made it exit 1 on failing to reach the db. I was tempted to add set -e to stop on error, but saw that it fails if the db has already been created, so I removed it. Less is more! Let me know if you have any suggestions for changes.
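One possible way to make the create step safe to re-run (and therefore compatible with set -e) is to check for the database before creating it. This is only a sketch of that alternative, not what the PR does: the `ensure_database` name, the example database name, and the connection flags are assumptions.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): create a database only if it
# does not already exist, so re-running the script does not fail.
# Host "postgres" and user "postgres" are assumptions.
ensure_database() {
    db_name=$1
    # -t: print rows only, -A: unaligned output, -c: run a single command.
    # The query prints "1" when a database with that name exists.
    if psql -h postgres -U postgres -tAc \
        "SELECT 1 FROM pg_database WHERE datname = '${db_name}'" | grep -q 1; then
        echo "Database ${db_name} already exists, skipping"
    else
        psql -h postgres -U postgres -c "CREATE DATABASE ${db_name}"
    fi
}

# Example (database name is an assumption):
# ensure_database lemur
```

The existence check runs inside an `if` condition, so a "not found" result does not trip set -e; only a failing CREATE DATABASE would stop the script.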

kevgliss commented 7 years ago

Looks good thanks!