docker / compose

Define and run multi-container applications with Docker
https://docs.docker.com/compose/

Is there a way to delay container startup to support dependent services with a longer startup time #374

Closed · dancrumb closed this 7 years ago

dancrumb commented 10 years ago

I have a MySQL container that takes a little time to start up as it needs to import data.

I have an Alfresco container that depends upon the MySQL container.

At the moment, when I use fig, the Alfresco service inside the Alfresco container fails when it attempts to connect to the MySQL container... ostensibly because the MySQL service is not yet listening.

Is there a way to handle this kind of issue in Fig?

d11wtq commented 10 years ago

At work we wrap our dependent services in a script that checks whether the link is up yet. I know one of my colleagues would be interested in this too! Personally I feel it's a container-level concern to wait for services to be available, but I may be wrong :)
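A minimal sketch of such a wrapper, assuming a link alias of db exposing port 3306 (the environment variables below are the ones Docker links inject; adapt them to your service):

#!/bin/sh
# Hypothetical wrapper entrypoint: poll the linked service until its port
# accepts TCP connections, then hand off to the real command.
until nc -z "$DB_PORT_3306_TCP_ADDR" "$DB_PORT_3306_TCP_PORT"; do
    echo "$(date) - waiting for db..."
    sleep 1
done
exec "$@"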

nubs commented 10 years ago

We do the same thing with wrapping. You can see an example here: https://github.com/dominionenterprises/tol-api-php/blob/master/tests/provisioning/set-env.sh

bfirsh commented 10 years ago

It'd be handy to have an entrypoint script that loops over all of the links and waits until they're working before starting the command passed to it.

This should be built into Docker itself, but the solution is a way off. A container shouldn't be considered started until the link it exposes has opened.
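Such an entrypoint could, for example, derive every linked host:port pair from the environment variables Docker injects for links. A rough sketch (untested, and assuming nc is available in the image):

#!/bin/sh
# Hypothetical generic entrypoint: wait on every *_TCP_ADDR/*_TCP_PORT pair
# injected by Docker links, then exec the command passed to the container.
for prefix in $(env | sed -n 's/^\(.*\)_TCP_ADDR=.*/\1/p'); do
    addr=$(printenv "${prefix}_TCP_ADDR")
    port=$(printenv "${prefix}_TCP_PORT")
    until nc -z "$addr" "$port"; do
        echo "$(date) - waiting for $addr:$port..."
        sleep 1
    done
done
exec "$@"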

dancrumb commented 10 years ago

@bfirsh that's more than I was imagining, but would be excellent.

A container shouldn't be considered started until the link it exposes has opened.

I think that's exactly what people need.

For now, I'll be using a variation on https://github.com/aanand/docker-wait

silarsis commented 10 years ago

Yeah, I'd be interested in something like this - I meant to post about it earlier.

The smallest-impact pattern I can think of that would fix this use case for us would be the following:

Add "wait" as a new key in fig.yml, with value semantics similar to link. fig would treat this as a prerequisite and wait until this container has exited before carrying on.

So, my fig.yml would look something like:

db:
  image: tutum/mysql:5.6

initdb:
  build: /path/to/db
  links:
    - db:db
  command: /usr/local/bin/init_db

app:
  links:
    - db:db
  wait:
    - initdb

On running app, fig would start up all the linked containers, then run the wait container, and only progress to the actual app container once the wait container (initdb) has exited. initdb would run a script that waits for the database to be available, then runs any initialisations/migrations/whatever, then exits, as sketched below.
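A minimal sketch of what that hypothetical init_db script might do (the link variables, credentials, and paths here are purely illustrative):

#!/bin/sh
# Hypothetical init_db: block until the linked db accepts connections,
# run the one-off initialisation, then exit so dependants may start.
until nc -z "$DB_PORT_3306_TCP_ADDR" "$DB_PORT_3306_TCP_PORT"; do
    sleep 1
done
mysql -h "$DB_PORT_3306_TCP_ADDR" -u root < /docker-init/schema.sql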

Those are my thoughts, anyway.

dnephin commented 10 years ago

(revised, see below)

dsyer commented 10 years ago

+1 here too. It's not very appealing to have to do this in the commands themselves.

jcalazan commented 10 years ago

+1 as well. Just ran into this issue. Great tool btw, makes my life so much easier!

arruda commented 10 years ago

+1 would be great to have this.

prologic commented 10 years ago

+1 also. Recently ran into the same set of problems.

chymian commented 10 years ago

+1 also. Any statement from the Docker guys?

codeitagile commented 10 years ago

I am writing wrapper scripts as entrypoints to synchronise at the moment. I'm not sure having a mechanism in fig is wise if you have other targets for your containers that perform orchestration a different way. It seems very application-specific to me, and as such the responsibility of the containers doing the work.

prologic commented 10 years ago

After some thought and experimentation I do kind of agree with this.

As such, an application I'm building has a synchronous waitfor(host, port) function that lets me wait for the services the application depends on (either detected via the environment or configured explicitly via CLI options).
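In shell, such a helper might be as small as this (a sketch, assuming nc is available):

# Hypothetical waitfor(host, port): block until a TCP connection succeeds.
waitfor() {
    until nc -z "$1" "$2"; do
        sleep 1
    done
}

waitfor db 3306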


shuron commented 10 years ago

Yes, some basic "depends on" is needed here... so if you have 20 containers, you just want to run fig up and have everything start in the correct order. It should also have a timeout option or some other failure-catching mechanism.

ahknight commented 10 years ago

Another +1 here. I have Postgres taking longer than Django to start so the DB isn't there for the migration command without hackery.

dnephin commented 10 years ago

@ahknight interesting, why is migration running during run ?

Don't you want to actually run migrate during the build phase? That way you can start up fresh images much faster.

ahknight commented 10 years ago

There's a larger startup script for the application in question, alas. For now, we're doing non-DB work first, using nc -w 1 in a loop to wait for the DB, then doing DB actions. It works, but it makes me feel dirty(er).

dnephin commented 10 years ago

I've had a lot of success doing this work during the fig build phase. I have one example of this with a django project (still a work in progress, though): https://github.com/dnephin/readthedocs.org/blob/fig-demo/dockerfiles/database/Dockerfile#L21

No need to poll for startup. Although I've done something similar with mysql, where I did have to poll for startup because the mysqld init script wasn't doing it already. This postgres init script seems to be much better.
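That polling might look something like this inside the build step (a sketch; mysqladmin ships with the MySQL client tools):

# Poll until mysqld is actually accepting connections; mysqladmin ping
# exits 0 once the server responds.
until mysqladmin ping --silent; do
    sleep 1
done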

arruda commented 10 years ago

Here is what I was thinking:

Using the idea from docker/docker#7445, could we implement a "wait_for_health_check" attribute in fig? That way it would be a fig issue, not a Docker one.

Is there any way of making fig check the TCP status on the linked container? If so, then I think this is the way to go. =)

docteurklein commented 10 years ago

@dnephin can you explain a bit more about what you're doing in your Dockerfiles to help with this? Isn't the build phase unable to influence the runtime?

dnephin commented 10 years ago

@docteurklein I can. I fixed the link from above (https://github.com/dnephin/readthedocs.org/blob/fig-demo/dockerfiles/database/Dockerfile#L21)

The idea is that you do all the slower "setup" operations during the build, so you don't have to wait for anything during container startup. In the case of a database or search index, you would:

  1. start the service
  2. create the users, databases, tables, and fixture data
  3. shutdown the service

all as a single build step. Later, when you fig up the database container, it's ready to go almost immediately, and you also get to take advantage of the Docker build cache for these slower operations. A sketch of such a build step is below.
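As a concrete sketch of that build step for postgres (the database name and fixture path are illustrative, and init-script locations vary by base image):

# Hypothetical single build step, e.g. one RUN instruction in a Dockerfile:
# start postgres, create the user/database and load fixtures, then stop it,
# so the committed image layer already contains an initialised database.
/etc/init.d/postgresql start \
    && sudo -u postgres createuser docs \
    && sudo -u postgres createdb -O docs docs \
    && sudo -u postgres psql docs < /fixtures/docs.sql \
    && /etc/init.d/postgresql stop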

docteurklein commented 10 years ago

nice! thanks :)

arruda commented 10 years ago

@dnephin nice, hadn't thought of that.

oskarhane commented 10 years ago

+1 This is definitely needed. An ugly time delay hack would be enough in most cases, but a real solution would be welcome.

dnephin commented 10 years ago

Could you give an example of why/when it's needed?

dacort commented 10 years ago

In the use case I have, I have an Elasticsearch server and then an application server that's connecting to Elasticsearch. Elasticsearch takes a few seconds to spin up, so I can't simply do a fig up -d because the application server will fail immediately when connecting to the Elasticsearch server.

ddossot commented 10 years ago

Say one container starts MySQL and the other starts an app that needs MySQL, and it turns out the app starts faster. We get transient fig up failures because of that.

oskarhane commented 10 years ago

crane has a way around this by letting you create groups that can be started individually. So you can start the MySQL group, wait 5 secs, and then start the other stuff that depends on it.
Works on a small scale, but not a real solution.

arruda commented 10 years ago

@oskarhane I'm not sure this "wait 5 secs" helps: in some cases you might need to wait longer (or just can't be sure it won't go over the 5 secs)... it isn't very safe to rely on timed waits. Also, you'd have to do this waiting and loading of the other group manually, and that's kind of lame; fig should do that for you =/

aanand commented 10 years ago

@oskarhane, @dacort, @ddossot: Keep in mind that, in the real world, things crash and restart, network connections come and go, etc. Whether or not Fig introduces a convenience for waiting on a TCP socket, your containers should be resilient to connection failures. That way they'll work properly everywhere.

ddossot commented 10 years ago

You are right, but until we fix all pre-existing apps to do things like gracefully recover from the absence of their critical resources (like the DB) on start (which is a Great Thing™ but unfortunately seldom supported by frameworks), we should use fig start to start individual containers in a certain order, with delays, instead of fig up.

I can see a shell script coming to control fig to control docker :wink:

anentropic commented 9 years ago

I am OK with this not being built into fig, but some advice on best practice for waiting on readiness would be good.

I saw that in some code linked from an earlier comment this was done:

while ! exec 6<>/dev/tcp/${MONGO_1_PORT_27017_TCP_ADDR}/${MONGO_1_PORT_27017_TCP_PORT}; do
    # /dev/tcp/<host>/<port> is a bash builtin redirection, not a real path
    echo "$(date) - still trying to connect to mongo at ${TESTING_MONGO_URL}"
    sleep 1
done

In my case there is no /dev/tcp path though - apparently it's a bash-only feature rather than something every shell provides(?) - I'm on Ubuntu.

I found instead this method which seems to work ok:

until nc -z postgres 5432; do
    echo "$(date) - waiting for postgres..."
    sleep 1
done

This seems to work, but I don't know enough about such things to know if it's robust... is there a possible race condition between the port showing up to nc and the postgres server really being able to accept commands?
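For postgres specifically there is a protocol-level probe that sidesteps that question: pg_isready (shipped with postgres 9.3+) only reports success once the server is genuinely accepting connections, not merely listening. A sketch, assuming the tool is available in the polling container:

# pg_isready speaks the postgres wire protocol, so it fails while the
# server is still starting up, unlike a bare TCP check.
until pg_isready -h postgres -p 5432; do
    echo "$(date) - waiting for postgres..."
    sleep 1
done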

I'd be happier if it were possible to invert the check - instead of polling from the dependent containers, is it possible to send a signal from the target (i.e. postgres server) container to all the dependents?

Maybe it's a silly idea, anyone have any thoughts?

aanand commented 9 years ago

@anentropic Docker links are one-way, so polling from the downstream container is currently the only way to do it.

does anyone know if there's there a possible race condition between port showing up to nc and postgres server really able to accept commands?

There's no way to know in the general case - it might be true for postgres, it might be false for other services - which is another argument for not doing it in Fig.

mindnuts commented 9 years ago

@aanand I tried using your docker/wait image approach, but I am not sure what is happening. Basically I have this "Orientdb" container which a lot of other NodeJS app containers link to. The orientdb container takes some time to start listening on the TCP port, and this makes the other containers get a "Connection refused" error.

I hoped that by linking the wait container to orientdb I would not see this error. But unfortunately I am still getting it randomly. Here is my setup (Docker version 1.4.1, fig 1.0.1 on an Ubuntu 14.04 box):

orientdb:
    build: ./Docker/orientdb
    ports:
        -   "2424:2424"
        -   "2480:2480"
wait:
    build: ./Docker/wait
    links:
        - orientdb:orientdb
....
core:
    build:  ./Docker/core
    ports:
        -   "3000:3000"
    links:
        -   orientdb:orientdb
        -   nsqd:nsqd

Any help is appreciated. Thanks.

aanand commented 9 years ago

@mindnuts the wait image is more of a demonstration; it's not suitable for use in a fig.yml. You should use the same technique (repeated polling) in your core container to wait for the orientdb container to start before kicking off the main process.

MrMMorris commented 9 years ago

+1, just started running into this as I am pulling custom-built images vs. building them in the fig.yml. Node app failing because mongodb is not ready yet...

kennu commented 9 years ago

I just spent hours debugging why MySQL was reachable when starting WordPress manually with Docker, and why it was offline when starting with Fig. Only now did I realize that Fig always restarts the MySQL container whenever I start the application, so the WordPress entrypoint.sh dies, not yet being able to connect to MySQL.

I added my own overridden entrypoint.sh that waits for 5 seconds before executing the real entrypoint.sh. But clearly this is a use case that needs a general solution, if it's supposed to be easy to launch a MySQL+WordPress container combination with Docker/Fig.
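That override amounts to something like the following (a sketch; the path of the image's real entrypoint varies):

#!/bin/sh
# Crude workaround: give MySQL a head start, then hand off to the image's
# real entrypoint. A fixed sleep is fragile; polling would be safer.
sleep 5
exec /entrypoint.sh "$@"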

dnephin commented 9 years ago

so the WordPress entrypoint.sh dies not yet being able to connect to MySQL.

I think this is an issue with the WordPress container.

While I was initially a fan of this idea, after reading https://github.com/docker/docker/issues/7445#issuecomment-56391294, I think such a feature would be the wrong approach, and actually encourages bad practices.

There seem to be two cases which this issue aims to address:

  1. A dependency service needs to be available to perform some initialization.

     Any container initialization should really be done during build. That way it is cached, and the work doesn't need to be repeated by every user of the image.

  2. A dependency service needs to be available so that a connection can be opened.

     The application should really be resilient to connection failures and retry the connection.

kennu commented 9 years ago

I suppose the root of the problem is that there are no ground rules as to whose responsibility it is to wait for services to become ready. But even if there were, I think it's a bit unrealistic to expect that developers would add database connection retrying to every single initialization script. Such scripts are often needed to prepare empty data volumes that have just been mounted (e.g. create the database).

The problem would actually be much less obtrusive if Fig didn't always restart linked containers (i.e. the database server) when restarting the application container. I don't really know why it does that.

aanand commented 9 years ago

The problem would actually be much less obtrusive if Fig didn't always restart linked containers (i.e. the database server) when restarting the application container. I don't really know why it does that.

Actually it doesn't just restart containers, it destroys and recreates them, because it's the simplest way to make sure changes to fig.yml are picked up. We should eventually implement a smarter solution that can compare "current config" with "desired config" and only recreate what has changed.

Getting back to the original issue, I really don't think it's unrealistic to expect containers to have connection retry logic - it's fundamental to designing a distributed system that works. If different scripts need to share it, it should be factored out into an executable (or a language-specific module if you're not using shell), so each script can just invoke waitfor db at the top. A sketch of such an executable is below.
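One way that executable might look (hypothetical; install it as e.g. /usr/local/bin/waitfor):

#!/bin/sh
# Hypothetical waitfor executable: usage "waitfor HOST PORT".
# Polls until the port accepts TCP connections, failing after 30 tries.
host="$1"
port="$2"
tries=30
until nc -z "$host" "$port"; do
    tries=$((tries - 1))
    if [ "$tries" -le 0 ]; then
        echo "timed out waiting for $host:$port" >&2
        exit 1
    fi
    sleep 1
done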

docteurklein commented 9 years ago

@kennu what about --no-recreate ? /cc @aanand

kennu commented 9 years ago

@aanand I meant the unrealism comment from the point of view that the Docker Hub is already full of published images that probably don't handle connection retrying in their initialization scripts, and that it would be quite an undertaking to get everybody to add it. But I guess it could be done if Docker Inc published some kind of official guidelines/requirements.

Personally I'd rather keep containers/images simple though and let the underlying system worry about resolving dependencies. In fact, Docker's restart policy might already solve everything (if the application container fails to connect to the database, it will restart and try again until the database is available).

But relying on the restart policy means that it should be enabled by default; otherwise people spend hours debugging the problem (like I just did). E.g. Kubernetes defaults to RestartPolicyAlways for pods.
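For reference, that restart policy is set per container at the Docker level (the image and link names here are illustrative):

# If the app exits because the database wasn't ready, the Docker daemon
# restarts it, up to 10 times, until it comes up successfully.
docker run --restart=on-failure:10 --link db:db myapp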

MrMMorris commented 9 years ago

Any progress on this? I would like to echo that expecting all Docker images to change, and the entire community to implement connection retry practices, is not reasonable. Fig is a Docker orchestration tool, and the problem lies in the order it does things, so the change needs to be made in Fig, not in Docker or the community.

dnephin commented 9 years ago

expecting all docker images to change and the entire community implement connection retry practices is not reasonable

It's not that an application should need to retry because of docker or fig. Applications should be resilient to dropped connections because the network is not reliable. Any application should already be built this way.

I personally haven't had to implement retries in any of my containers, and I also haven't needed any delay or waiting on startup. I believe most cases of this problem fall into these two categories. (My use of "retry" is probably not great here; I meant more that the application would re-establish a connection if the connection was closed, not necessarily poll for some period attempting multiple times.)

If you make sure that all initialization happens during the "build" phase, and that connections are re-established on the next request, you won't need to retry (or wait on other containers to start). If connections are opened lazily (when the first request is made) instead of eagerly (during startup), I suspect you won't need to retry at all.

the problem lies in the order [fig] does things

I don't see any mention of that in this discussion so far. Fig orders startup based on the links specified in the config, so it should always start containers in the right order. Can you provide a test case where the order is incorrect?

thaJeztah commented 9 years ago

I have to agree with @dnephin here. Sure, it would be convenient if compose/fig was able to do some magic and check availability of services, however, what would the expected behavior be if a service doesn't respond? That really depends on the requirements of your application/stack. In some cases, the entire stack should be destroyed and replaced with a new one, in other cases a failover stack should be used. Many other scenarios can be thought of.

Compose/Fig cannot make these decisions, and monitoring services should be the responsibility of the applications running inside the container.

kennu commented 9 years ago

I would like to suggest that @dnephin has merely been lucky. If you fork two processes in parallel, one of which will connect to a port that the other will listen to, you are essentially introducing a race condition; a lottery to see which process happens to initialize faster.

I would also like to repeat the WordPress initialization example: it runs a startup shell script that creates a new database if the MySQL container doesn't yet have it (this can't be done when building the Docker image, since it depends on the externally mounted data volume). Such a script becomes significantly more complex if it has to distinguish generic database errors from "database is not yet ready" errors and implement some sane retry logic within the shell script. I consider it highly likely that the author of the image will never actually test the startup script against said race condition.

Still, Docker's built-in restart policy provides a workaround for this, if you're ready to accept that containers sporadically fail to start and regularly print errors in logs. (And if you remember to turn it on.)

Personally, I would make Things Just Work, by making Fig autodetect which container ports are exposed to a linked container, ping them before starting the linked container (with a sane timeout), and ultimately provide a configuration setting to override/disable this functionality.

thaJeztah commented 9 years ago

this can't be done when building the Docker image, since it's dependent on the externally mounted data volume

True. An approach here is to start just the database container once (if needed, with a different entrypoint/command), to initialise the database, or use a data-only container for the database, created from the same image as the database container itself.

Such a script becomes significantly more complex if it has to distinguish generic database errors from "database is not yet ready" errors

Compose/Fig would run into the same issue there: how do you check whether MySQL is up and accepting connections? (And PostgreSQL, and (insert your service here)?) Also, where should the "ping" be executed from - inside the container you're starting, or from the host?

As far as I can tell, the official WordPress image includes a check to see if MySQL is accepting connections in the docker-entrypoint.sh

kennu commented 9 years ago

@thaJeztah "Add some simple retry logic in PHP for MySQL connection errors" authored by tianon 2 days ago - Nice. :-) Who knows, maybe this will become a standard approach after all, but I still have my doubts, especially about this kind of retry implementations actually having being tested by all image authors.

About the port pinging - I can't say offhand what the optimal implementation would be. I guess maybe simple connection checking from a temporary linked container and retrying while getting ECONNREFUSED. Whatever solves 80% (or possibly 99%) of the problems, so users don't have to solve them by themselves again and again every time.
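Such a temporary linked container could be as simple as this (a sketch; busybox is illustrative and assumes an image whose nc supports -z):

# One-shot helper: linked to the database, it exits 0 once a TCP
# connection succeeds, so a script can gate fig start on it.
docker run --rm --link db:db busybox \
    sh -c 'until nc -z db 3306; do sleep 1; done'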

thaJeztah commented 9 years ago

@kennu Ah! Thanks, I wasn't aware it had just been added recently; I only checked the script now because of this discussion.

To be clear, I understand the problems you're having, but I'm not sure Compose/Fig would be able to solve them in a clean way that works for everyone (and reliably). I understand many images on the registry don't have "safeguards" in place to handle these issues, but I doubt it's Compose/Fig's responsibility to fix that.

thaJeztah commented 9 years ago

Having said the above, I do think it would be a good thing to document this in the Dockerfile best practices section.

People should be made aware of this, and some examples should be added to illustrate how to handle service "outages", including a link to the Wikipedia article that @dnephin mentioned (and possibly other sources) for reference.