TritonDataCenter / containerpilot

A service for autodiscovery and configuration of applications running in containers
Mozilla Public License 2.0

Impatience waiting for Node app to start causes lots of "exit status 7" #204

Closed: BobDickinson closed this issue 8 years ago

BobDickinson commented 8 years ago

I have an AutoPilot implementation based on https://github.com/autopilotpattern/workshop

My Node app container is run by ContainerPilot via a startup script (it has no backends managed by ContainerPilot, but does have a preStart script). Here is the containerpilot.json:

{
  "consul": "consul:8500",
  "preStart": "/bin/prestart-synchro.sh",
  "services": [
    {
      "name": "synchro",
      "port": 80,
      "health": "/usr/bin/curl -o /dev/null --fail -s http://localhost:80/health",
      "poll": 3,
      "ttl": 10
    }
  ]
}

There are cases where the Node process can take a while to start up (i.e. to be ready to accept connections), specifically when it has to pull in a bunch of data from a remote store. When I'm running in my desktop Docker environment and pointing at Manta storage over a slow-ish connection, for example, my Node process can take 15-20 seconds to start.

When this happens, I get one of these every 3 seconds:

synchro_1   | 2016/08/10 22:16:01 exit status 7

I'm assuming, given the format of the message and the fact that it arrives at the same interval as the polling interval I've specified, that this is coming from ContainerPilot: it isn't waiting for my Node process to complete startup before it starts running the health checks, and those fail with code 7 ("program is not running" or equivalent) because, well, the program is not running (yet).

Everything works fine once my service is up, but it's disconcerting to see that message over and over (this is a container/solution that our customers will deploy, which is why I'm sensitive to this).

Given that my Node server starts asynchronously, I don't think there's really a way for ContainerPilot to know when it has "started" (short of noticing that it's listening on the port). And certainly if my process doesn't respond to a health check at some point after launch, that's a problem. But the "exit status 7" seems to indicate some awareness of the state (I'm not sure why it isn't complaining a lot more if it really is a failed health check). Maybe there needs to be a setting along the lines of "don't complain about a non-running server / non-responsive health checks for xx seconds; after that, go ahead and complain".

BobDickinson commented 8 years ago

Actually, if the message just said "service not responding yet" instead of "exit status 7", I'd be happier.

tgross commented 8 years ago

Your assessment of the problem seems accurate. There's no way for ContainerPilot to determine whether an arbitrary application has "started." Judging from https://curl.haxx.se/libcurl/c/libcurl-errors.html, the exit status 7 is CURLE_COULDNT_CONNECT: curl couldn't connect to the server. You won't be able to get the behavior you want out of a simple curl invocation.
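
You can reproduce that exit code yourself with nothing listening on the port (a quick sanity check, not part of the fix):

# With no listener on port 80, curl exits with 7 (CURLE_COULDNT_CONNECT)
/usr/bin/curl -o /dev/null --fail -s http://localhost:80/health
echo $?    # prints 7 while nothing is accepting connections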

Take a look at the reload.sh script at https://github.com/autopilotpattern/workshop/blob/master/sales/reload.sh for a possible way to solve this problem. There, the script polls the port, which blocks the script from exiting until it can connect. So instead of using curl directly for your health check, try a script that does something like:

#!/bin/sh
# Block until the app is listening on port 80 AND answering its health endpoint.
while :
do
    netstat -ltn | grep -q ':80 ' && \
        /usr/bin/curl -o /dev/null --fail -s http://localhost:80/health && break
    sleep 1
done

This gets you the following behaviors:

- during startup the check blocks instead of exiting with a curl error, so the "exit status 7" noise goes away
- the passing heartbeat (and with it the Consul registration) doesn't happen until the app is actually accepting connections

BobDickinson commented 8 years ago

Ah, thanks for that. I didn't realize that the 7 was coming from my curl call.

And just to be clear, are you saying that I can change the "health" attribute in containerpilot.json to point to a script? I guess that makes sense if so; it just didn't occur to me.

Your solution is pretty close. I guess my concern is that for this to work the way I'd like, there needs to be some state. It's acceptable for my process to be non-listening for some period of time after startup; after that, it's an error. And if it stops listening at any point after it has ever been listening, that is also an error.

The timeout part I can handle in the health script, but I'm not sure how to tell whether this is the initial health check or some subsequent check (where I'd want to fail immediately if not listening).

tgross commented 8 years ago

And just to be clear, are you saying that I can change the "health" attribute in containerpilot.json to point to a script?

Yup! The user-defined hooks like this can all be arbitrarily rich pieces of software in their own right. (Check out autopilotpattern/mysql for the "extreme" version.)
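
For your config that just means swapping the inline curl command for a script path (the filename here is only a placeholder):

"services": [
  {
    "name": "synchro",
    "port": 80,
    "health": "/bin/health-check.sh",
    "poll": 3,
    "ttl": 10
  }
]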

It's acceptable for my process to be non-listening for some period of time after startup, and after that, it's an error.

Right, but if it isn't listening yet, should it really be passing health checks? If you just let the health check fail, then it won't be registered (and thereby announced to other instances) until it's really ready to do work.

BobDickinson commented 8 years ago

I understood your suggestion to be basically "don't return from the health check until you're listening," which works great for my delayed-startup case on the first health check. Not responding keeps the service from getting registered, since it isn't ready yet.

But let's say I'm started up and a subsequent health check gets called, and again I find that I'm not listening. In that case it means my service has died, and I need to error out immediately. But how does my health check know that my server was ever started/healthy, given that the check has no state? I guess I could look at Consul to see whether I'm registered, but I'm not sure that's in the spirit of a health check (now my health check can fail when Consul is unhealthy).
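
The only stateful workaround I can picture is a marker file the check writes on its first success, something like this (the marker path is arbitrary and assumes a writable filesystem):

#!/bin/sh
# Stateful health check sketch: block while the app is still starting,
# but fail fast once it has been healthy at least once.
MARKER=/var/run/synchro-was-healthy

check() {
    /usr/bin/curl -o /dev/null --fail -s http://localhost:80/health
}

if [ -f "$MARKER" ]; then
    # We've been healthy before, so a failure now means the app died.
    check || exit 1
else
    # First run: wait for the app to finish starting, then leave the marker.
    until check; do sleep 1; done
    touch "$MARKER"
fi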

I'd honestly be fine just leaving the health check as-is and letting it fail while waiting for startup, but I'm not happy about the "exit status 7" message (I have customers who are going to see that).

Is it more appropriate for me to write my own launcher script and have that launch my app and wait for it to listen? How is ContainerPilot going to feel about my main command not returning for 15-20 seconds? I'm assuming it won't try any health checks until I'm done?

tgross commented 8 years ago

In that case, it means my service has died and I need to error out immediately.

ContainerPilot doesn't send a message when you're unhealthy, specifically to accommodate startup times. The health check is a heartbeat, and if it isn't sent, the TTL expires. If we relied on ContainerPilot being able to send an "I'm unhealthy" message to the service catalog, we'd be in trouble whenever the container (or the server it's on) crashes completely or has a network partition.
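
For context, that heartbeat is Consul's TTL check mechanism: each passing health check results in something roughly like this against the Consul agent (the check ID format shown is only illustrative):

# Tell Consul the TTL check passed; if these stop arriving, the TTL
# expires and Consul marks the service critical on its own.
curl -X PUT http://consul:8500/v1/agent/check/pass/service:synchro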

Is it more appropriate for me to write my own launcher script and have that launch my app and wait for it to listen? How is ContainerPilot going to feel about my main command not returning for 15-20 seconds? I'm assuming it won't try any health checks until I'm done?

The health checks run on their own threads/goroutines so they run in parallel with the main application. There's no way for ContainerPilot to know you're "done," so the health check polling starts immediately (well, after the number of seconds in poll).

BobDickinson commented 8 years ago

OK, so if I were to give my service 30 seconds to start listening in the health check, and I hit the case I'm talking about (a health check subsequent to startup finds a non-listening service), waiting 30 seconds on that check isn't really going to matter, because the TTL is going to expire my service before that anyway: with poll at 3 and ttl at 10, Consul would mark the service critical within about 10 seconds of the last heartbeat. I was excited about the idea of failing immediately in that case, but presumably anything within the TTL is acceptable (more or less by definition).

tgross commented 8 years ago

I think we've wrapped this one up. Closing, but feel free to re-open if you think the issue is unresolved, @BobDickinson.