Better way to determine if process has started

msabramo commented 9 years ago

My understanding is that supervisor's only way of knowing if a process has started is if startsecs seconds have elapsed and the process hasn't died. Correct me if I'm wrong.

We routinely run into problems where folks have specified, say, startsecs = 5, and then the system is a bit slow, so the process takes longer than usual to initialize. It initializes for, say, 10 seconds and then dies because it can't get a connection to a database or something. Because 5 seconds have elapsed, supervisor considers the process to have started successfully and then died, so it restarts the process. This repeats ad infinitum: the process goes up and down, CPU usage on the box becomes very high, and that makes other processes slow, possibly resulting in cascading failures.

It is difficult to pick a startsecs high enough to prevent this. A high value is also undesirable: you have to wait a long time to get confirmation that the process has started, and even then you have no guarantee that it won't die a couple of seconds later.

I wonder if supervisor could have a healthcheck_command option for processes so that you can tell supervisor a better way to determine whether the process has started successfully. For example this could be set to check if the process is listening on a port or it could do a simple HTTP request.

This would take precedence over startsecs. As soon as the command goes from failing to succeeding, the process is considered to have started.
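A minimal sketch of the polling this would imply, in Python. Note that healthcheck_command is only the option proposed in this issue, not an existing supervisor setting, and the function name `wait_for_healthcheck` and its parameters are hypothetical. The check command is rerun until it exits 0 or an overall timeout expires, and each attempt is time-bounded so a hanging check counts as a failure.

```python
import subprocess
import time


def wait_for_healthcheck(command, timeout=30.0, interval=1.0, check_timeout=5.0):
    """Poll `command` until it exits 0. Return True on success, False on timeout.

    Hypothetical helper: supervisor has no such option today.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Bound each attempt so a check that never exits cannot stall the loop.
            result = subprocess.run(
                command,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=check_timeout,
            )
            if result.returncode == 0:
                return True  # First transition from failing to succeeding.
        except subprocess.TimeoutExpired:
            pass  # Treat a hanging check as a failed attempt.
        time.sleep(interval)
    return False
```

For example, `wait_for_healthcheck(["pg_isready", "-q"], timeout=60)` would wait up to a minute for a local PostgreSQL server, assuming `pg_isready` is on the PATH.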

Thoughts?

msabramo commented 9 years ago

@mnaberez, @mcdonc: Thoughts on this? I feel like there might be an easier way involving clever use of settings or plugins or something and I'm not seeing it.

msabramo commented 9 years ago

For instance, maybe a monitoring plugin like superlance could set the process status. But I've never actually used superlance.

elgalu commented 9 years ago

+1 for healthcheck_command; seconds-based waiting is inefficient.

elgalu commented 9 years ago

For example, the last line here would be the healthcheck_command and would avoid unnecessary waiting:

echo "Waiting for PostgreSQL server to be ready..."
# pg_isready -- check the connection status of a PostgreSQL server
while ! pg_isready -q --host=localhost --port=5432; do sleep 1; done

Lucretiel commented 9 years ago

Anything that involves new commands makes me nervous... many of the same issues with the processes themselves can apply to the support commands (what if they exit nonzero? What if they never exit? etc). Many similar problems were brought up in #147, which proposed a stop command for programs. Would it be possible/better to enumerate the most common cases (for instance, wait for a port, or wait for a particular pattern from stdout/stderr), and consider support for those, instead?
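As a rough illustration of the "wait for a port" case, here is a Python sketch. The helper name `wait_for_port` and its defaults are assumptions for illustration, not anything supervisor provides. It retries a TCP connection until one succeeds or the timeout expires.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect only proves that something is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # Refused or timed out: retry until the deadline.
    return False
```

Because this check has no external command, the "what if it never exits" concern does not arise, but it also cannot tell whether the service behind the port is ready to do real work.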

elgalu commented 9 years ago

Well, waiting for a port is not enough in the PostgreSQL example provided, and I can find more cases like it.

The fact that something listens on port 5432 doesn't mean PostgreSQL is ready and accepting connections. @Lucretiel has probably enumerated the most common cases: waiting for a port and waiting for a pattern in stdout/stderr, as I also do here and here. But consider that the user may need to process the output a bit, like I do here, so a command is the most flexible approach.

IuryAlves commented 8 years ago

+1

AaronOpfer commented 8 years ago

I also have this issue. However, instead of having a command to run to indicate that the application has successfully started, I propose that the application output a special value (user configurable?) to STDOUT, which supervisord would use as a signal that the application is up and ready.

For instance, if an application takes anywhere between 10 and 20 seconds to load from a database, the application could output "SUPERVISOR-READY" to stdout, on its own line with no other spacing, once loading has completed successfully. Supervisor could then detect this output and mark the process as successfully started.

IuryAlves commented 8 years ago

I disagree. I don't want to change my program to work with supervisor. Supervisor must know that the program has started without changing the program to do so.

AaronOpfer commented 8 years ago

If the string supervisor is looking for were user-configurable, the application wouldn't necessarily need to change. If your application logs "Database loaded successfully" to stdout, and that's when you consider your application to be alive, you could configure supervisor to watch for that.
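A minimal Python sketch of that idea, assuming a user-supplied regex. The function name `start_and_wait_for_pattern` is hypothetical, and supervisor has no such setting today. It starts the process, reads its output line by line, and reports readiness at the first match. For brevity it has no timeout and stops draining output after the match, both of which a real supervisor would need to handle.

```python
import re
import subprocess


def start_and_wait_for_pattern(command, ready_pattern):
    """Start `command` and return (process, ready).

    `ready` is True if a line of output matched `ready_pattern`,
    False if the output ended first (usually because the process exited).
    """
    pattern = re.compile(ready_pattern)
    process = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # Watch stderr as well as stdout.
        text=True,
    )
    for line in process.stdout:
        if pattern.search(line):
            return process, True
    # Output closed without a match: the process never reported readiness.
    process.wait()
    return process, False
```

With `ready_pattern` set to `r"Database loaded successfully"`, an unmodified application that already logs that line would be marked as started the moment it prints it.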

Lucretiel commented 8 years ago

I agree with both points here. The idea of having to modify your application to work with supervisor is, of course, silly, but I think having supervisor await a message (probably matching a regex) is the best solution here (along with waiting for a port to be accepting connections). I don't like the script solution because all you've done is defer the problem. Sure, in an ideal world the check script will always terminate with 0 or an error return code, but in that same world, the long-running processes we're supervising will terminate themselves if they encounter an error, too.

Lucretiel commented 8 years ago

That being said, I could probably be convinced that a script solution is appropriate, if it is sufficiently restrained in its usage. For instance, you could specify a script instead of a regex for monitoring stdout; in this case, all of the main command's stdout is tee'd into the stdin of that script, which then can monitor for any amount of arbitrarily complex output before exiting 0 for success or nonzero for error.
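A sketch of what such a restrained check script could look like in Python, under the assumption that the main command's output arrives line by line on the script's stdin. The two marker patterns are hypothetical placeholders. The core is a pure function that maps a stream of lines to an exit code.

```python
import re

# Hypothetical markers; a real script would use whatever the application prints.
READY = re.compile(r"Database loaded successfully")
FAILED = re.compile(r"FATAL|Traceback")


def check_stream(lines):
    """Return 0 at the first ready line, 1 at the first failure line or at end of input."""
    for line in lines:
        if READY.search(line):
            return 0
        if FAILED.search(line):
            return 1
    return 1  # Input ended before the application reported readiness.
```

Run as a standalone script, it would finish with `sys.exit(check_stream(sys.stdin))`. Since it exits when its input ends, the script cannot outlive the process it is monitoring, though it can still wait indefinitely if the process stays alive without printing either marker.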

AaronOpfer commented 8 years ago

I agree with Lucretiel that using a script to accomplish this sounds like a bad idea because of all of the inherent problems of dealing with processes (which supervisor itself was designed to solve). It also could tempt users into making poor choices with their configurations (I can imagine someone abusing this feature to make a "cache warmer" script run alongside their app, or some other kind of important startup-time initialization). On the other hand, a STDOUT/STDERR regex is simple, easy to understand and difficult to abuse.

mikeyg123 commented 8 years ago

Specifying a regex or even a static string to match in the stdout to indicate that a process has started would really help us a great deal. So far we've migrated nearly 100 different processes to our supervisor-based environment, but we have a few tricky ones remaining that can take a widely variable amount of time to restart. We also have a rolling restart script that restarts a batch of processes and waits for all of them to start before moving on to the next batch. Having a reliable way to know that processes are properly up is crucial here to prevent service outages.

augustr commented 6 years ago

+1 for this feature request.

Currently using systemd services but want to have something that works well in a Docker container. Using Type=notify and systemd-notify with systemd services works well (although I've heard that there might be problems with systemd-notify) and something similar for supervisord would be great.
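For reference, the application side of systemd's notification protocol is small: the service sends the datagram `READY=1` to the Unix socket named in the `NOTIFY_SOCKET` environment variable. A Python sketch follows; the helper name `notify_ready` is an assumption, and it shows only the sending side, not how supervisord might receive such a message.

```python
import os
import socket


def notify_ready(message=b"READY=1"):
    """Send a readiness datagram to $NOTIFY_SOCKET. Return False if it is unset."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False  # Not started by a notify-aware service manager.
    if path.startswith("@"):
        path = "\0" + path[1:]  # A leading "@" denotes a Linux abstract socket.
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, path)
    return True
```

Unlike a stdout marker, this requires the application to cooperate explicitly, which is the trade-off raised earlier in the thread.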

Jason-2020 commented 4 years ago

+1 for this feature. For example, our app starts really slowly and startsecs doesn't solve our problem. When someone restarts the app, supervisorctl says it's RUNNING, but in reality the app is still starting and crashes after a minute or so because of a bug. Supervisor restarts it again and again, and this behaviour doesn't count as an ERROR / FATAL state because the process was already up for some time. If supervisor implemented a health check command, it would truly make sense to assume the app is up and RUNNING.

lfxx commented 3 years ago

+1 for this feature. Any update here?

Better way to determine if process has started #584