LearnBoost / up

Zero-downtime reloads and requests load balancer based on distribute.
540 stars 73 forks source link

Working Directory, Signals, Respawn and Internals. #37

Closed gjohnson closed 10 years ago

gjohnson commented 12 years ago

Sorry for the single pull request, committed too much at once...

Working Directory

This adds the ability for the CLI to accept the option -d/--directory to change the location of process.chdir.

Signals

This adds the following signals for the CLI to bind to. Right now they are bound no matter what, but perhaps there should be an option -s/--signals for opting out.

- `SIGTTIN`: Spawn a new worker to the round.
- `SIGTTOU`: Remove a worker from the round.
- `SIGTERM`: Hard shutdown of master and workers.
- `SIGQUIT`: Gracefully shutdown workers and the master.
- `SIGUSR1`: Gracefully shutdown workers but keep the master up.

Respawning

This adds the feature of spawning a new worker after an existing one exits. There are two conditions that must be satisfied in order for this to happen.

- The worker needs to have exited unexpectedly (e.g. worker.exitCode === 1). This is accomplished simply by checking the `exit` from exit event emitted by a child_process.
- The worker needs to have an uptime greater than the `uptimeThreshold` which can be set through the CLI options via `-u/--uptime`.

API and Internal Changes

To add some of the features above, I changed UpServer and Worker a good bit. Previously UpServer#reload was the only hook into spawning and shutting down workers. I have added:

- UpServer#shutdownWorkers();
- UpSerever#shutdown();
- UpServer#exit();
- UpServer#removeWorker();
- Worker#exit();

UpServer.reload() is alot slimmer now, these changes may be debatable. Before, if there was a workerTimeout set, new workers were spawned and upon the first worker being spawned, the old workers were shutdown. However, if there was no workerTimeout set, the workers were shutdown and after the last worker died, the new workers were spawned. Unless I was looking at it wrong, it did not make a whole lot of sense, so I trimmed it down to only the first case.

To handle the re-spawning without spinning into a cycical disaster, I simply bind to the terminate event in UpServer ctor. If the worker exitCode (which is set via child_process.on('exit')) is 1, we check the uptime of the worker against the uptimeThreshold. If all is good, we respawn.

tj commented 12 years ago

LGTM

gjohnson commented 12 years ago

Also mean't to mention in the reasoning for refactoring UpServer.reload(). I choose not to check workerTimeout as it was before because the worker child_process is in charge of exiting itself after Worker.shutdown() is invoked, via IPC msg.type "die". Shouldn't really matter what the workerTimeout is, just call shutdown and they will die at some point.

yields commented 12 years ago

+1

JonGretar commented 12 years ago

+1 . Badly needed.

rauchg commented 12 years ago

I'd call -d -C

rauchg commented 12 years ago

This is a great pull @gjohnson

gjohnson commented 12 years ago

Thanks @guille! Let me know if I can chop up the PR a bit to make life easier (if your planning on using it).

rauchg commented 12 years ago

That'd be my only change – then I'll merge

gjohnson commented 12 years ago

I will whip up some docs later this week for the new methods and signals.

frankmayer commented 12 years ago

+1 Great Stuff. Merge?

rauchg commented 12 years ago

It's not applying cleanly, and we need docs.

gjohnson commented 12 years ago

Yeah, there have been some other commits since this that would effect on the pull. Plus, I suck at life and still haven't written docs, soooo boring. :-) I can pull upstream in and handle the mangled merge my self. Stay tuned...

Sent from my iPhone

On Sep 16, 2012, at 2:08 PM, Guillermo Rauch notifications@github.com wrote:

It's not applying cleanly, and we need docs.

— Reply to this email directly or view it on GitHub.

rauchg commented 12 years ago

Yeah no rush @gjohnson thanks for taking the time in the first place.

gjohnson commented 12 years ago

I am working on merging these together. The biggest issue will be a conflict between my worker reload and the one recently merged in #42. The implementation is conceptually similar, I just check exit codes, etc. However #42 uses --keepalive and I use --uptime, IMO since we are doing http, keepalive could be confused with "keep-alive" headers... either works though, your call, naming is hard. lol