LearnBoost / cluster

Node.JS multi-core server manager with plugins support.
http://learnboost.github.com/cluster
MIT License

Address in Use on Worker Restart? #121

Open jeffv opened 13 years ago

jeffv commented 13 years ago

This only happens once or twice per month and has been very difficult to debug and nigh on impossible to replicate. Seeing as the issue seems to originate deep inside cluster, I thought I'd ping you and get your opinion.

What seems to be happening based on the log output is that cluster is attempting to restart the master process, and I'm not sure why.

RELEVANT CODE:

var proc = cluster(app)
  .set('workers', numberOfWorkers)
  .listen(port);

if (!proc.isWorker) {
  console.log(
    ':::', ' Redman Mobile '.bold, mvc.version.magenta,
    ', loading on port ', String(port).yellow,
    ' with ', String(numberOfWorkers).magenta, ' worker(s):::'
  );
  require('./lib/staticBuilder.js');
  require('./lib/mobileServer/tasks.js');
}

ERROR LOG:

::: Redman Mobile 1.1.0 , loading on port 9296 with 2 worker(s):::

node.js:134
        throw e; // process.nextTick error, or 'error' event on first tick
        ^
Error: EADDRINUSE, Address already in use
    at listen (/www/node/redman-mobile/node_modules/cluster/lib/master.js:791:17)
    at Master.startListening (/www/node/redman-mobile/node_modules/cluster/lib/master.js:784:5)
    at Master.<anonymous> (/www/node/redman-mobile/node_modules/cluster/lib/master.js:294:12)
    at Master.emit (events.js:81:20)
    at Server.<anonymous> (/www/node/redman-mobile/node_modules/cluster/lib/master.js:326:10)
    at Server.emit (events.js:61:17)
    at Server._startWatcher (net.js:1088:8)
    at Array.<anonymous> (net.js:1121:10)
    at EventEmitter._tickCallback (node.js:126:26)
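
For context, EADDRINUSE is what you get whenever a second bind() hits a port that another socket already holds. A minimal sketch that reproduces the same error shape (nothing here is specific to this app beyond the port number):

var net = require('net');

// First server takes the port.
var first = net.createServer();
first.listen(9296, function () {
  // A second bind to the same port fails with EADDRINUSE; without an
  // 'error' listener it would be thrown, just like the trace above.
  var second = net.createServer();
  second.on('error', function (err) {
    console.error('second bind failed:', err.code); // EADDRINUSE
  });
  second.listen(9296);
});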

tj commented 13 years ago

hmmmm only the original master should be bind()ing
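For readers following along, the invariant being described is the usual master/worker split: a single master owns the bind, and workers share that listening socket rather than calling bind() themselves. A rough sketch of the same idea using Node's built-in cluster module (not this library, purely an illustration of the invariant):

var cluster = require('cluster');
var http = require('http');

if (cluster.isMaster) {
  // Only the master exists here; it never races another process for
  // the port, it just forks the workers.
  for (var i = 0; i < 2; i++) cluster.fork();
} else {
  // Each worker calls listen(), but the actual bind() happens once in
  // the master and the socket is shared, so there is no EADDRINUSE.
  http.createServer(function (req, res) {
    res.end('ok');
  }).listen(9296);
}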

jeffv commented 13 years ago

That's what I thought. If the master dies it should just die, right?

For some reason I'm getting the echo that's only intended to show when it isn't a worker. It just repeats ad nauseam in our logs, which aren't very good. The whole thing doesn't go down either: at least one process is still running, even though the others are returning a 500-range error (I forget exactly which).

In the meantime I've turned on cluster logging in hopes it'll provide more clues.
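
For reference, a sketch of how that logging is typically switched on, assuming the bundled logger plugin; the 'logs' directory and the 'debug' level below are assumptions, not what this app necessarily uses:

var proc = cluster(app)
  .set('workers', numberOfWorkers)
  // Write per-process logs to ./logs at 'debug' verbosity (assumed settings).
  .use(cluster.logger('logs', 'debug'))
  .listen(port);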

tj commented 13 years ago

yeah if master is killed by anything but SIGKILL (obviously) the rest should crumble
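A rough sketch of that expectation, outside of this library: the master keeps handles to its forked workers and tears them down on any signal it can actually trap (SIGKILL, of course, cannot be caught). The workers array below is a placeholder for however the master tracks its children:

// Assumed to be filled with child_process handles as workers are forked.
var workers = [];

['SIGTERM', 'SIGINT'].forEach(function (sig) {
  process.on(sig, function () {
    // Take the workers down with the master, then exit.
    workers.forEach(function (worker) {
      worker.kill('SIGTERM');
    });
    process.exit(0);
  });
});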

tj commented 13 years ago

hmm yeah they don't seem to be, i'll look into it

tj commented 13 years ago

ff941d5 should help

jeffv commented 13 years ago

Yah. That certainly looks promising. I'm not in a position to deploy it just yet to test it, unfortunately. Still trying to deal with our own uncaught exceptions that are triggering the situation in the first place.
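
A minimal sketch of the kind of guard being described here, using Node's standard process-level hook (where it belongs in this particular app is an open question): log the uncaught exception and exit deliberately, so the supervisor (cluster, Monit, etc.) restarts a clean process instead of leaving a wedged one behind.

process.on('uncaughtException', function (err) {
  console.error('uncaught exception:', err.stack || err);
  process.exit(1);
});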

Here's the log output from a crash (in case it gives you any other fix ideas):

[Sun, 17 Jul 2011 20:36:36 GMT] INFO master started
[Sun, 17 Jul 2011 20:36:36 GMT] INFO master started
[Sun, 17 Jul 2011 20:36:37 GMT] INFO master started
[Sun, 17 Jul 2011 20:36:37 GMT] INFO master started
[Sun, 17 Jul 2011 20:36:38 GMT] INFO master started
[Sun, 17 Jul 2011 20:36:38 GMT] INFO master started
[Sun, 17 Jul 2011 20:36:39 GMT] INFO master started
[Sun, 17 Jul 2011 20:36:39 GMT] INFO master started
[Sun, 17 Jul 2011 20:36:39 GMT] INFO master started
[Sun, 17 Jul 2011 20:36:40 GMT] INFO master started
[Sun, 17 Jul 2011 20:36:40 GMT] INFO master started
[Sun, 17 Jul 2011 20:37:17 GMT] INFO master started
[Sun, 17 Jul 2011 20:37:17 GMT] INFO spawned worker 0
[Sun, 17 Jul 2011 20:37:17 GMT] INFO spawned worker 1
[Sun, 17 Jul 2011 20:37:47 GMT] WARNING shutting down master
[Sun, 17 Jul 2011 20:37:53 GMT] INFO master started
[Sun, 17 Jul 2011 20:37:53 GMT] INFO spawned worker 0
[Sun, 17 Jul 2011 20:37:53 GMT] INFO spawned worker 1
[Sun, 17 Jul 2011 20:38:21 GMT] ERROR worker 0 died
[Sun, 17 Jul 2011 20:38:21 GMT] INFO spawned worker 0
[Sun, 17 Jul 2011 20:38:25 GMT] ERROR worker 1 died
[Sun, 17 Jul 2011 20:38:25 GMT] INFO spawned worker 1
[Sun, 17 Jul 2011 20:39:00 GMT] INFO master started
[Sun, 17 Jul 2011 20:39:00 GMT] INFO master started
[Sun, 17 Jul 2011 20:39:01 GMT] INFO master started
[Sun, 17 Jul 2011 20:39:01 GMT] INFO master started
[Sun, 17 Jul 2011 20:39:02 GMT] INFO master started
[Sun, 17 Jul 2011 20:39:02 GMT] INFO master started
[Sun, 17 Jul 2011 20:39:02 GMT] INFO master started
[Sun, 17 Jul 2011 20:39:03 GMT] INFO master started
[Sun, 17 Jul 2011 20:39:03 GMT] INFO master started
[Sun, 17 Jul 2011 20:39:04 GMT] INFO master started
[Sun, 17 Jul 2011 20:39:04 GMT] INFO master started
[Sun, 17 Jul 2011 20:39:05 GMT] INFO master started
[Sun, 17 Jul 2011 20:39:05 GMT] INFO master started
[Sun, 17 Jul 2011 20:39:06 GMT] INFO master started
[Sun, 17 Jul 2011 20:39:06 GMT] INFO master started
[Sun, 17 Jul 2011 20:39:07 GMT] INFO master started
[Sun, 17 Jul 2011 20:39:07 GMT] INFO master started
[Sun, 17 Jul 2011 20:39:07 GMT] INFO master started
[Sun, 17 Jul 2011 20:39:08 GMT] INFO master started
[Sun, 17 Jul 2011 20:39:08 GMT] INFO master started
[Sun, 17 Jul 2011 20:39:39 GMT] INFO master started
[Sun, 17 Jul 2011 20:39:39 GMT] INFO spawned worker 0
[Sun, 17 Jul 2011 20:39:39 GMT] INFO spawned worker 1

Nothing before. And nothing after. At 20:39:39 GMT the Monit daemon detected the crash and restarted the app.