LearnBoost / cluster

Node.JS multi-core server manager with plugins support.
http://learnboost.github.com/cluster
MIT License
2.29k stars 159 forks source link

Standalone buggy restart with SIGUSR2 #125

Closed ybogdanov closed 13 years ago

ybogdanov commented 13 years ago

Hi,

I'm using the examples/standalone.js example to test this.

/**
 * Module dependencies.
 */

var cluster = require('../');

var proc = cluster()
  .set('workers', 4)
  .use(cluster.debug())
  .start();

if (proc.isWorker) {
  var id = process.env.CLUSTER_WORKER;
  console.log('  worker #%d started', id);
  setInterval(function(){
    console.log('  processing job from worker #%d', id);
  }, 1000);
}

When you start the cluster, everything goes fine:

$ node examples/standalone.js 
  info - master started
  info - worker 0 spawned
  info - worker 1 spawned
  info - worker 2 spawned
  info - worker 3 spawned
  info - listening for connections
  worker #0 started
  info - worker 0 connected
  worker #1 started
  info - worker 1 connected
  worker #2 started
  info - worker 2 connected
  worker #3 started
  info - worker 3 connected

Node processes:

$ ps aux | grep node
   11537 /opt/local/bin/node ~/node-cluster/examples/standalone.js #worker
   11538 /opt/local/bin/node ~/node-cluster/examples/standalone.js #worker
   11539 /opt/local/bin/node ~/node-cluster/examples/standalone.js #worker
   11540 /opt/local/bin/node ~/node-cluster/examples/standalone.js #worker
   11536 node examples/standalone.js #master

SIGUSR2 spawns new workers but doesn't kill the old ones

$ kill -USR2 11536
$ ps aux | grep node
   11537 /opt/local/bin/node ~/node-cluster/examples/standalone.js #old worker
   11538 /opt/local/bin/node ~/node-cluster/examples/standalone.js #old worker
   11539 /opt/local/bin/node ~/node-cluster/examples/standalone.js #old worker
   11540 /opt/local/bin/node ~/node-cluster/examples/standalone.js #old worker
   11536 node examples/standalone.js #old master
   11625 /opt/local/bin/node ~/node-cluster/examples/standalone.js #new worker
   11626 /opt/local/bin/node ~/node-cluster/examples/standalone.js #new worker
   11627 /opt/local/bin/node ~/node-cluster/examples/standalone.js #new worker
   11628 /opt/local/bin/node ~/node-cluster/examples/standalone.js #new worker
   11624 /opt/local/bin/node ~/node-cluster/examples/standalone.js #new master
ybogdanov commented 13 years ago

Ok, This problem appears only if you use cluster.start() (without socket binding). If you edit standalone.js and change .start() to .listen('/path/to/socket'), restart will work correctly:

var proc = cluster()
  .set('workers', 4)
  .use(cluster.debug())
  .listen('/tmp/cluster.sock');
tj commented 13 years ago

interesting, I'll have a look at this thanks for the report

tj commented 13 years ago

should be fixed in the most recent release, forgot to tag this issue

ybogdanov commented 13 years ago

nope, it doesn't :) I'll try to make a test for it.

ybogdanov commented 13 years ago

With 0.6.6 even listen() behave same as start():

var proc = cluster()
  .set('workers', 4)
  .use(cluster.debug())
  .listen('/tmp/cluster.sock');

(SIGUSR2 doesn't kills old processes)

felixge commented 13 years ago

I have a test case:

https://github.com/felixge/cluster/commit/6da2f0ed938fb85410c0c6dbfefdce259d2e6139

It's not perfect, but in most cases it should exit with 1 right now. The problem is a race condition here:

https://github.com/LearnBoost/cluster/blob/master/lib/master.js#L720

There is no guarantee that 'listening' will not already have fired when the new master receives the 'connectMaster' instruction from its parent. This race condition probably affects both regular and standalone clusters, but the standalone ones are more likely to run into it because their 'listening' method is fired earlier:

https://github.com/LearnBoost/cluster/blob/master/lib/master.js#L275

I'm not entirely sure what fix to go for, but one idea is to keep track if 'listening' already fired, and if so, don't wait for it in the connectMaster() method.

--fg

tj commented 13 years ago

yeah I think we can get away with simply tracking if has[ not] happened yet. It's not necessarily required but I think it's best and for some future functionality that the parent sticks around until the child is fully booted and ready to serve.

tj commented 13 years ago

thanks for the test case! I didnt realize until yesterday that I had none for stand-alone

felixge commented 13 years ago

Thanks for the fix! I didn't realize that cluster had a standalone mode until I saw this ticket : )! That's really awesome for running worker scripts!

tj commented 13 years ago

totally! It works great with Kue

ybogdanov commented 13 years ago

Thank you, guys!

I'm working on node-cloud - queue-based jobs processing in a cloud. Cluster's standalone feature makes it multicore-scalable in few lines of code!

Great job!

tj commented 13 years ago

@0ctave cool man!

felixge commented 13 years ago

totally! It works great with Kue

Wow, that looks really rad. I'll check it out, if I can use it I'll send patches : )