LearnBoost / cluster

Node.JS multi-core server manager with plugins support.
http://learnboost.github.com/cluster
MIT License
2.29k stars 159 forks source link

Workers exception handling is broken? #91

Closed Arech closed 13 years ago

Arech commented 13 years ago

Tried to add random exception in standalone.js worker use-case and got exception inside logger plugin in logger.js:81

master.on('worker exception', function(worker, err){ 81: log.error('worker ' + worker.id + ' uncaught exception ' + err.message); });

while trying to reference err.message. A closer look into code shows that Master.workerException() function expects two parameters:

master.js:844 Master.prototype.workerException = function(worker, err){

while Worker passes only one - exception object:

worker.js:104 self.master.call('workerException', err);

self.master.call('workerException', err);

tj commented 13 years ago

the worker's call() method is augmented to supply itself

tj commented 13 years ago

basically it just sends it's own id as well which is mapped in the master process to a Worker instance

Arech commented 13 years ago

Mmmm... Well, yeah... But obviously it is Error object passed as "worker"-parameter to master.on('worker exception', function(worker, err) in logger.js:80, not the Worker object as the code expects. And the second parameter is undefined in that case

May be the error is specific to standalone use-case? I think you may try to replicate the exception by adding .use(logger()) and throwing exception somewhere in setInterval(function(){}) for worker branch of standalone.js. Something like:

  var cluster = require('cluster')
, proc = cluster()
  .set('socket path', '/var/tmp')
  .use(cluster.logger('_logs',7))
  .start();

if (proc.isWorker) {
    var id = process.env.CLUSTER_WORKER;
    console.log(' worker #%d started', id);
    setInterval(function(){
        console.log(' processing job for worker #%d', id);
        if (Math.random()>0.8) throw new Error('Oups, hello exception!');
    }, 1000);   
} else {
    setTimeout(function(){
        proc.close();
    }, 10000);
}
Arech commented 13 years ago

By the way, here is the part of receiver.js:

        var worker
          , obj = JSON.parse(this._buf[id]);
        this._buf[id] = '';
        if ('number' == typeof obj.id) worker = this.children[obj.id];
        this.invoke(obj.method, obj.args, worker);

If for any reason obj.id is not a number, worker variable is going to be undefined, and that is unexpected...

Furthemore, if you put something like console.log("Id is %d",self.id); into this.master.call overload in worker.js:69, you will get a "Id is NaN" message.

Looks like the bug is really related to standalone use-case...

Arech commented 13 years ago

Yes. Looks like someone completely forgot to set worker.id in spawned process. Put

worker.id = parseInt( process.env.CLUSTER_WORKER );

into master.js:256 after var worker = new Worker(this); and it will start to work like a charm without exceptions as expected.

Btw, first IF() in Master.spawnWorker is better to refactor into something like

if ('number' == typeof id) {  
    worker = new Worker(this).spawn(id)
    this.children[id] = worker;
    //worker.id = id;
  } else {// generate an id
    id=this.children.length;
    worker = new Worker(this).spawn(id);
    this.children.push(worker);
  }
  worker.id = id;

or you'll miss worker.id in else{} branch. I'm not sure it is essential, because there is a this.id=id construct in Worker.spawn(), but why there was a worker.id = id; statement after this.children[id] = worker; inside of if(){} ?

tj commented 13 years ago

the id is already generated, when the id is explicitly set it's so that a worker can die / be replaced but retain the same id instead of just pushing to the array creating a wide range of ids. The only time the id should not map properly like you are mentioning is when a worker fails immediately before it's id is known (via the socketpair)

Arech commented 13 years ago

Emmm, I'm sorry, I don't understand how to interpret you answer. Have you reproduced the exception in standalone use-case ?

Arech commented 13 years ago

Let me try to comment

the id is already generated, when the id is explicitly set it's so that a worker can die / be replaced but retain the same id instead of just pushing to the array creating a wide range of ids. The only time the id should not map properly like you are mentioning is when a worker fails immediately before it's id is known (via the socketpair)

There are two "storage places" of ID's in the cluster: ID of worker object inside master process (it is really explicitly set and worker can die and be replaced..., yes, true), and the ID of worker inside of spawned worker process. Looks like the latter ID is not set anywhere (at least in standalone use-case). This ID is passed to the spawned process via process.env.CLUSTER_WORKER variable. It looks like this var is never assigned to the ID used by Master.call() function anywhere in spawned process. I've provided a patch for this.

tj commented 13 years ago
  var obj = {
      method: 'connect'
    , args: [worker.id, this.options]
  };
  worker.sock.write(utils.frame(obj), 'ascii', this.fd);

this is where it's passed to the worker, along with some options and the fd

tj commented 13 years ago

so it should be fine unless the worker has been killed before even receiving that initial data, but yeah we could CLUSTER_WORKER as well (which I just added recently)

Arech commented 13 years ago

Dear visionmedia. I haven't written this long thread because I know, how it should work. I have written it because I know how it works under given conditions (standalone, worker is not killed before receiving inital data). Did you run the test case?

tj commented 13 years ago

took a look at your example, the worker definitely does not seem to be receiving the connect call. looking into it

tj commented 13 years ago

oh nvm i think i know what it is haha should be dead easy to fix