LearnBoost / cluster

Node.JS multi-core server manager with plugins support.
http://learnboost.github.com/cluster
MIT License
2.29k stars 159 forks source link

Spiky EINVAL #47

Closed dvv closed 13 years ago

dvv commented 13 years ago

The simplest server test.js:

var cluster, http, server;
cluster = require('cluster');
http = require('http');
server = http.createServer(function(req, res) {
  var body;
  console.log('%s %s', req.method, req.url);
  body = 'Hello World';
  res.writeHead(200, {
    'Content-Length': body.length
  });
  return res.end(body);
});
cluster(server).set('workers',4).use(cluster.debug()).use(cluster.reload(['test.js'])).use(cluster.stats()).use(cluster.repl(30000)).listen(3000);

Below is typical debug output when starting workers. What can be the reason?

info - master started info - worker 0 spawned info - worker 1 spawned info - worker 2 spawned info - worker 3 spawned info - worker 0 connected info - worker 3 connected Error: EINVAL, Invalid argument at IOWatcher.callback (net.js:878:24) error - worker 0 uncaught exception EINVAL, Invalid argument Error: EINVAL, Invalid argument at IOWatcher.callback (net.js:878:24) error - worker 3 uncaught exception EINVAL, Invalid argument warning - worker 0 died info - worker 0 spawned warning - worker 3 died info - worker 3 spawned info - worker 1 connected info - worker 2 connected info - listening for connections Error: EINVAL, Invalid argument at IOWatcher.callback (net.js:878:24) error - worker 2 uncaught exception EINVAL, Invalid argument Error: EINVAL, Invalid argument at IOWatcher.callback (net.js:878:24) error - worker 1 uncaught exception EINVAL, Invalid argument warning - worker 2 died info - worker 2 spawned warning - worker 1 died info - worker 1 spawned info - worker 3 connected info - worker 0 connected info - worker 1 connected info - worker 2 connected info - shutting down warning - kill(SIGKILL) info - shutdown complete warning - worker 3 died debug - exit

dvv commented 13 years ago

That is when colorful output is not so good ;) Lemme recatch

dvv commented 13 years ago
  info master started
  info worker 0 spawned
  info worker 1 spawned
  info worker 0 connected
Error: EINVAL, Invalid argument
    at IOWatcher.callback (net.js:878:24)
  error worker 0 uncaught exception EINVAL, Invalid argument
  warning worker 0 died
  info worker 0 spawned
  info worker 1 connected
  info listening for connections
  info worker 0 connected
  info shutting down
  warning kill(SIGKILL)
  warning worker 0 died
  warninf worker 1 died
  info shutdown complete
  debug - exit
TheDeveloper commented 13 years ago

I am also experiencing this issue. It seems that when the workers are spawned, one of them dies initially but then survives when it respawns and everything continues to operate normally. It also intermittently works without errors, but usually, there is an EINVAL, Invalid Argument error for at least one of the workers, causing it to respawn.

I'm on x64 Ubuntu 9.04 LTS with 2 cores using node v0.4.2 and cluster 0.3.3

tj commented 13 years ago

almost makes me think that "listening" has improperly been emitted on Ubuntu, although I am not sure if this would be node, libev etc. I bet if you do a setTimeout() delaying the accept() calls it would work, if so I think this might be a node bug

dvv commented 13 years ago

which line to try to wrap by setTimeout()?

weepy commented 13 years ago

any ideas on this one ?

dvv commented 13 years ago

Suspended usage so far. Though tend to treat the bug as not hard one -- eventually, all required workers do get started :)

bkw commented 13 years ago

+1

Linux 2.6.31-302-ec2 #7-Ubuntu SMP Tue Oct 13 19:55:22 UTC 2009 x86_64 GNU/Linux

Current node from github (4a9f2de956)

Error: EINVAL, Invalid argument
    at IOWatcher.callback (net.js:903:24)
perezd commented 13 years ago

experiencing same issue here (Ubuntu LTS, 23bit, 1 core.)

tonymilne commented 13 years ago

Having the same problems on Ubuntu 10.04 LTS (lucid) using node 0.4.2 and cluster 0.4.2 - any help would be appreciated.

gaspo commented 13 years ago

the same problem with node 0.4.4, cluster 0.5.2 Linux 2.6.26-2-amd64 #1 SMP x86_64 GNU/Linux

Error: EINVAL, Invalid argument at IOWatcher.callback (net.js:905:24)

frank06 commented 13 years ago

Same problem on Ubuntu 10.04.1

christophsturm commented 13 years ago

it seems the problem is that the file descriptor is not valid, this is the line in net.js that triggers the exception: var peerInfo = accept(self.fd);

dvv commented 13 years ago

Sure. This means accept is called before fd has arrived to worker from IPC connection.

christophsturm commented 13 years ago

this is not in any way related to ubuntu.the testsuite also fails in arch linux.

tj commented 13 years ago

good to know, just coincidence I guess

ybogdanov commented 13 years ago

same problem on:

Linux dev2 2.6.18-194.3.1.el5.028stab069.6xen #1 SMP Wed May 26 18:35:38 MSD 2010 x86_64 x86_64 x86_64 GNU/Linux

cluster v0.5.5

bkw commented 13 years ago

So it's not limited to ubuntu after all.

3rd-Eden commented 13 years ago

Same issue here on debian 5.0

tj commented 13 years ago

wish I could reproduce this at least once, might have to fire up a VM, gah

frank06 commented 13 years ago

it's pretty easy to get those environments running with vagrant

tj commented 13 years ago

yeah i know im just lazy :D

mnutt commented 13 years ago

@dvv in lib/worker.js, around line 87, replacing the self.stdin.on('fd', ...) line with this made the problem go away for me:

var self = this;
self.stdin.on('fd', function(fd) {
  setTimeout(function() {
    self.server.listenFD(fd);
  }, 1000);
});

So yeah, @visionmedia, your assessment sounds right.

dvv commented 13 years ago

Sure. Wonder whether nextTick() will do, or really some timeout must be taken?

tj commented 13 years ago

@mnutt hmm interesting, I took a look at net.js and saw this (from listenFD): the emit() is immediate, so it's obviously expecting the syscalls to have occurred already, but since I spawn the workers before "listening" this makes perfect sense now, although kinda strange that it's fine on osx

    Server.prototype._startWatcher = function() {
      this.watcher.set(this.fd, true, false);
      this.watcher.start();
      this.emit('listening');
    };

we could try just spawning the workers after listening, should still be fine

lackac commented 13 years ago

We experienced the same issue this weekend on our staging server. Additionally, the cluster was being monitored by god which caused a never-ending respawn cycle. The circumstances in which this issue occurred is not clear to me though.

tj commented 13 years ago

@lackac yeah we need a better way to detect this, currently I have a threshold on boot of 60 seconds (in NODE_ENV=production), within which cluster will consider it a cyclic restart since it should not be restarting that soon, and will delay the restart for another 60 seconds to at least provide some sort of control instead of recursively spawning as fast as possible

that being said I made some tweeks to cluster's internals that should fix this EINVAL problem so the next release might help

rgabo commented 13 years ago

+1 for next release to fix this

tj commented 13 years ago

releasing riiiight now

lackac commented 13 years ago

@visionmedia we upgraded on our server to 0.6.1 after you released it and haven't seen this issue come up since. Thanks for the fix.

tj commented 13 years ago

@lackac awesome sounds good, I will close this for now

diversario commented 13 years ago

I started using cluster.exception and mailbox is flooded with EINVALs. It wasn't happening before today, I was on 0.6.3. Started for seemingly no reason. I upgraded to 0.6.5 and still getting exceptions.

I'm on node 0.4.9, Ubuntu 11.04.

The error is:

Error: EINVAL, Invalid argument
    at IOWatcher.callback (net.js:916:24)
tj commented 13 years ago

damn! I found some more bugs so I'm releasing another patch-level soon

diversario commented 13 years ago

Updated to 0.6.8, so far no emails from exception, I'm cautiously optimistic :) Gonna report back by the end of the day or if I get those emails before that.

tj commented 13 years ago

haha ok great, sounds good

diversario commented 13 years ago

Looks like EINVAL issue is solved. Or at least something else changed and it doesn't trigger, but I didn't really do anything. So, thanks! :)

tj commented 13 years ago

interesting, that's good though :D