LearnBoost / cluster

Node.JS multi-core server manager with plugins support.
http://learnboost.github.com/cluster
MIT License

Cluster 0.6.6 and up: Workers die on startup in unix_dgram bind #129

Closed by brettkiefer 13 years ago

brettkiefer commented 13 years ago

Since checkin b72287be98a7e32394533525b9f2f91cc78ece03 (to close #126), it looks like starting Cluster with more than a few workers causes some to die and restart.

Repro:
npm install cluster
cd node_modules/cluster
edit test.js to start 20 workers instead of 4
node test.js

Expected: 20 workers start

Observed: Several workers start, then a bunch fail, and the server shuts down:

$ node test.js
  info - master started
  info - worker 0 spawned
  info - worker 1 spawned
  info - worker 2 spawned
  info - worker 3 spawned
  info - worker 4 spawned
  info - worker 5 spawned
Error in unix_dgram bind of /tmp/cluster.51345.client.sock
Error in unix_dgram bind of /tmp/cluster.51345.client.sock
  info - worker 6 spawned
  info - worker 7 spawned
  info - worker 8 spawned
  info - worker 9 spawned
  info - worker 10 spawned
  info - worker 11 spawned
  info - worker 12 spawned
  info - worker 13 spawned
Error: EADDRINUSE, Address already in use
    at dgram.js:125:19
Error: EADDRINUSE, Address already in use
    at dgram.js:125:19
Error: EADDRINUSE, Address already in use
    at dgram.js:125:19
Error: EADDRINUSE, Address already in use
    at dgram.js:125:19
Error in unix_dgram bind of /tmp/cluster.51345.client.sock
Error: EADDRINUSE, Address already in use
    at dgram.js:125:19
Error: EADDRINUSE, Address already in use
    at dgram.js:125:19
Error in unix_dgram bind of /tmp/cluster.51345.client.sock
Error: EADDRINUSE, Address already in use
    at dgram.js:125:19
Error: EADDRINUSE, Address already in use
    at dgram.js:125:19
  info - worker 14 spawned
  info - worker 15 spawned
Error in unix_dgram bind of /tmp/cluster.51345.client.sock
Error: EADDRINUSE, Address already in use
    at dgram.js:125:19
Error: EADDRINUSE, Address already in use
    at dgram.js:125:19
Error in unix_dgram bind of /tmp/cluster.51345.client.sock
Error: EADDRINUSE, Address already in use
    at dgram.js:125:19
Error: EADDRINUSE, Address already in use
    at dgram.js:125:19
Error in unix_dgram bind of /tmp/cluster.51345.client.sock
Error: EADDRINUSE, Address already in use
    at dgram.js:125:19
Error: EADDRINUSE, Address already in use
    at dgram.js:125:19
  info - worker 16 spawned
  info - worker 17 spawned
Error in unix_dgram bind of /tmp/cluster.51345.client.sock
Error: EADDRINUSE, Address already in use
    at dgram.js:125:19
Error: EADDRINUSE, Address already in use
    at dgram.js:125:19
  info - worker 18 spawned
  info - worker 19 spawned
  info - listening for connections
  warning - worker 0 died
  info - worker 0 spawned
Error in unix_dgram bind of /tmp/cluster.51345.client.sock
Error: EADDRINUSE, Address already in use
    at dgram.js:125:19
Error: EADDRINUSE, Address already in use
    at dgram.js:125:19
  warning - worker 1 died
  info - worker 1 spawned
  warning - worker 2 died
  info - worker 2 spawned
  warning - worker 5 died
  info - worker 5 spawned
  warning - worker 9 died
Error in unix_dgram bind of /tmp/cluster.51345.client.sock
Error: EADDRINUSE, Address already in use
    at dgram.js:125:19
Error: EADDRINUSE, Address already in use
    at dgram.js:125:19
  info - worker 9 spawned
  warning - worker 10 died
  info - worker 10 spawned
  warning - worker 11 died
  info - worker 11 spawned
  info - worker 3 connected
  info - worker 8 connected
  info - worker 6 connected
  info - worker 14 connected
  info - worker 12 connected
  info - worker 13 connected
  info - worker 15 connected
  info - worker 18 connected
  info - worker 17 connected
  info - worker 19 connected
  warning - worker 4 died
  info - worker 4 spawned
  warning - worker 7 died
  info - worker 7 spawned
^C  warning - worker 16 died
  info - worker 16 spawned
  info - shutting down
  warning - kill(SIGKILL)
  info - shutdown complete
  warning - worker 0 died
  warning - worker 1 died
  warning - worker 3 died
  warning - worker 6 died
  warning - worker 8 died
  warning - worker 9 died
  warning - worker 12 died
  warning - worker 13 died
  warning - worker 14 died

Cluster detected over 20 worker deaths in the first
20 seconds of life, there is most likely
a serious issue with your server.

aborting.

brettkiefer commented 13 years ago

Reproduced on Ubuntu Linux and FreeBSD 8.2

tj commented 13 years ago

Still an issue? I had this with the tests when old socket files were lying around, but I'm pretty sure node unlink()s them first anyway.
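
The stale-socket cleanup mentioned here can be sketched as below. This is only an illustration, assuming the cleanup is done by hand before binding: removeStaleSocket is a hypothetical helper name, not a function from Cluster or Node, and whether Node unlinks the file itself is not confirmed in this thread.

```javascript
const fs = require('fs');

// Hypothetical helper (not part of Cluster or Node): remove a leftover
// socket file so that a later bind to the same path is not blocked by it.
function removeStaleSocket(sockPath) {
  try {
    fs.unlinkSync(sockPath);
    return true; // a leftover file existed and was removed
  } catch (err) {
    if (err.code === 'ENOENT') return false; // nothing to clean up
    throw err; // surface permission problems and other errors
  }
}
```

Calling it with a path such as /tmp/cluster.51345.client.sock before binding would rule out leftover files as the cause of EADDRINUSE.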

tj commented 13 years ago

Oh, my bad, I think I know what it is.

brettkiefer commented 13 years ago

Yes, I'm still seeing this in 0.6.7.

tj commented 13 years ago

The workers were all trying to bind to the same client socket path because of a typo that used the master's PID in the path. That should be irrelevant now, though, because I just removed the client socket altogether; it's not needed right now.
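
A minimal sketch of why that typo produces EADDRINUSE, assuming the path layout seen in the log (/tmp/cluster.<pid>.client.sock). The helper name clientSockPath and the worker PIDs are made up for illustration, and the per-worker variant is an assumption about what was intended; the actual fix removed the client socket entirely.

```javascript
// Hypothetical helper, not Cluster's actual source: builds a client
// socket path in the format seen in the log output.
function clientSockPath(dir, pid) {
  return dir + '/cluster.' + pid + '.client.sock';
}

const masterPid = 51345;                   // PID that appears in the log
const workerPids = [51346, 51347, 51348];  // made-up worker PIDs

// Buggy variant: every worker derives its path from the master's PID,
// so all workers try to bind the same file.
const buggy = workerPids.map(() => clientSockPath('/tmp', masterPid));

// Assumed intent: each worker derives the path from its own PID.
const perWorker = workerPids.map((pid) => clientSockPath('/tmp', pid));

const uniqueCount = (paths) => new Set(paths).size;
console.log(uniqueCount(buggy));     // 1
console.log(uniqueCount(perWorker)); // 3
```

With a single shared path, only the first worker to bind can succeed, which matches the repeated bind errors on /tmp/cluster.51345.client.sock in the report.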

brettkiefer commented 13 years ago

That does the trick for me.

brettkiefer commented 13 years ago

Actually, wait: when I install 0.6.8 clean and run 'node test.js', I get this error:


Error in unix_dgram bind of test/cluster.51810.server.sock
Error: ENOENT, No such file or directory
    at dgram.js:125:19
  warning - kill(SIGKILL)
Error: ENOENT, No such file or directory
    at dgram.js:125:19

tj commented 13 years ago

Ignore test.js; it should be npmignored. It's just a file I use to test arbitrary things. Try make test (or just make).

brettkiefer commented 13 years ago

Ok. Yep, looks good in actual use. Thanks!