LearnBoost / cluster

Node.JS multi-core server manager with plugins support.
http://learnboost.github.com/cluster
MIT License

unbalanced socket.io connections with flashsocket #114

Closed fabware closed 13 years ago

fabware commented 13 years ago

Hi,

It's great to see that cluster offers a way to cluster socket.io processes. I tested it with flashsocket connections. One obvious problem I found is that the connections are heavily unbalanced. Here are some stats:

[d@b ~]$ telnet 127.0.0.1 18888
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
cluster> stats()

  Master
    os: Linux 2.6.18-194.el5
    state: active
    started: Fri, 17 Jun 2011 14:13:13 GMT
    uptime: 20 hours
    restarts: 0
    workers: 12
    deaths: 0

  Resources
    load average: 6.56 6.79 6.65
    cores utilized: 12 / 16
    memory at boot (free / total): 28.49gb / 31.42gb
    memory now (free / total): 14.26gb / 31.42gb

  Workers
    connections total: 5168007
    connections active: 782549
    requests total: 33
    0: 20 hours 128735|515708|5
    1: 20 hours 13444|662313|2
    2: 20 hours 108925|419495|5
    3: 20 hours 48034|157842|1
    4: 20 hours 4665|110692|0
    5: 20 hours 11952|493073|1
    6: 20 hours 37428|115796|1
    7: 20 hours 121888|478094|1
    8: 20 hours 15864|1082414|8
    9: 20 hours 132437|561169|6
    10: 20 hours 60469|212005|2
    11: 20 hours 98708|359406|1

[d@c ~]$ telnet 127.0.0.1 18888
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
cluster> stats()

  Master
    os: Linux 2.6.18-194.el5
    state: active
    started: Sat, 18 Jun 2011 11:05:47 GMT
    uptime: 1.6 days
    restarts: 0
    workers: 6
    deaths: 3

  Resources
    load average: 0.70 0.70 0.73
    cores utilized: 6 / 8
    memory at boot (free / total): 8.57gb / 15.67gb
    memory now (free / total): 4.15gb / 15.67gb

  Workers
    connections total: 7283112
    connections active: 119267
    requests total: 78
    0: 1.6 days 1380|1544062|12
    1: 14 hours 96272|349470|5
    2: 28 minutes 13410|31661|0
    3: 1.6 days 2135|2272937|29
    4: 22 hours 4943|1867459|16
    5: 1.6 days 1127|1217523|16

From the stats shown above, active connections are much more unbalanced than total connections. Is it possible to balance connections across worker processes according to worker load?
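To put a number on the imbalance, here is a minimal sketch that parses the worker lines from the first stats() dump and compares the spread of active connections with the spread of total connections. It assumes each worker line reads "id: uptime active|total|requests"; that layout is not documented in this thread, but under that reading the columns sum to the reported totals (782549 active, 5168007 total, 33 requests). The names parseWorker, summarize and maxMinRatio are made up for this example and are not part of cluster.

```javascript
// Worker lines copied from the first stats() dump in this issue.
// Assumed column layout: "<id>: <uptime> <active>|<total>|<requests>".
const workerLines = [
  '0: 20 hours 128735|515708|5',
  '1: 20 hours 13444|662313|2',
  '2: 20 hours 108925|419495|5',
  '3: 20 hours 48034|157842|1',
  '4: 20 hours 4665|110692|0',
  '5: 20 hours 11952|493073|1',
  '6: 20 hours 37428|115796|1',
  '7: 20 hours 121888|478094|1',
  '8: 20 hours 15864|1082414|8',
  '9: 20 hours 132437|561169|6',
  '10: 20 hours 60469|212005|2',
  '11: 20 hours 98708|359406|1',
];

// Hypothetical helper: split one worker line into numeric fields.
function parseWorker(line) {
  const match = /^(\d+):\s+(.+?)\s+(\d+)\|(\d+)\|(\d+)$/.exec(line);
  if (!match) throw new Error('unrecognized worker line: ' + line);
  return {
    id: Number(match[1]),
    uptime: match[2],
    active: Number(match[3]),
    total: Number(match[4]),
    requests: Number(match[5]),
  };
}

// Ratio between the busiest and the least busy worker for one column.
function maxMinRatio(workers, key) {
  const values = workers.map((w) => w[key]);
  return Math.max(...values) / Math.min(...values);
}

// Column sums plus a simple imbalance measure for each column.
function summarize(lines) {
  const workers = lines.map(parseWorker);
  const sum = (key) => workers.reduce((acc, w) => acc + w[key], 0);
  return {
    activeTotal: sum('active'),
    connTotal: sum('total'),
    requestTotal: sum('requests'),
    activeSpread: maxMinRatio(workers, 'active'),
    totalSpread: maxMinRatio(workers, 'total'),
  };
}

const summary = summarize(workerLines);
console.log(summary);
```

A larger activeSpread than totalSpread is one way to express the observation above that active connections are spread less evenly than total connections.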

Best regards! can

tj commented 13 years ago

that's all up to the kernel

TooTallNate commented 13 years ago

i.e. the workers aren't under enough load to use the other ones. DO MOAR WORK!!

tj commented 13 years ago

moarrrr. that's a lot of connections, and not many requests

tj commented 13 years ago

also the load avg doesn't look bad relative to the number of CPUs, so I imagine that's why it's spread the way it is

fabware commented 13 years ago

Thanks for your input.

Active connections stay so high because socket.io-node does not seem to be closing them properly. That is my initial observation, but I don't have concrete numbers right now. The node may crash before it gets enough load to be taken off. I'll try the latest 0.7 and report back if I still have problems. That's another topic anyway.

I don't understand what "request" means for TCP connections. The server sees a lot of connections and disconnections. It doesn't receive messages from connected clients; all messages are pushed (broadcast) from the server, which may be why the "request" count is so low. Worker load gets much higher while a broadcast is in progress.

So load balancing is already done according to worker load? Is there any documentation on how it works?

tj commented 13 years ago

The requests shown there are probably just the pages themselves, and the rest would be socket.io. Pretty high numbers though! I haven't used cluster with socket.io myself, but that's interesting. Basically, all the workers are listening on the same socket for connections, and it's a race to accept(): whichever process is not busy and accept()s first wins. So you could have reasonably unbalanced workers, but that's usually an indication that it's really not that busy.
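The accept() race described above can be illustrated with a toy simulation. This is only a sketch, not cluster's code and not how a kernel actually picks a process: it assumes the winner is always the first idle worker in a fixed order, and that a worker stays "busy" for a fixed serviceTime after each accept. The function name and all parameters are made up for this example.

```javascript
// Toy model of workers racing to accept() on a shared listening socket.
// Assumption: among idle workers, the first one in a fixed order wins.
function simulateAcceptRace({ workers, connections, arrivalInterval, serviceTime }) {
  const busyUntil = new Array(workers).fill(0); // time at which each worker is free again
  const accepted = new Array(workers).fill(0);  // connections accepted per worker

  for (let i = 0; i < connections; i++) {
    const now = i * arrivalInterval;

    // Only workers that are not busy take part in the race.
    let winner = busyUntil.findIndex((t) => t <= now);
    if (winner === -1) {
      // Everyone is busy: the connection waits until the earliest worker frees up.
      winner = busyUntil.indexOf(Math.min(...busyUntil));
    }

    const start = Math.max(now, busyUntil[winner]);
    busyUntil[winner] = start + serviceTime;
    accepted[winner] += 1;
  }
  return accepted;
}

// Light load: each connection is handled before the next one arrives.
const lightLoad = simulateAcceptRace({
  workers: 4, connections: 1000, arrivalInterval: 10, serviceTime: 5,
});

// Heavy load: handling one connection spans several arrivals.
const heavyLoad = simulateAcceptRace({
  workers: 4, connections: 1000, arrivalInterval: 10, serviceTime: 40,
});

console.log('light load:', lightLoad);
console.log('heavy load:', heavyLoad);
```

In this model the lightly loaded run sends every connection to the same worker, while the heavily loaded run spreads them evenly, which matches the point that a skewed distribution usually means the workers are not very busy.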

fabware commented 13 years ago

Thanks for your kind explanation.

Closing this issue for now. If you guys think it is a problem, feel free to reopen it. I can help test it :)