celluloid / dcell

UNMAINTAINED: See celluloid/celluloid#779 - Actor-based distributed objects in Ruby based on Celluloid and 0MQ
http://celluloid.io
MIT License

Communication with <node> interrupted #19

Closed TvL2386 closed 12 years ago

TvL2386 commented 12 years ago

Hi,

I'm running 3 ubuntu 12.04 amd64 nodes. When starting dcell in an irb session, I get communication interrupted. This does not seem good...

#first node started:
1.9.3p194 :001 > require 'dcell'
 => true
1.9.3p194 :002 > DCell.start addr: 'tcp://10.0.0.14:2042'
I, [2012-06-01T17:26:57.145755 #3691]  INFO -- : Connected to ubu02
 => #<Celluloid::Supervisor(DCell::Group):0xa952cc>
I, [2012-06-01T17:28:22.362609 #3691]  INFO -- : Found node ubu01
I, [2012-06-01T17:28:27.368780 #3691]  INFO -- : Connected to ubu01
I, [2012-06-01T17:29:43.682388 #3691]  INFO -- : Found node tole
I, [2012-06-01T17:29:48.701250 #3691]  INFO -- : Connected to tole
W, [2012-06-01T17:30:13.721359 #3691]  WARN -- : Communication with tole interrupted
W, [2012-06-01T17:30:13.722086 #3691]  WARN -- : Communication with ubu01 interrupted
I, [2012-06-01T17:30:18.746543 #3691]  INFO -- : Connected to tole
I, [2012-06-01T17:30:18.747069 #3691]  INFO -- : Connected to ubu01
W, [2012-06-01T17:30:37.600724 #3691]  WARN -- : Communication with ubu01 interrupted
W, [2012-06-01T17:30:38.769308 #3691]  WARN -- : Communication with tole interrupted
I, [2012-06-01T17:30:38.774246 #3691]  INFO -- : Connected to tole
I, [2012-06-01T17:30:38.774640 #3691]  INFO -- : Connected to ubu01
W, [2012-06-01T17:30:52.644008 #3691]  WARN -- : Communication with ubu01 interrupted
I, [2012-06-01T17:30:52.671897 #3691]  INFO -- : Connected to ubu01
W, [2012-06-01T17:31:02.674378 #3691]  WARN -- : Communication with ubu01 interrupted
W, [2012-06-01T17:31:03.795929 #3691]  WARN -- : Communication with tole interrupted
I, [2012-06-01T17:31:08.814475 #3691]  INFO -- : Connected to tole
I, [2012-06-01T17:31:08.815058 #3691]  INFO -- : Connected to ubu01

#second node:
1.9.3p194 :001 > require 'dcell'
 => true
1.9.3p194 :002 > DCell.start addr: 'tcp://10.0.0.10:2043', directory: { id: 'ubu02', addr: 'tcp://10.0.0.14:2042' }
I, [2012-06-01T17:28:17.302204 #3687]  INFO -- : Connected to ubu02
I, [2012-06-01T17:28:17.302992 #3687]  INFO -- : Connected to ubu01
 => #<Celluloid::Supervisor(DCell::Group):0x83cc14>
I, [2012-06-01T17:29:52.467079 #3687]  INFO -- : Found node tole
I, [2012-06-01T17:29:57.486929 #3687]  INFO -- : Connected to tole
W, [2012-06-01T17:30:07.488696 #3687]  WARN -- : Communication with ubu02 interrupted
W, [2012-06-01T17:30:07.489085 #3687]  WARN -- : Communication with tole interrupted
I, [2012-06-01T17:30:07.509395 #3687]  INFO -- : Connected to ubu02
I, [2012-06-01T17:30:07.510559 #3687]  INFO -- : Connected to tole
W, [2012-06-01T17:30:23.741460 #3687]  WARN -- : Communication with ubu02 interrupted
W, [2012-06-01T17:30:23.742249 #3687]  WARN -- : Communication with tole interrupted
I, [2012-06-01T17:30:23.751953 #3687]  INFO -- : Connected to ubu02
I, [2012-06-01T17:30:23.753235 #3687]  INFO -- : Connected to tole
W, [2012-06-01T17:30:33.755624 #3687]  WARN -- : Communication with tole interrupted
I, [2012-06-01T17:30:33.772550 #3687]  INFO -- : Connected to tole

#third node:
1.9.3p194 :001 > require 'dcell'
 => true
1.9.3p194 :002 > DCell.start addr: 'tcp://10.0.0.6:2044', directory: { id: 'ubu02', addr: 'tcp://10.0.0.14:2042' }
I, [2012-06-01T17:29:38.626024 #4828]  INFO -- : Connected to ubu02
I, [2012-06-01T17:29:38.626869 #4828]  INFO -- : Connected to tole
 => #<Celluloid::Supervisor(DCell::Group):0x1567768>
1.9.3p194 :003 > I, [2012-06-01T17:30:02.639891 #4828]  INFO -- : Found node ubu01
I, [2012-06-01T17:30:02.677549 #4828]  INFO -- : Connected to ubu01
W, [2012-06-01T17:30:22.513723 #4828]  WARN -- : Communication with ubu02 interrupted
I, [2012-06-01T17:30:22.531024 #4828]  INFO -- : Connected to ubu02
W, [2012-06-01T17:30:27.566748 #4828]  WARN -- : Communication with ubu01 interrupted
W, [2012-06-01T17:30:32.532805 #4828]  WARN -- : Communication with ubu02 interrupted
I, [2012-06-01T17:30:32.551753 #4828]  INFO -- : Connected to ubu02
I, [2012-06-01T17:30:32.552106 #4828]  INFO -- : Connected to ubu01
W, [2012-06-01T17:30:47.618859 #4828]  WARN -- : Communication with ubu01 interrupted
W, [2012-06-01T17:30:47.619192 #4828]  WARN -- : Communication with ubu02 interrupted
I, [2012-06-01T17:30:47.647090 #4828]  INFO -- : Connected to ubu02
I, [2012-06-01T17:30:47.647379 #4828]  INFO -- : Connected to ubu01
W, [2012-06-01T17:30:57.648953 #4828]  WARN -- : Communication with ubu02 interrupted
W, [2012-06-01T17:30:57.649156 #4828]  WARN -- : Communication with ubu01 interrupted
I, [2012-06-01T17:30:57.670792 #4828]  INFO -- : Connected to ubu02
I, [2012-06-01T17:30:57.671186 #4828]  INFO -- : Connected to ubu01

I'm running the following versions:

root@ubu02:~# gem list | egrep 'dcell|zmq'
celluloid-zmq (0.10.0)
dcell (0.10.0)
ffi-rzmq (0.9.3)
root@ubu02:~# dpkg --list | grep zmq
ii  libzmq-dev                      2.1.11-1ubuntu1            ZeroMQ lightweight messaging kernel (development libraries and header files)
ii  libzmq1                         2.1.11-1ubuntu1            ZeroMQ lightweight messaging kernel (shared library)
root@ubu02:~# ruby -v
ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux]

TvL2386 commented 12 years ago

FYI: When I just have 2 nodes, everything seems to work as I would expect. The info service works perfectly!

tarcieri commented 12 years ago

It looks like you're not giving each of the nodes a unique ID. Perhaps I can try to derive the node ID from the address if it isn't given.

TvL2386 commented 12 years ago

It doesn't matter whether I give all nodes a unique ID or not. The result is the same.

TvL2386 commented 12 years ago

The id is the same as each host name if you don't specify it

tarcieri commented 12 years ago

So you're really running into this problem even if every node has a unique node ID?

TvL2386 commented 12 years ago

yep:

# node1
1.9.3p194 :001 > require 'dcell'
 => true 
1.9.3p194 :002 > DCell.start :id => "node1", :addr => "tcp://127.0.0.1:2042",
1.9.3p194 :003 >     :registry => {
1.9.3p194 :004 >         :adapter => 'redis',
1.9.3p194 :005 >         :host    => '127.0.0.1',
1.9.3p194 :006 >         :port    => 6379
1.9.3p194 :007?>     }
I, [2012-06-03T08:02:18.806961 #3410]  INFO -- : Connected to node1
 => #<Celluloid::Supervisor(DCell::Group):0xd36cb0> 
1.9.3p194 :008 > I, [2012-06-03T08:02:58.123416 #3410]  INFO -- : Found node node66
I, [2012-06-03T08:03:03.128523 #3410]  INFO -- : Connected to node66
I, [2012-06-03T08:03:28.393518 #3410]  INFO -- : Found node node67
I, [2012-06-03T08:03:33.402549 #3410]  INFO -- : Connected to node67
W, [2012-06-03T08:03:48.198882 #3410]  WARN -- : Communication with node66 interrupted
I, [2012-06-03T08:03:48.211829 #3410]  INFO -- : Connected to node66
W, [2012-06-03T08:04:03.222294 #3410]  WARN -- : Communication with node66 interrupted
W, [2012-06-03T08:04:03.222624 #3410]  WARN -- : Communication with node67 interrupted
I, [2012-06-03T08:04:03.240529 #3410]  INFO -- : Connected to node66
I, [2012-06-03T08:04:03.240930 #3410]  INFO -- : Connected to node67
W, [2012-06-03T08:04:23.481235 #3410]  WARN -- : Communication with node66 interrupted
W, [2012-06-03T08:04:23.481465 #3410]  WARN -- : Communication with node67 interrupted
I, [2012-06-03T08:04:28.295843 #3410]  INFO -- : Connected to node66
I, [2012-06-03T08:04:28.296402 #3410]  INFO -- : Connected to node67
W, [2012-06-03T08:04:38.496951 #3410]  WARN -- : Communication with node67 interrupted
W, [2012-06-03T08:04:43.307204 #3410]  WARN -- : Communication with node66 interrupted

# node66
1.9.3p194 :001 > require 'dcell'
 => true 
1.9.3p194 :002 > DCell.start :id => "node66", :addr => "tcp://127.0.0.1:2066",
1.9.3p194 :003 >     :directory => {
1.9.3p194 :004 >         :id   => 'node1',
1.9.3p194 :005 >         :addr => 'tcp://127.0.0.1:2042'
1.9.3p194 :006?>     }
I, [2012-06-03T08:02:53.112467 #3437]  INFO -- : Connected to node1
I, [2012-06-03T08:02:53.113040 #3437]  INFO -- : Connected to node66
 => #<Celluloid::Supervisor(DCell::Group):0xcf91a8> 
1.9.3p194 :007 > W, [2012-06-03T08:03:38.891409 #3437]  WARN -- : Communication with node1 interrupted
I, [2012-06-03T08:03:43.908094 #3437]  INFO -- : Found node node67
I, [2012-06-03T08:03:43.908460 #3437]  INFO -- : Connected to node1
I, [2012-06-03T08:03:48.431099 #3437]  INFO -- : Connected to node67
W, [2012-06-03T08:03:58.432590 #3437]  WARN -- : Communication with node67 interrupted
I, [2012-06-03T08:03:58.453965 #3437]  INFO -- : Connected to node67
W, [2012-06-03T08:04:18.475492 #3437]  WARN -- : Communication with node67 interrupted
I, [2012-06-03T08:04:18.484619 #3437]  INFO -- : Connected to node67
W, [2012-06-03T08:04:48.996539 #3437]  WARN -- : Communication with node1 interrupted

# node67
1.9.3-p194 :002 > require 'dcell'
 => true 
1.9.3-p194 :003 > DCell.start :id => "node67", :addr => "tcp://127.0.0.1:2067",
1.9.3-p194 :004 >     :directory => {
1.9.3-p194 :005 >         :id   => 'node1',
1.9.3-p194 :006 >         :addr => 'tcp://127.0.0.1:2042'
1.9.3-p194 :007?>     }
I, [2012-06-03T08:03:23.381680 #3465]  INFO -- : Connected to node1
I, [2012-06-03T08:03:23.382468 #3465]  INFO -- : Connected to node67
 => #<Celluloid::Supervisor(DCell::Group):0x158914c> 
1.9.3-p194 :008 > 
1.9.3-p194 :009 >   I, [2012-06-03T08:03:33.896566 #3465]  INFO -- : Found node node66
I, [2012-06-03T08:03:38.902664 #3465]  INFO -- : Connected to node66
W, [2012-06-03T08:03:48.903876 #3465]  WARN -- : Communication with node1 interrupted
W, [2012-06-03T08:03:48.904233 #3465]  WARN -- : Communication with node66 interrupted
I, [2012-06-03T08:03:58.230268 #3465]  INFO -- : Connected to node1
I, [2012-06-03T08:03:58.230809 #3465]  INFO -- : Connected to node66
W, [2012-06-03T08:04:28.974255 #3465]  WARN -- : Communication with node1 interrupted
W, [2012-06-03T08:04:33.287819 #3465]  WARN -- : Communication with node66 interrupted
I, [2012-06-03T08:04:33.990289 #3465]  INFO -- : Connected to node1
I, [2012-06-03T08:04:33.990845 #3465]  INFO -- : Connected to node66
W, [2012-06-03T08:04:53.326536 #3465]  WARN -- : Communication with node1 interrupted

I've tried rbx-2.0.testing just for fun and it does exactly the same.

TvL2386 commented 12 years ago

as soon as there are more than 2 nodes, the communication interruptions start.
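
For anyone comparing runs, the drops can be tallied from logs like the ones pasted above. This is a minimal sketch and not part of DCell: the regex assumes Ruby's default Logger line format and the two message strings exactly as they appear in this thread, and `downtime_by_node` is a hypothetical helper name.

```ruby
require 'time'

# Assumed line shape (copied from the logs in this thread):
# W, [2012-06-03T08:03:48.198882 #3410]  WARN -- : Communication with node66 interrupted
# I, [2012-06-03T08:03:48.211829 #3410]  INFO -- : Connected to node66
LOG_LINE = /\[(\S+) #\d+\]\s+(?:INFO|WARN) -- : (?:Connected to|Communication with) (\S+)( interrupted)?/

# Returns { node_id => { :drops => count, :seconds => total time until reconnect } }.
def downtime_by_node(log_text)
  dropped_at = {}
  totals = Hash.new { |hash, node| hash[node] = { :drops => 0, :seconds => 0.0 } }

  log_text.each_line do |line|
    match = LOG_LINE.match(line)
    next unless match # skips "Found node ..." and irb output lines

    at   = Time.parse(match[1])
    node = match[2]

    if match[3]
      # "Communication with <node> interrupted"
      dropped_at[node] = at
      totals[node][:drops] += 1
    elsif dropped_at.key?(node)
      # "Connected to <node>" following an earlier interruption
      totals[node][:seconds] += at - dropped_at.delete(node)
    end
  end

  totals
end
```

Calling `downtime_by_node(File.read(path))` on a saved log (the path is whatever file the output was captured to) gives a per-node count, which makes it easier to see whether the drops follow a fixed interval.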

tarcieri commented 12 years ago

You're doing this all from irb... there are known issues with this and readline blocking every thread.

Can you try any of the following:

1) Disabling readline by putting IRB.conf[:USE_READLINE] = false in .irbrc
2) Putting your code in Ruby scripts instead of using irb
3) Using JRuby which doesn't have the readline-related problems
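
A script version of the irb sessions above might look like the following. It is only a sketch: the option names (:id, :addr, :directory) are copied from the sessions earlier in this thread, while the file name, the `node_options` helper, and the choice of node1 on port 2042 as the seed are assumptions made for illustration.

```ruby
# node.rb (hypothetical file name). Run as: ruby node.rb node66 2066

# Assumed seed node, matching the node1 address used earlier in this thread.
SEED = { :id => 'node1', :addr => 'tcp://127.0.0.1:2042' }

# Builds the options hash for DCell.start.
def node_options(id, port, seed = SEED)
  opts = { :id => id, :addr => "tcp://127.0.0.1:#{port}" }
  # The seed node itself is started without a :directory entry.
  opts[:directory] = seed unless id == seed[:id]
  opts
end

# Only start a node when an id and port are given on the command line.
unless ARGV.empty?
  require 'dcell'
  id, port = ARGV
  DCell.start node_options(id, port)
  sleep # block forever so the process (and the node) stays up
end
```

Starting each node in its own terminal, for example `ruby node.rb node1 2042` followed by `ruby node.rb node66 2066`, takes irb and readline out of the picture entirely.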

TvL2386 commented 12 years ago

I've put them in scripts (see https://gist.github.com/2864368)

Running them with ruby-1.9.3p194 gives the same result. Running them in separate irb sessions with --noreadline gives the same result.

Running the three scripts as follows:

rvm use jruby-1.6.7
ruby --1.9 -rrubygems nodeX.rb

gives the same result...

Regards, Tom

knewter commented 12 years ago

for what it's worth, I'm seeing this as well and I'm definitely not running my examples from irb. http://github.com/knewter/skynet <-- if you follow the README, you see this after a little bit. Of course, that's not as simple as the example of the issue given here. Still, +1

TvL2386 commented 12 years ago

it doesn't matter whether you run it from irb or not, whether you use ruby-1.9.3 or jruby-1.6.7 or rubinius-2.0.0.testing... It's all the same.

tarcieri commented 12 years ago

I will investigate this further when I have time

adamgamble commented 12 years ago

I also am having this issue.

tarcieri commented 12 years ago

I have plans to make some pretty major changes to the way DCell works in general, and will probably be shifting back onto Zookeeper by default until the gossip protocol can be more stable

jessesanford commented 12 years ago

So what is the recommended way of getting the examples up and running? Zookeeper?

tarcieri commented 12 years ago

@therealjessesanford unfortunately Zookeeper is broken at the moment, so there's not a lot to do besides wait for Zookeeper support to be fixed or submit a patch :(

klarrimore commented 12 years ago

Any news on this?

tarcieri commented 12 years ago

No, sorry, I have mostly been spending my time working on Celluloid. I had planned to pick this up after Celluloid 0.12.0, however there were enough bugs in that release I really need to get Celluloid 0.12.1 out before I can take a look at DCell again.

klarrimore commented 12 years ago

No problem, can you point me in the direction of some of the possible solutions you were thinking of? Maybe I'll take a swing at it.

tarcieri commented 12 years ago

In short: revert 9dc9245f904deccb

tarcieri commented 12 years ago

That said, your best bet for a first step would be to at least get DCell green on Celluloid 0.12.1 (unreleased at https://github.com/celluloid/celluloid master)

klarrimore commented 12 years ago

ok cool, that gives me somewhere to start

tarcieri commented 12 years ago

I reverted 9dc9245 in e3115f28. This should put us back on stable ground. I'm calling this issue solved.

klarrimore commented 12 years ago

yes, this works much better. thank you