celluloid / dcell

UNMAINTAINED: See celluloid/celluloid#779 - Actor-based distributed objects in Ruby based on Celluloid and 0MQ
http://celluloid.io
MIT License

Block when accessing actor from foreign node #64

Closed HoneyryderChuck closed 9 years ago

HoneyryderChuck commented 10 years ago

I'm using DCell with the Redis registry. I started two scripts, and both nodes registered themselves in Redis on their first run. Through the registry they share their addresses and can communicate.

The problem is that if I stop one script, every time the running script tries to access it, the call blocks.

DCell::Node["app"] 
#=> #<DCell::Node[app] @addr="tcp://127.0.0.1:8888"> 
# Important: the node is still registered, but since its script is not running, no actor is reachable.
DCell::Node["app"]["time_server"]
#=> Connected to "app"
......

It seems that this happens because the stopped script has registered itself in redis the first time it ran, but it never unregistered itself when it stopped. Hence, the second script still gets the reference to its address.

So the question is: is there a way for a node to unregister itself from the cluster when its process stops? Or some other way to prevent the block from happening?

tarcieri commented 10 years ago

The unregistering issue is a bug/absent feature of the Redis driver. It should set a TTL on the entries it puts in there so they auto-expire unless refreshed, and get refreshed at a periodic interval. But that's a lot of work for a driver that's not intended for production use. Feel free to implement it, but I'd suggest just using Zookeeper.
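A minimal sketch of the TTL-plus-refresh idea described above, not the actual DCell Redis driver. To stay runnable without a Redis server it uses an in-memory stand-in whose `set(key, value, ex:)` mirrors the shape of redis-rb's `set`; the class names `FakeExpiringStore` and `ExpiringRegistration` and the `nodes:` key prefix are assumptions made for illustration.

```ruby
# In-memory stand-in for a key-value store with expiry, so the sketch runs
# without a Redis server. Its #set mirrors the shape of redis-rb's
# `set(key, value, ex: seconds)`; everything else here is hypothetical.
class FakeExpiringStore
  def initialize(clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @clock = clock
    @data = {}
  end

  # Stores the value together with the time at which it stops being valid.
  def set(key, value, ex:)
    @data[key] = [value, @clock.call + ex]
    "OK"
  end

  # Returns nil for missing keys and for keys whose TTL has elapsed.
  def get(key)
    value, expires_at = @data[key]
    return nil if value.nil? || @clock.call >= expires_at
    value
  end
end

# Hypothetical helper: writes a node's address with a TTL and rewrites it
# whenever #refresh is called. A real driver would call #refresh from a timer.
class ExpiringRegistration
  def initialize(store, node_id, addr, ttl: 10)
    @store, @key, @addr, @ttl = store, "nodes:#{node_id}", addr, ttl
  end

  def refresh
    @store.set(@key, @addr, ex: @ttl)
  end
end

now = 0.0
store = FakeExpiringStore.new(clock: -> { now })
registration = ExpiringRegistration.new(store, "app", "tcp://127.0.0.1:8888", ttl: 10)

registration.refresh
now = 5.0
puts store.get("nodes:app").inspect  # looked up before the TTL elapses
now = 11.0
puts store.get("nodes:app").inspect  # looked up after the TTL, with no refresh
```

With this shape, a node that stops simply stops refreshing, and its entry disappears on its own once the TTL runs out.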

In general, if a node goes down, RPCs will block. DCell implements a timeout mechanism to eventually time them out when this happens, but otherwise it's expected behavior.

HoneyryderChuck commented 10 years ago

Ok, I'll go with the suggestion. So Zookeeper does the deregistering out-of-the-box, right? I don't need to add a new setting, do I?

About the timeout mechanism, is that an existing feature? I don't see it documented anywhere. When I access a node that is unreachable, it just blocks indefinitely; no timeout is triggered.

tarcieri commented 10 years ago

It was implemented at one point. Now I'm having trouble finding the code :cry:

HoneyryderChuck commented 10 years ago

https://github.com/celluloid/dcell/blob/master/lib/dcell/node.rb#L73-L75

If it was, it isn't anymore, it appears. Here the node is sending a message and blocking on the receive. I think the receive API supports a timeout parameter, right? I think this could be implemented either with a default timeout or parameter on DCell.start. What do you think?
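As a rough illustration of the timeout idea, here is a sketch that bounds a blocking receive using only Ruby's standard library. A plain `Queue` stands in for the actor's mailbox, and `receive_with_timeout` and `RemoteCallTimeout` are hypothetical names, not part of DCell or Celluloid.

```ruby
require "timeout"

# Hypothetical error class for this sketch; not part of DCell.
class RemoteCallTimeout < StandardError; end

# Waits for a reply on `mailbox` (a plain Queue standing in for the actor's
# mailbox) and raises instead of blocking forever.
def receive_with_timeout(mailbox, seconds)
  Timeout.timeout(seconds) { mailbox.pop }
rescue Timeout::Error
  raise RemoteCallTimeout, "no reply within #{seconds}s"
end

# A reply that arrives in time is returned as usual.
live = Queue.new
Thread.new { sleep 0.05; live << :pong }
puts receive_with_timeout(live, 1).inspect

# A dead peer never replies, so the call raises after the timeout.
dead = Queue.new
begin
  receive_with_timeout(dead, 0.1)
rescue RemoteCallTimeout => e
  puts "timed out: #{e.message}"
end
```

The timeout value could come from a default or from an option passed at startup, as suggested above; the sketch leaves that choice to the caller.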

tarcieri commented 10 years ago

Timeouts are probably a good start

niamster commented 9 years ago

This was reworked a bit in pull request #88; I believe it will solve this.

niamster commented 9 years ago

In current master you can ping a remote node (and get its latency) before starting any communication, though this has to be done explicitly. Also, when a local instance of a node (bound to a remote one) is created, it checks the current value of the TTL. If the remote node hasn't updated it within a grace period, an exception is raised and creation is aborted. Likewise, if the remote node goes down and doesn't answer within the grace period, the client raises an exception.
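The grace-period check can be pictured roughly as follows. This is only a sketch of the idea, not the code in master: `ensure_alive!`, `DeadNodeError`, and the way the last heartbeat time is passed in are all assumptions made for illustration.

```ruby
# Hypothetical error class and helper for this sketch; not DCell's actual API.
class DeadNodeError < StandardError; end

# Raises unless the remote node refreshed its heartbeat within `grace` seconds.
def ensure_alive!(node_id, last_heartbeat_at, grace:, now: Time.now)
  age = now - last_heartbeat_at
  raise DeadNodeError, "#{node_id} last seen #{age.round(1)}s ago" if age > grace
  true
end

now = Time.now

# A node that refreshed 2 seconds ago is within a 5 second grace period.
ensure_alive!("app", now - 2, grace: 5, now: now)

# A node that has been silent for 30 seconds is rejected up front,
# so the caller gets an exception instead of a blocked call.
begin
  ensure_alive!("app", now - 30, grace: 5, now: now)
rescue DeadNodeError => e
  puts "refusing to connect: #{e.message}"
end
```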