Clear stale nodes - Githubissues

celluloid / dcell

UNMAINTAINED: See celluloid/celluloid#779 - Actor-based distributed objects in Ruby based on Celluloid and 0MQ

http://celluloid.io

MIT License

595 stars, 65 forks

Clear stale nodes #105

Open doits opened 9 years ago

doits commented 9 years ago

I've played around with DCell a little bit, but now I have this:

DCell::Node.all.length
=> 75
DCell::Node.all.map(&:addr).uniq.length
=> 60

I only have two nodes running right now, but it still lists 75 of them. It also lists multiple nodes with the same address (which shouldn't be possible, right?). Is there any way to clear stale/dead/removed nodes?
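To see which addresses are registered more than once, the node list can be grouped by address. This is only a sketch: `NodeEntry` and the sample data are hypothetical stand-ins for whatever `DCell::Node.all` returns, and `addr` is the only attribute taken from the snippet above.

```ruby
# Hypothetical stand-in for a DCell node record; only `addr` appears in the thread.
NodeEntry = Struct.new(:id, :addr)

# Returns a hash of addr => [ids] for every address registered more than once.
def duplicate_addresses(nodes)
  nodes.group_by(&:addr)
       .select { |_addr, entries| entries.size > 1 }
       .transform_values { |entries| entries.map(&:id) }
end

# Made-up sample data: two entries share one address.
sample_nodes = [
  NodeEntry.new("a", "tcp://127.0.0.1:7777"),
  NodeEntry.new("b", "tcp://127.0.0.1:7777"),
  NodeEntry.new("c", "tcp://127.0.0.1:7778"),
]

p duplicate_addresses(sample_nodes)
```

With real data, the same grouping would show which registry entries are leftovers from earlier runs on the same address.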

doits commented 9 years ago

I've also noticed that programs which used DCell hang for a really long time on exit after displaying

 DEBUG -- : Terminating 89 actors...

I flushed the Redis DB manually and things went back to normal, but shouldn't stale nodes be cleared automatically?

Asmod4n commented 9 years ago

ZeroMQ is "stateless" when it comes to connections: you can still send messages to a peer which is disconnected, and it will automatically send those messages again when the peer comes back online.

Asmod4n commented 9 years ago

But if needed one could implement a ping/pong mechanism for DCell which would disconnect inactive nodes.
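A rough idea of what such a ping/pong bookkeeping layer might look like, independent of DCell and ZeroMQ. `PongTracker` and its methods are hypothetical names, not part of DCell; the sketch only tracks when each node last answered and drops the ones that have gone silent.

```ruby
# Hypothetical registry: tracks the last pong seen per node and drops silent ones.
class PongTracker
  def initialize(timeout:)
    @timeout = timeout   # seconds without a pong before a node counts as inactive
    @last_pong = {}      # node id => time of the last pong
  end

  # Record a pong from a node (also registers the node on first contact).
  def pong(node_id, now = Time.now)
    @last_pong[node_id] = now
  end

  # Remove nodes whose last pong is older than the timeout; returns their ids.
  def prune(now = Time.now)
    stale = @last_pong.select { |_id, seen| now - seen > @timeout }.keys
    stale.each { |id| @last_pong.delete(id) }
    stale
  end

  # Ids of the nodes that are still considered connected.
  def active
    @last_pong.keys
  end
end

tracker = PongTracker.new(timeout: 10)
t0 = Time.at(0)
tracker.pong("node-a", t0)
tracker.pong("node-b", t0 + 8)
p tracker.prune(t0 + 15) # node-a has been silent for 15s, node-b for 7s
p tracker.active
```

A real implementation would also have to send the pings and close the matching sockets when a node is pruned; that part is left out here.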

doits commented 9 years ago

At least it should not hang (on termination or sending messages to nodes) when a lot of stale nodes are present.

Asmod4n commented 9 years ago

One would have to set the sndtime to 0 for each ZeroMQ socket on shutdown so that it discards all remaining messages.

doits commented 9 years ago

Yeah, that's a good idea: if messages remain on shutdown, output a warning and discard them after waiting, for example, 10 seconds (user configurable).

A configurable timeout for when a node hangs would also be great. For example, when I try DCell::Node['which_is_dead'].all, it hangs for a really long time. It should throw an exception after a user-configurable time (or, if it already does so after a very long time, that time should be configurable :-))
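As an illustration of the requested behaviour, a caller-side wrapper built on Ruby's standard `Timeout` module could look roughly like this. `NodeTimeoutError` and `with_node_timeout` are hypothetical names, not DCell API, and the `sleep` merely stands in for a call to an unresponsive node.

```ruby
require "timeout"

# Hypothetical error raised when a remote node does not answer in time.
class NodeTimeoutError < StandardError; end

# Runs the block and raises NodeTimeoutError if it takes longer than `seconds`.
def with_node_timeout(seconds)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  raise NodeTimeoutError, "node did not respond within #{seconds}s"
end

begin
  with_node_timeout(0.1) { sleep 1 } # stands in for a call to a dead node
rescue NodeTimeoutError => e
  puts e.message
end
```

`Timeout.timeout` interrupts the block from another thread, which is not safe for every kind of code, so a timeout built into the library's own request handling would be the more robust option.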

niamster commented 9 years ago

@doits it's already like this in master. Dead nodes are not taken into account (though they are still present in the DB).

tarcieri commented 9 years ago

At one point nodes healthchecked other nodes and marked them down if they didn't get responses. Did that get lost along the way?

niamster commented 9 years ago

@tarcieri @doits in current master there are 3 ways to bypass dead nodes:

1. node#ping(timeout) checks whether a node is alive before you try to touch it
2. a periodic heartbeat interrupts requests to nodes that have died in the meantime (10 sec by default)
3. node lifebeat: the client won't try to connect to a node that hasn't updated its status within some timeout (20 sec by default)
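The lifebeat idea can be pictured as a freshness check on each node's last status update. The sketch below is not DCell's implementation: `NodeStatus`, `alive?` and `alive_nodes` are hypothetical names, and only the 20 second default comes from the comment above.

```ruby
# Hypothetical status record; DCell's real storage format is not shown in the thread.
NodeStatus = Struct.new(:id, :updated_at)

LIFEBEAT_TIMEOUT = 20 # seconds, matching the default mentioned above

# A node counts as alive if it refreshed its status within the timeout.
def alive?(status, now: Time.now, timeout: LIFEBEAT_TIMEOUT)
  now - status.updated_at <= timeout
end

# Keep only the nodes a client should still try to connect to.
def alive_nodes(statuses, now: Time.now, timeout: LIFEBEAT_TIMEOUT)
  statuses.select { |s| alive?(s, now: now, timeout: timeout) }
end

# Made-up sample data: one recent update, one old one.
current = Time.at(1_000)
sample_statuses = [
  NodeStatus.new("fresh", current - 5),
  NodeStatus.new("stale", current - 60),
]
p alive_nodes(sample_statuses, now: current).map(&:id)
```

Under this scheme stale entries can stay in the database without harm, since clients simply skip them.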

If you access an actor by ID (without specifying the node), you get all actors with the requested ID from all alive nodes: scratchy example

doits commented 9 years ago

I switched to master and things run much more smoothly now. I haven't had enough time to test it properly, though, so maybe I can say more tomorrow. Thanks for the explanation!