edwardcapriolo / gossip

A Mavenized Apache V2 gossip implementation for Java
Apache License 2.0
160 stars, 54 forks

Dead members coming alive again are soon wrongly recognized as dead again #15

Closed chbndrhnns closed 8 years ago

chbndrhnns commented 8 years ago

Hey, when I ran tests with multiple clients on my local machine I discovered this behavior:

Has anyone observed similar behavior, and is there a fix for it?

edwardcapriolo commented 8 years ago

Is this random, or does it happen every time? We should be able to provide a unit test that demonstrates this.

chbndrhnns commented 8 years ago

It happens every time for me. I created a test case (inherited from the existing tests) that demonstrates the issue at http://pastebin.com/nKHDHj7L

edwardcapriolo commented 8 years ago

Awesome! Nice use of tunit :) I will take a look at this. I feel like something similar has been reported before.

edwardcapriolo commented 8 years ago

Sorry for the delay. I was looking at this and changed your test a bit. The issue is that messages with a higher heartbeat often win, so nodes ping-pong between states. The answer here is to implement some type of hold-down timer so that recently revived nodes are not so eagerly marked dead again. I will look at that later tonight.
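
For illustration, a hold-down timer along these lines might look like the sketch below. This is not the code in this repository: `HoldDownMember`, `onHeartbeat`, `checkAlive`, and both intervals are hypothetical names chosen for the example, and the sketch assumes a member is normally declared dead once it has been silent for longer than a cleanup interval. Time is passed in explicitly so the behavior is easy to test.

```java
// Hypothetical sketch of a hold-down timer; names and thresholds are illustrative only.
final class HoldDownMember {
    private final long cleanupIntervalMillis; // silence longer than this normally means "dead"
    private final long holdDownMillis;        // grace period granted after a revival
    private long lastHeardMillis;
    private long holdDownUntilMillis = Long.MIN_VALUE; // no hold-down active initially
    private boolean alive = true;

    HoldDownMember(long cleanupIntervalMillis, long holdDownMillis, long nowMillis) {
        this.cleanupIntervalMillis = cleanupIntervalMillis;
        this.holdDownMillis = holdDownMillis;
        this.lastHeardMillis = nowMillis;
    }

    // Called whenever fresh gossip about this member arrives.
    void onHeartbeat(long nowMillis) {
        if (!alive) {
            // The member just came back: start the hold-down window.
            alive = true;
            holdDownUntilMillis = nowMillis + holdDownMillis;
        }
        lastHeardMillis = nowMillis;
    }

    // Periodic failure check; returns the member's state after the check.
    boolean checkAlive(long nowMillis) {
        boolean holdDownExpired = nowMillis >= holdDownUntilMillis;
        boolean silentTooLong = nowMillis - lastHeardMillis > cleanupIntervalMillis;
        if (alive && holdDownExpired && silentTooLong) {
            alive = false;
        }
        return alive;
    }
}
```

With a 1000 ms cleanup interval and a 5000 ms hold-down, a member revived at t=2000 would stay alive until at least t=7000 even if no further heartbeat arrives, which is the damping effect described above.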

edwardcapriolo commented 8 years ago

I have tuned up the code a bit. I think I am going to turn the heartbeat into a timestamp so that it is easier to establish chronology. Right now, when a node starts up again its heartbeat is 0 and its gossip never "wins" over older records.
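
As a rough sketch of why the restart case fails with a counter and works with a timestamp (not the actual code in this repository): `HeartbeatRecord` and `newer` are hypothetical names, and the merge rule shown, where the record with the larger heartbeat wins, is an assumption based on the description above.

```java
// Hypothetical sketch; these names are not taken from the gossip code base.
final class HeartbeatRecord {
    final String memberId;
    final long heartbeat; // either a counter that resets on restart or a wall-clock timestamp

    HeartbeatRecord(String memberId, long heartbeat) {
        this.memberId = memberId;
        this.heartbeat = heartbeat;
    }

    // Assumed merge rule: the larger heartbeat wins; ties keep the local copy.
    static HeartbeatRecord newer(HeartbeatRecord local, HeartbeatRecord remote) {
        return remote.heartbeat > local.heartbeat ? remote : local;
    }
}
```

With a counter, a peer holding a stale record with heartbeat 57 keeps it when the restarted node gossips heartbeat 0. If the heartbeat is a timestamp such as `System.currentTimeMillis()`, the restarted node's value is larger than anything it sent before the restart (as long as its clock has not moved backwards), so its record wins the merge.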

edwardcapriolo commented 8 years ago

@chbndrhnns I noticed your test had one minor bug: you were adding a new gossiper to the list without removing the old one. Still, there were other issues that I addressed. Can you give https://github.com/edwardcapriolo/gossip/compare/ts_as_heartbeat?expand=1 a try and let me know your experience with it? If I don't hear from you in a few days I will merge.

chbndrhnns commented 8 years ago

Hey, thanks very much for your work! While the new tests always pass, the timestamp heartbeat system does not work in my real-life application. Maybe I can provide you with a test set that fails.