Open glassfishrobot opened 15 years ago
@glassfishrobot Commented Reported by little_zizou
@glassfishrobot Commented @jfialli said: More information is necessary to research this issue.
1. Please describe what is meant by "out of network". Is the network cable being pulled from the machine?
2. We have a shoal qe test that verifies that all failure notifications are sent to surviving group members when a non-master node is killed (via kill -9). The test verifies that all FAILURE notifications are sent. (Tests are run on the main branch of shoal. Please confirm you are running these tests.)
Please submit logs (by attaching a zip of log files) that illustrate your issue. Logging at level FINE would be sufficient to follow what is occurring.
@glassfishrobot Commented little_zizou said: Created an attachment (id=20) Scenario 1 Testcase
@glassfishrobot Commented little_zizou said:
More information is necessary to research this issue.
1. Please describe what is meant by "out of network". Is the network cable being pulled from the machine?
I have disabled my LAN network to simulate network failure kind of scenario (similar to unplugging network cable).
2. We have a shoal qe test that verifies that all failure notifications ... Please confirm you are running these tests.)
I have not run the tests you mentioned. Instead, I have written my own test cases to verify joining nodes to the network and processing failure notifications.
TestCase Description: We have 3 systems with 3 shoal clients (Client1, Client2 & Client3), each client running on a different system with member token names as server1, server2 and server3 respectively, all in the same group.
Scenario 1: server2 and server3 are started before server1. When we disable the network on server1, I see 2 FailureSuspectedSignals and 2 FailureNotificationSignals (for server2 and server3 respectively), as expected.
Scenario 2: We again have 3 clients running on 3 different systems, but the member token that joined the group as "server1" is renamed to "server5".
The systems are started just as in the previous case: server2 and server3 are started before server5, and then the LAN is disabled on server5. This time I see 2 FailureSuspectedSignals, but only one FailureNotificationSignal.
I have attached the test sources and logs of both Scenario1 and Scenario2 for your reference.
@glassfishrobot Commented little_zizou said: Created an attachment (id=21) Scenario 2 TestCase
@glassfishrobot Commented @jfialli said: Issue understood.
The code in question is the handling of a detected masterFailed, combined with the fact that only the new master is allowed to announce the failure.
private void assignAndReportFailure(final HealthMessage.Entry entry) {
@glassfishrobot Commented @jfialli said: Started analysis of the issue from the submitted logs. See the previous comments made when reassigning the issue to myself.
@glassfishrobot Commented @jfialli said: Summary of issue reported for scenario 2 submitted on Sept 14th.
When the network LAN fails for a non-master instance of a group, the submitter of this issue expects to receive a FAILURE notification for each instance on the isolated subnet that is no longer reachable.
Shoal's heartbeat failure detection is working and detects that the instances are no longer reachable; however, the isolated instance will not receive any failure notifications about the unreachable members of the group until it finally makes itself the master node.
For the submitted scenario 1, "server1" becomes the master node after "server2" is no longer reachable, so no FAILURE events are dropped for that scenario. Even though "server1" was not the master before the LAN was disabled, "server1" is made the master node for its subnet of one immediately, due to naming comparisons between it and the other remaining server names in the GMS group.
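The behavior described above can be illustrated with a small, simplified model. This is only a sketch and not Shoal's actual implementation: the class and method names are hypothetical, and it assumes that the new master is the lexicographically smallest member token still in the isolated instance's view, that "server2" was the master before the LAN was disabled, and that "server2" times out before "server3". Under those assumptions, a failure is announced locally only while the isolated instance considers itself the master.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Simplified model for illustration only; not Shoal's actual code. */
final class MasterOnlyFailureSketch {

    /**
     * Replays heartbeat timeouts as seen by one isolated member.
     *
     * @param self         member token of the isolated instance
     * @param master       member token of the master before the network loss (assumed)
     * @param failureOrder peers in the order their heartbeats time out (assumed)
     * @return the failures that the isolated instance announces locally
     */
    static List<String> announcedFailures(String self, String master, List<String> failureOrder) {
        // The isolated instance's view of the group, sorted by member token.
        TreeSet<String> view = new TreeSet<>(failureOrder);
        view.add(self);
        view.add(master);

        String currentMaster = master;
        List<String> announced = new ArrayList<>();

        for (String failed : failureOrder) {
            view.remove(failed);
            if (failed.equals(currentMaster)) {
                // Assumed rule: the smallest remaining token becomes the new master.
                currentMaster = view.first();
            }
            // Only the (new) master is allowed to announce the failure.
            if (self.equals(currentMaster)) {
                announced.add(failed);
            }
        }
        return announced;
    }
}
```

In this model, calling `announcedFailures("server1", "server2", ...)` with the timeouts for "server2" and "server3" reports both failures, because "server1" sorts first and takes over as master immediately. With "server5" as the isolated instance, "server3" is still in the view when "server2" times out and is picked as the new master, so the failure of "server2" is never announced; only the later failure of "server3" is reported, once "server5" is alone and becomes master. That matches the two versus one FailureNotificationSignals observed in scenarios 1 and 2.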
@glassfishrobot Commented File: Scenario1.zip Attached By: little_zizou
@glassfishrobot Commented File: Scenario2.zip Attached By: little_zizou
@glassfishrobot Commented This issue was imported from java.net JIRA SHOAL-93
I have been trying to use Shoal with my application. Assume I have a cluster setup with four nodes running on four different systems. If one of the nodes suddenly goes out of network and it is not a master node, I get three FailureSuspectedSignals but not all three FailureNotificationSignals. If the node that went out of network was a master node, I get three FailureSuspectedSignals and three FailureNotificationSignals. Shouldn't it behave this way in the first case as well?
Environment
Operating System: All
Platform: Windows
Affected Versions
[current]