missing FailureNotificationSignal during network failure when non-master is isolated

glassfishrobot commented 15 years ago

I have been trying to use Shoal with my application. Assume a cluster of four nodes running on four different systems. If one node suddenly drops off the network and it is not the master node, I get three FailureSuspectedSignals but not all three FailureNotificationSignals. If the node that dropped off the network was the master node, I get three FailureSuspectedSignals and three FailureNotificationSignals. Shouldn't it behave the same way in the first case as well?

Environment

Operating System: All Platform: Windows

Affected Versions

[current]

glassfishrobot commented 6 years ago

Issue Imported From: https://github.com/javaee/shoal/issues/93
Original Issue Raised By:@glassfishrobot
Original Issue Assigned To: @jfialli

glassfishrobot commented 15 years ago

@glassfishrobot Commented Reported by little_zizou

glassfishrobot commented 15 years ago

@glassfishrobot Commented @jfialli said: More information is necessary to research this issue.

1. Please describe what is meant by "out of network". Is the network cable being pulled from the machine?

2. We have a shoal qe test that verifies that all failure notifications are sent to surviving group members when a non-master node is killed (via kill -9). The test verifies that all FAILURE notifications are sent. (Tests are run on the main branch of shoal. Please confirm you are running these tests.)

Please submit logs (attached as a zip of log files) that illustrate your issue. Logging at the FINE level would be sufficient to follow what is occurring.

glassfishrobot commented 14 years ago

@glassfishrobot Commented little_zizou said: Created an attachment (id=20) Scenario 1 Testcase

glassfishrobot commented 14 years ago

@glassfishrobot Commented little_zizou said:

More information is necessary to research this issue.

1. Please describe what is meant by "out of network". Is the network cable being pulled from the machine?

I disabled my LAN connection to simulate a network failure (similar to unplugging the network cable).

2. We have a shoal qe test that verifies that all failure notifications ... Please confirm you are running these tests.)

I have not run the tests you mentioned. Instead, I have written my own test cases to verify that nodes join the group and that failure notifications are processed.

TestCase Description: We have 3 systems with 3 shoal clients (Client1, Client2 & Client3), each client running on a different system with member token names as server1, server2 and server3 respectively, all in the same group.

Scenario 1: server2 and server3 are started before server1. When we then disable the network on server1, I see 2 FailureSuspectedSignals and 2 FailureNotificationSignals (one each for server2 and server3), as expected.

Scenario 2: We again have 3 clients running on 3 different systems, but the member token that previously joined the group as "server1" is renamed to "server5".

The systems are started just as in the previous case: server2 and server3 are started before server5, and then the LAN is disabled on server5. This time I see 2 FailureSuspectedSignals but only one FailureNotificationSignal.

I have attached the test sources and logs of both Scenario1 and Scenario2 for your reference.
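A test of this kind could tally the signals per member to pinpoint which notification never arrived. The following is a minimal sketch under stated assumptions, not the attached test case: `SignalTally`, `Kind`, and `suspectedButNeverFailed` are hypothetical names, and the sample events mirror Scenario 2 as it is analysed later in this thread (a suspicion for both peers, a failure notification for only one of them). In a real test, `record` would be called from the callback that receives each Shoal signal.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class SignalTally {
    // Hypothetical classification of the two signal types discussed in this issue.
    public enum Kind { SUSPECTED, FAILED }

    private final Set<String> suspected = new LinkedHashSet<>();
    private final Set<String> failed = new LinkedHashSet<>();

    // Record one received signal for the given member token.
    public void record(Kind kind, String memberToken) {
        if (kind == Kind.SUSPECTED) {
            suspected.add(memberToken);
        } else {
            failed.add(memberToken);
        }
    }

    // Members that were suspected but for which no failure notification arrived.
    public List<String> suspectedButNeverFailed() {
        List<String> missing = new ArrayList<>(suspected);
        missing.removeAll(failed);
        return missing;
    }

    public static void main(String[] args) {
        SignalTally tally = new SignalTally();
        tally.record(Kind.SUSPECTED, "server2");
        tally.record(Kind.SUSPECTED, "server3");
        tally.record(Kind.FAILED, "server3");
        System.out.println(tally.suspectedButNeverFailed()); // prints [server2]
    }
}
```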

glassfishrobot commented 14 years ago

@glassfishrobot Commented little_zizou said: Created an attachment (id=21) Scenario 2 TestCase

glassfishrobot commented 14 years ago

@glassfishrobot Commented @jfialli said: Issue understood.

The code in question is the handling of a detected masterFailed condition, combined with the fact that only the new master is allowed to announce the failure.

    private void assignAndReportFailure(final HealthMessage.Entry entry) {
        final boolean masterFailed = (masterNode.getMasterNodeID()).equals(entry.id);
        if (masterNode.isMaster() && masterNode.isMasterAssigned()) {
        } else if (masterFailed) {
            // remove the failed node
            LOG.log(Level.FINE, MessageFormat.format(
                    "Master Failed. Removing System Advertisement : {0} for master named {1}",
                    entry.id.toString(), entry.adv.getName()));
            manager.getClusterViewManager().remove(entry.adv);
            masterNode.resetMaster();
            masterNode.appointMasterNode();
            if (masterNode.isMaster() && masterNode.isMasterAssigned()) {
                LOG.log(Level.FINE, MessageFormat.format(
                        "Announcing Failure Event of {0} for name {1} ...",
                        entry.id, entry.adv.getName()));
                final ClusterViewEvent cvEvent =
                        new ClusterViewEvent(ClusterViewEvents.FAILURE_EVENT, entry.adv);
                masterNode.viewChanged(cvEvent);
            }
        }
        cleanAllCaches(entry);
    }
    }

To avoid multiple reports of a FAILURE, only the master is typically allowed to report a failure to the rest of the cluster.

For Scenario 2, when the LAN is disabled on "server5", the reporter of this issue is looking for both "server2" and "server3" to have failure events. In the submitted logs for Scenario 2, the heartbeat failure detection does detect that both server2 and server3 have failed (from server5's point of view, they are both running in their own subnet). However, the failure is not reported for server2, since server3 is calculated to be the new master for server5. Unfortunately, server3 also cannot communicate with "server5", hence the missing announcement of the failure of server2. When "server3" is then detected to have failed, server5 is the sole instance left in its subnet cluster, so it becomes the master and reports that server3 has failed.

To summarize, heartbeat failure detection is working correctly. The "server5" view of the cluster is correct; only the failure notification for "server2" is missing in this scenario. The reason for the missing failure notification is in the code fragment included above.
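The behaviour described above can be reproduced with a small standalone model of the isolated node's view. This is a sketch only and does not use Shoal classes: `IsolatedViewSimulation` and `announcedFailures` are hypothetical names. It assumes that the acting master announces a failure (that branch is empty in the quoted excerpt), that a new master is chosen as the lexicographically smallest member name in the remaining view, and that "server2" was the master before the LAN was disabled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class IsolatedViewSimulation {

    // Returns the members for which the isolated node announces a FAILURE,
    // given the order in which its heartbeat detection declares peers failed.
    public static List<String> announcedFailures(String self, String initialMaster,
                                                 List<String> failureOrder) {
        TreeSet<String> view = new TreeSet<>(failureOrder);
        view.add(self);
        String master = initialMaster;
        List<String> announced = new ArrayList<>();

        for (String failedMember : failureOrder) {
            boolean masterFailed = master.equals(failedMember);
            if (master.equals(self)) {
                // Assumption: the acting master announces the failure.
                announced.add(failedMember);
            } else if (masterFailed) {
                view.remove(failedMember);
                // Assumption: the smallest remaining name becomes the new master.
                master = view.first();
                if (master.equals(self)) {
                    announced.add(failedMember);
                }
            }
            // Otherwise nothing is announced: the (unreachable) master is expected to do it.
            view.remove(failedMember);
        }
        return announced;
    }

    public static void main(String[] args) {
        List<String> order = List.of("server2", "server3");
        // Scenario 1: the isolated node is named server1.
        System.out.println(announcedFailures("server1", "server2", order)); // [server2, server3]
        // Scenario 2: the isolated node is named server5.
        System.out.println(announcedFailures("server5", "server2", order)); // [server3]
    }
}
```

Under these assumptions, the only difference between the two scenarios is where the isolated node's name sorts relative to its peers, which matches the explanation that the failure of server2 goes unannounced on server5.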

glassfishrobot commented 14 years ago

@glassfishrobot Commented @jfialli said: started analysis of issue from submitted logs. see previous comments made when reassigning issue to myself.

glassfishrobot commented 14 years ago

@glassfishrobot Commented @jfialli said: Summary of issue reported for scenario 2 submitted on Sept 14th.

When the LAN fails for a non-master instance of a group, the submitter of this issue expects the isolated instance to receive a FAILURE notification for each instance that is no longer reachable.

Shoal's heartbeat failure detection does detect that the instances are no longer reachable; however, the isolated instance will not receive any failure notifications about the unreachable members of the group until it finally makes itself the master node.

For the submitted Scenario 1, "server1" becomes the master node after "server2" is no longer reachable, so no FAILURE events are dropped in that scenario. Even though "server1" was not the master before the LAN was disabled, "server1" is made the master node for its subnet of one immediately, due to naming comparisons between it and the other remaining server names in the GMS group.

glassfishrobot commented 14 years ago

@glassfishrobot Commented File: Scenario1.zip Attached By: little_zizou

glassfishrobot commented 14 years ago

@glassfishrobot Commented File: Scenario2.zip Attached By: little_zizou

glassfishrobot commented 7 years ago

@glassfishrobot Commented This issue was imported from java.net JIRA SHOAL-93
