This bug was uncovered during a code review: a FAILURE notification can be
missed when two or more instances are killed at the same time. (Note that,
given the race condition between the node agent restarting a killed instance
and the failure notification, only a test that kills the node agent and then
kills instances can be assured of seeing a FAILURE_NOTIFICATION for each
server instance killed. A node agent can restart a server instance before
Shoal reports it as FAILED.)
HealthMonitor.InDoubtPeerDetector.processCacheUpdate() iterates over all
instances in the cluster, checking whether any are in doubt. If an instance is
detected to be in doubt, HealthMonitor.InDoubtPeerDetector.determineInDoubtPeers()
notifies the FailureVerifier thread to process the current cache, looking for
in-doubt peers and verifying which instances should have a FAILURE_NOTIFICATION
sent.
The notification signal from the InDoubtPeerDetector thread to the
FailureVerifier thread is the weak link in this bug. When multiple failures
happen at once, the code as currently written acts on the first instance
failure immediately. Instead, the InDoubtPeerDetector should iterate over all
instances and, if one or more instances are in doubt, notify the
FailureVerifier thread once to run over all instances in the cluster cache.
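A minimal sketch of the suggested fix (all names here are hypothetical, not
Shoal's actual internals): determine in-doubt peers across the whole cache
first, then make a single notification decision afterwards, so the
FailureVerifier always rescans every instance rather than being woken once per
individual failure.

    import java.util.List;

    // Sketch: scan the entire cluster cache, collect all in-doubt peers,
    // and return a single should-notify decision. The caller then performs
    // ONE verifierLock.notify() instead of one notify per in-doubt peer.
    public class InDoubtScan {
        public interface Peer {
            boolean isInDoubt();
            String name();
        }

        // Returns true if the FailureVerifier should be notified, i.e. at
        // least one peer was found in doubt during the full-cache pass.
        public static boolean scan(List<Peer> cache, List<String> inDoubt) {
            for (Peer p : cache) {
                if (p.isInDoubt()) {
                    inDoubt.add(p.name());   // mark; do NOT notify per peer
                }
            }
            return !inDoubt.isEmpty();       // single notification decision
        }
    }

With this shape, two simultaneously killed instances both end up in the
in-doubt list before the verifier is signalled, so neither can be skipped.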
The bug could be that the InDoubtPeerDetector runs twice: the first run
notifies the FailureVerifier to run on the instance cache, and the verifier
detects the first killed instance. The second time the InDoubtPeerDetector
runs, it could notify the FailureVerifier while it is still verifying the
first failure (with a snapshotted cache). That second notify to an
already-running FailureVerifier thread has no effect, so the
FAILURE_NOTIFICATION for the second killed server instance is not sent until
much later, when the next failure occurs or the client is shut down.
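The lost-notify hazard described above is the classic Object.notify() missed
signal: a notify() issued while the target thread is not in wait() is simply
dropped. The standard guard is to pair the notify with a pending-work flag and
have the waiter loop on that flag. A minimal sketch, with hypothetical names
(this is not Shoal's actual code):

    // Sketch: a missed-signal-proof handoff between the detector and
    // verifier threads. The detector sets a pending flag under the lock;
    // the verifier re-checks the flag in a loop. A notify() that arrives
    // while the verifier is mid-scan is not lost -- the flag stays set and
    // the verifier immediately runs one more full pass over the cache.
    public class VerifierSignal {
        private final Object verifierLock = new Object();
        private boolean scanPending = false;

        // Called by the InDoubtPeerDetector thread.
        public void requestScan() {
            synchronized (verifierLock) {
                scanPending = true;      // survives even if verifier is busy
                verifierLock.notify();
            }
        }

        // Called at the top of the FailureVerifier thread's loop.
        public void awaitScanRequest() throws InterruptedException {
            synchronized (verifierLock) {
                while (!scanPending) {   // also guards spurious wakeups
                    verifierLock.wait();
                }
                scanPending = false;     // consume; a later notify re-sets it
            }
        }
    }

With this pattern, the second killed instance's signal is preserved even when
it arrives while the verifier is working on the first failure's snapshot.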
The current notification code in determineInDoubtPeers() signals the verifier
once per in-doubt instance:

    synchronized (verifierLock) {
        verifierLock.notify();
        LOG.log(Level.FINER, "Done Notifying FailureVerifier for " + entry.adv.getName());
    }
Environment
Operating System: All
Platform: All
Affected Versions
[current]