Seagate / halon

High availability solution
Apache License 2.0

HALON-895: fix cluster stuck after TS-node crash regression #1566

Closed andriytk closed 5 years ago

andriytk commented 5 years ago

The regression was introduced in commit 619bc5b0 by this code:

+isRCNode :: NodeId -> Process (Bool)
+isRCNode nid = do
+    self <- getSelfPid
+    whereisRemoteAsync nid labelRecoveryCoordinator
+    void . spawnLocal $ receiveTimeout (1000000) [] >> usend self ()
+    receiveWait
+      [ match (\(WhereIsReply _ mp) -> (if isNothing mp then return False else return True))]
+    where
+      labelRecoveryCoordinator = "mero-halon.RC"

There were two problems with the timeout here: 1) the timeout event from the spawned thread was ignored, because receiveWait matched only WhereIsReply, and 2) the timeout thread was not cancelled in the normal (non-timeout) case. So the calling process always received a stray `()` message, which could cause all sorts of weird behaviour (including the stuck cluster we observed).
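The two problems can be illustrated with a small stand-alone model. This is only a sketch: it uses a plain `Chan` from `base` in place of a distributed-process mailbox, and the names `Msg`, `Reply`, `Tick` and `buggyQuery` are hypothetical. Unlike a real mailbox, the model drops an unmatched `Tick` instead of leaving it queued, but the leftover-message effect is the same.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Monad (void)

-- Hypothetical stand-ins: Reply models WhereIsReply, Tick models the `()` event.
data Msg = Reply Bool | Tick
  deriving (Eq, Show)

-- Mimics the buggy pattern from the diff above.
buggyQuery :: Chan Msg -> IO Bool
buggyQuery mailbox = do
    -- Problem 2: the timer is never cancelled, so it posts Tick
    -- even when the reply has already been handled.
    void . forkIO $ threadDelay 100000 >> writeChan mailbox Tick
    waitReply
  where
    -- Problem 1: only Reply is handled, so a Tick never ends the wait;
    -- if no reply ever arrives, this loop blocks forever.
    waitReply = do
      msg <- readChan mailbox
      case msg of
        Reply b -> return b
        Tick    -> waitReply

main :: IO ()
main = do
    mailbox <- newChan
    writeChan mailbox (Reply True)   -- the reply arrives promptly
    found <- buggyQuery mailbox
    print found
    threadDelay 200000               -- let the leftover timer fire
    stray <- readChan mailbox        -- unrelated code now sees the stale Tick
    print stray
```

In the normal case the query succeeds, yet a `Tick` still lands in the mailbox afterwards, where any later receive that accepts it may misinterpret it.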

Now we don't spawn the timeout thread at all, but directly use receiveTimeout instead of receiveWait.
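A minimal sketch of what the fixed function might look like, based only on the description above. Treating a timeout as `False` is an assumption made here for illustration; the merged code may handle that case differently.

```haskell
import Control.Distributed.Process
  ( NodeId, Process, WhereIsReply(..)
  , match, receiveTimeout, whereisRemoteAsync )
import Data.Maybe (fromMaybe, isJust)

isRCNode :: NodeId -> Process Bool
isRCNode nid = do
    whereisRemoteAsync nid labelRecoveryCoordinator
    -- receiveTimeout takes microseconds and yields Nothing when no
    -- matching message arrives in time, so no helper thread is needed.
    mreply <- receiveTimeout 1000000
      [ match (\(WhereIsReply _ mp) -> return (isJust mp)) ]
    -- Assumption: no reply within one second means "no RC on that node".
    return (fromMaybe False mreply)
  where
    labelRecoveryCoordinator = "mero-halon.RC"
```

Because nothing is spawned, no stray `()` message can be left in the mailbox, and the wait is bounded even when the remote node never answers.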

vvv commented 5 years ago

Looks perfect to me.

vvv commented 5 years ago

merged

rajanikantchirmade commented 5 years ago

Looks good to me.

andriytk commented 5 years ago

assigned to @rajanikant.chirmade

andriytk commented 5 years ago

@rajanikant.chirmade @vvv please review it.

andriytk commented 5 years ago

changed the description