charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.

Proactive fault tolerance fails due to sending message to dead node. #1279

Open pplimport opened 8 years ago

pplimport commented 8 years ago

Original author: Justin Miron Original issue: https://charm.cs.illinois.edu/redmine/issues/1279


CkLocMgr::deliverMsg attempts to send messages to evacuated chares.

Fails this check: if((!CmiNodeAlive(destPE) && destPE != allowMessagesOnly)){ CkAbort("Cannot send to a chare on a dead node"); }

CmiNodeAlive checks whether the valid processor bit is set for destPE. allowMessagesOnly should be set to msg->pe on every node during the ACK to the evacuation. It is set AFTER the evacuation has occurred and the PE has announced its evacuation.

allowMessagesOnly is set after the valid processor bit is cleared to 0. If a message delivery is attempted between these two events, the check above can fail and trigger the CkAbort.

Investigating setting the allowMessagesOnly value first.
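
To make the ordering problem concrete, here is a minimal standalone sketch of the window described above. It does not use the Charm++ runtime: NodeState, wouldAbort, racyOrderAborts and exemptionFirstAborts are hypothetical names made up for illustration, and only the condition inside wouldAbort is taken from the check quoted in this issue.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for per-node runtime state; not the real Charm++ structures.
struct NodeState {
    std::vector<bool> validPE;  // models the "valid processor" bit for each PE
    int allowMessagesOnly;      // PE that may still receive messages while evacuating

    explicit NodeState(std::size_t numPEs)
        : validPE(numPEs, true), allowMessagesOnly(-1) {}
};

// Same condition as the check quoted in the issue: true means the send would abort.
bool wouldAbort(const NodeState& s, int destPE) {
    return !s.validPE[destPE] && destPE != s.allowMessagesOnly;
}

// Ordering described in the issue: the valid bit is cleared first,
// and the exemption is only set later, when the ACK is processed.
bool racyOrderAborts(int evacuatedPE) {
    NodeState s(4);
    s.validPE[evacuatedPE] = false;                 // PE marked dead
    bool abortedInWindow = wouldAbort(s, evacuatedPE);  // delivery attempted in the gap
    s.allowMessagesOnly = evacuatedPE;              // exemption arrives too late
    return abortedInWindow;
}

// Proposed ordering: set the exemption before clearing the valid bit.
bool exemptionFirstAborts(int evacuatedPE) {
    NodeState s(4);
    s.allowMessagesOnly = evacuatedPE;
    bool beforeClear = wouldAbort(s, evacuatedPE);  // PE still valid, no abort
    s.validPE[evacuatedPE] = false;
    bool afterClear = wouldAbort(s, evacuatedPE);   // PE dead but exempt, no abort
    return beforeClear || afterClear;
}
```

In this model, racyOrderAborts reports an abort for a delivery attempted inside the window, while exemptionFirstAborts does not. It only models a single node's view; the real runtime updates this state on every node, so the actual fix may need more than reordering two assignments.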

pplimport commented 5 years ago

Original author: Justin Miron Original date: 2016-11-03 14:52:22


This check may not be necessary. If the PE was previously on a node that is now dead, then it should call DeliverUnknown as it may have been migrated. However, this will trigger a delivery to the homePE; if the homePE is the dead processor, this will fail.

Check referred to: if((!CmiNodeAlive(destPE) && destPE != allowMessagesOnly)){ CkAbort("Cannot send to a chare on a dead node"); }
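
The fallback reasoning above can be written down as a small decision function. This is a standalone sketch and not the cklocation.C logic: Route, chooseRoute and the alive vector are hypothetical names, and the ViaHome case only stands in for what the comment describes as calling DeliverUnknown.

```cpp
#include <cassert>
#include <vector>

// Hypothetical outcome of a delivery decision; not a Charm++ type.
enum class Route { Direct, ViaHome, Undeliverable };

// Decide how to route a message given the last known PE and the home PE.
// alive[pe] models the result of a liveness query for that PE.
Route chooseRoute(int lastKnownPE, int homePE, const std::vector<bool>& alive) {
    if (alive[lastKnownPE]) {
        return Route::Direct;       // destination is alive, send as usual
    }
    if (alive[homePE]) {
        return Route::ViaHome;      // object may have migrated; ask its home PE
    }
    return Route::Undeliverable;    // home PE is the dead one: the failing case
}
```

The third branch is the case the comment warns about: dropping the abort check alone does not help when the home PE is the evacuated processor.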

stwhite91 commented 5 years ago

Original date: 2016-11-03 16:25:06


This was changed in the 64-bit ID changes. Look at line 2635 here: https://charm.cs.illinois.edu/gerrit/#/c/1217/ https://github.com/UIUC-PPL/charm/commit/71a0f8961609fd2bf40f62e1f337644f62734b7c5/src/ck-core/cklocation.C

pplimport commented 5 years ago

Original author: Justin Miron Original date: 2016-11-03 17:31:24


Thanks, that helps a lot.

Reinserting the getNextPE code works when the next PE is computed from the evacuated destPE integer. Using the CkArrayIndices leads to problems because the CkArrayIndex* passed in is sometimes NULL.

getNextPE previously used a hash of the CkArrayIndices, so an equivalent is needed for the integers.
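
One possible shape for an integer-based equivalent, as a rough sketch only: hash the 64-bit ID to pick a starting PE, then probe forward to the first PE that is still alive. mixId and nextAlivePE are hypothetical names, the alive vector stands in for a liveness query such as CmiNodeAlive, and the mixing step is the SplitMix64 finalizer, chosen here for illustration rather than taken from the original getNextPE.

```cpp
#include <cstdint>
#include <vector>

// Mix the bits of a 64-bit ID (SplitMix64 finalizer) so that
// consecutive IDs do not all map to neighbouring PEs.
std::uint64_t mixId(std::uint64_t x) {
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Hypothetical replacement for an index-hash-based lookup:
// returns the first alive PE at or after the hashed start, or -1 if none is alive.
int nextAlivePE(std::uint64_t id, const std::vector<bool>& alive) {
    const std::uint64_t n = alive.size();
    if (n == 0) {
        return -1;
    }
    const std::uint64_t start = mixId(id) % n;
    for (std::uint64_t i = 0; i < n; ++i) {
        const std::uint64_t pe = (start + i) % n;  // wrap around the PE range
        if (alive[pe]) {
            return static_cast<int>(pe);
        }
    }
    return -1;  // every PE is marked dead
}
```

Because the result depends only on the ID and the liveness table, every node that sees the same table picks the same PE for a given ID, which is the property the index hash presumably provided before.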

The CkAbort is now avoided, but proactive fault tolerance still hangs.