charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.
Apache License 2.0

CkLoop race condition error #643

Closed jcphill closed 9 years ago

jcphill commented 9 years ago

Original issue: https://charm.cs.illinois.edu/redmine/issues/643


Using CkLoop in NAMD PME offload: works fine on 14 pes, but fails with the error below on 30 when running +p30 ++ppn 30 +pemap 1-15,17-31 +commap 0,16 (on a 16-core, 32-thread machine):

    [19] Stack Traceback:
      [19:0] CmiAbortHelper+0x73 [0xbb2933]
      [19:1] _ZN11CurLoopInfo9stealWorkEv+0x181 [0xa8fa2b]
      [19:2] _Z21SingleHelperStealWorkP17converseNotifyMsg+0xb9 [0xa8f8a8]
      [19:3] CsdScheduleForever+0x78 [0xbbaee8]
      [19:4] CsdScheduler+0x2d [0xbbafed]
      [19:5] [0xbb6a86]
      [19:6] [0xbb6f53]
      [19:7] [0x301e4079d1]
      [19:8] clone+0x6d [0x301e0e8b6d]
    Fatal error on PE 19> Indices of CkLoop incorrect. There maybe a race condition!

CkLoop_bug.patch ckloop.patch.1 ckloop.patch.2 ckloop.patch

jcphill commented 5 years ago

Original date: 2015-01-02 23:58:48


Just to add a little extra information, this is a non-sync call.

jcphill commented 5 years ago

Original date: 2015-01-04 22:00:41


I get the exact same error from a sync call.

harshithamenon commented 5 years ago

Original date: 2015-01-16 21:17:58


Jim,

A few months back I fixed a race condition in CkLoop. I believe this one is recent, right? Do you have a small example that I can use to debug?

Harshitha

jcphill commented 5 years ago

Original date: 2015-01-16 21:21:48


I don't know how to reproduce it outside of full NAMD. Was your fix included in 6.6.1?

harshithamenon commented 5 years ago

Original date: 2015-01-22 21:29:20


Yes, that fix was put in 6.6.1, and I added the abort message you got.

I can use NAMD to reproduce this error.

harshithamenon commented 5 years ago

Original date: 2015-02-06 20:47:43


Do I need a specific dataset to reproduce it?

jcphill commented 5 years ago

Original date: 2015-02-09 19:53:57


You would need a specific NAMD hack, which I've now abandoned. It's probably better to do some thorough code review for conditions that could lead to that error.

harshithamenon commented 5 years ago

Original date: 2015-02-09 21:53:16


If it is easy enough, can you share the code changes required to reproduce this problem?

jcphill commented 5 years ago

Original date: 2015-02-10 19:11:58


Get the NAMD nightly build source and apply the attached patch.

Build a multicore CUDA NAMD binary (./config Linux-x86_64-g++ --with-cuda).

Get http://www.ks.uiuc.edu/Research/namd/utilities/apoa1.tar.gz and run

./namd2 +devices 0 --PMEOffload yes --FFTWEstimate yes +p30 +pemap 1-15,17-31 path/to/apoa1.namd

harshithamenon commented 5 years ago

Original date: 2015-02-17 21:26:20


I am using a lab machine to reproduce the problem. I applied your patch and built NAMD multicore with the --with-cuda option. I am using Talent, which has a total of 4 cores, for these runs, since CUDA is installed on that machine.

I enabled CkLoop and ran the following command. The CkLoop constructor is called, but parallelizeFunc is not being invoked. Am I missing something? Do you know if this bug shows up on a smaller number of cores? If not on the lab machines, which machine can I use to reproduce this problem?

./namd2 +devices 0 --PMEOffload yes --FFTWEstimate yes +p3 +pemap 1-3 ~/workspace/namd/apoa1/apoa1.namd

Thanks

jcphill commented 5 years ago

Original date: 2015-02-18 16:31:05


As I said originally, it works on 14 pes but fails on 30 on a 16-core, 32-thread machine.

Try Stampede or the campus cluster (Taub/Golub), or hand this bug off to someone with a TCBG token.

jcphill commented 5 years ago

Original date: 2015-03-13 06:16:15


I get the same error when building NAMD configured using --with-cuda and -DUSE_CKLOOP=1 etc. in the Makefile and running as:

    ./charmrun ++local +p14 ./namd2 ++ppn 7 +devices 0,2 $APOA1 --pmeoffload yes --pmepencils 1 --pmeprocessors 2 --fftwestimate yes --useCkLoop 4 --run 20

    ------------- Processor 11 Exiting: Called CmiAbort ------------ Reason: Indices of CkLoop incorrect. There maybe a race condition!
    ------------- Processor 7 Exiting: Called CmiAbort ------------ Reason: Indices of CkLoop incorrect. There maybe a race condition!
    ------------- Processor 12 Exiting: Called CmiAbort ------------ Reason: Indices of CkLoop incorrect. There maybe a race condition!
    ------------- Processor 9 Exiting: Called CmiAbort ------------ Reason: Indices of CkLoop incorrect. There maybe a race condition!
    ------------- Processor 8 Exiting: Called CmiAbort ------------ Reason: Indices of CkLoop incorrect. There maybe a race condition!
    ------------- Processor 10 Exiting: Called CmiAbort ------------ Reason: Indices of CkLoop incorrect. There maybe a race condition!

    [11] Stack Traceback:
      [11:0] CmiAbortHelper+0x73 [0xbe7013]
      [11:1] _ZN11CurLoopInfo9stealWorkEv+0x181 [0xabad57]
      [11:2] _Z21SingleHelperStealWorkP17converseNotifyMsg+0xb9 [0xababd4]
      [11:3] CsdScheduleForever+0xdb [0xbef71b]
      [11:4] CsdScheduler+0x2d [0xbef7bd]
      [11:5] [0xbeb25a]
      [11:6] [0xbeb723]
      [11:7] [0x301e4079d1]
      [11:8] clone+0x6d [0x301e0e8b6d]
    Fatal error on PE 11> Indices of CkLoop incorrect. There maybe a race condition!

jcphill commented 5 years ago

Original date: 2015-03-13 07:07:50


Pretty trivial, actually: just need to ensure that the iteration range is at least as large as numChunks:

    void CkLoop_Parallelize(HelperFn func, int paramNum, void *param,
            int numChunks, int lowerRange, int upperRange,
            int sync, void *redResult, REDUCTION_TYPE type) {
        if (numChunks > upperRange - lowerRange + 1)
            numChunks = upperRange - lowerRange + 1;
        globalCkLoop->parallelizeFunc(func, paramNum, param, numChunks,
                lowerRange, upperRange, sync, redResult, type);
    }

I added some extra locks while debugging, so I'm not sure this is the only issue, in particular if CkLoop is being called from multiple threads.

jcphill commented 5 years ago

Original date: 2015-03-13 20:25:25


Attaching patch that I'm using with NAMD.

harshithamenon commented 5 years ago

Original date: 2015-03-14 05:30:34


Thanks Jim for fixing the bug.

I have sent your patch for review (http://charm.cs.uiuc.edu/gerrit/#/c/631/).

We don't require the extra locks, because if nextChunkId < numChunks then there is a worker working on a chunk, and only when that worker finishes will it call reportFinished. Note that I do the if (nextChunkId >= numChunks) check after locking to get rid of the race; more details of the race are given in CkLoop.h:61. reportFinished atomically increments finishFlag, and a curLoop is free only when finishFlag == numChunks, i.e. all the workers have called reportFinished. curLoop is not set until getNotifyMsg() is called, which checks whether the curLoop is free. Therefore numChunks and the other fields are not modified in a racy fashion.

jcphill commented 5 years ago

Original date: 2015-03-15 05:15:36


Thinking about it some more, that is correct, but there is a race on curChunkIdx being reset to -1 by set() if the message is recycled by getNewTask() while a non-sync call is still running. The next call to getNextChunkIdx will then report 0, but since there is no lock there is no guarantee that the writes to the other loop parameters are visible. What I think might work is to track the number of stealWork calls running and have isFree() only return true when it is zero.

I have an updated patch that I will post tomorrow.

jcphill commented 5 years ago

Original date: 2015-03-15 18:05:22


Updated patch attached.

jcphill commented 5 years ago

Original date: 2015-03-15 21:08:22


This version removes the "WARNING! chunk is set to MAX_CHUNKS=%d" warnings. Getting this at every timestep from 16,384 nodes sucks.

harshithamenon commented 5 years ago

Original date: 2015-05-08 19:54:48


Jim,

Patch http://charm.cs.uiuc.edu/gerrit/#/c/631/ was checked in a month ago; it takes care of clamping numChunks when it exceeds upperRange - lowerRange + 1.

I was trying to figure out whether there is really a race condition in the non-sync mode, and I believe there is not. You brought up: "there is a race on curChunkIdx being reset to -1 by set() if the message is recycled by getNewTask() while a non-sync call is still running". If a non-sync call is still running with chunks of work pending, then getNewTask() will not return that particular CurLoopInfo. In getNewTask() (CkLoop.h:255), cur->isFree() is tested, which checks whether all the chunks are done (finishFlag == numChunks), and only then is that CurLoopInfo returned. If a non-sync call is currently going on with that CurLoopInfo, then finishFlag will not equal numChunks, so that CurLoopInfo won't be returned by getNewTask().

If all the chunks are done and a worker comes along and does getNextChunkIdx(), it will either get a chunkIdx corresponding to the previous non-sync call, in which case it will see that all the chunks are done and return, or it will call getNextChunkIdx() after set() has run for a new non-sync call, in which case it will get 0, which is fine: it will start working on the 0th chunk of the new task. Note that locks are present in set(), so that race condition is avoided.

Please let me know if this is not the case you were thinking about.

Thank you

jcphill commented 5 years ago

Original date: 2015-05-09 16:20:10


I'm looking into the race again, but in the meantime please take out this warning message, or somehow limit it to printing at most once per run, since getting it at every timestep from 16,384 nodes is really annoying:

    if (numChunks > MAX_CHUNKS) {
        CkPrintf("CkLoop[%d]: WARNING! chunk is set to MAX_CHUNKS=%d\n", CmiMyPe(), MAX_CHUNKS);
        numChunks = MAX_CHUNKS;
    }

harshithamenon commented 5 years ago

Original date: 2015-05-09 17:53:25


I have sent a CL to remove the warning. There was another CL by Yanhua (http://charm.cs.uiuc.edu/gerrit/#/c/417/1/src/libs/ck-libs/ckloop/CkLoop.C) which removes the warning, but I guess it's better to take care of it right now in a separate CL.

PhilMiller commented 5 years ago

Original date: 2015-07-09 21:17:22


Your change was http://charm.cs.uiuc.edu/gerrit/#/c/707/ right?

jcphill commented 5 years ago

Original date: 2015-08-05 18:38:46


Yes.

jcphill commented 5 years ago

Original date: 2015-08-05 18:47:55


OK, here is an obvious race condition for Windows:

    volatile int curChunkIdx;
    ...
    int getNextChunkIdx() {
    #if defined(_WIN32)
    #if CMK_SMP
        CmiLock(cmiMemoryLock);
        curChunkIdx = curChunkIdx + 1;
        CmiUnlock(cmiMemoryLock);
        return curChunkIdx;
    #else
        curChunkIdx++;
        return curChunkIdx;
    #endif
    #else
        return __sync_add_and_fetch(&curChunkIdx, 1);
    #endif
    }

curChunkIdx should be copied to a local variable rather than returned directly.

harshithamenon commented 5 years ago

Original date: 2015-08-17 17:56:50


Fix: https://charm.cs.illinois.edu/gerrit/#/c/767/
https://github.com/UIUC-PPL/charm/commit/ade3f92739213e5317a5090fd681fe80792adb5c

Are any other race conditions known at present? If not, I will close this bug.

harshithamenon commented 5 years ago

Original date: 2015-08-21 01:32:49


No other issues reported, so closing this.