charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.
Apache License 2.0

hangs on verbs error #926

Closed jcphill closed 8 years ago

jcphill commented 8 years ago

Original issue: https://charm.cs.illinois.edu/redmine/issues/926


On Stampede:

    Info: Startup phase 0 took 0.00885892 s, 237.66 MB of memory in use
    [0] wc[0] status 12 wc[i].opcode 0
    [14] wc[0] status 12 wc[i].opcode 0
    [1] wc[0] status 12 wc[i].opcode 0
    ...hangs...

I see in src/arch/verbs/machine-ibverbs.c the following:

    if(wc[i].status != IBV_WC_SUCCESS){
        printf("[%d] wc[%d] status %d wc[i].opcode %d\n",CmiMyNodeGlobal(),i,wc[i].status,wc[i].opcode);
    #if CMK_IBVERBS_STATS
        printf("[%d] msgCount %d pktCount %d packetSize %d total Time %.6lf s processBufferedCount %d processBufferedTime %.6lf s maxTokens %d tokensLeft %d minTokensLeft %d \n",CmiMyNodeGlobal(),msgCount,pktCount,packetSize,CmiTimer(),processBufferedCount,processBufferedTime,maxTokens,context->tokensLeft,minTokensLeft);
    #endif
        CmiAssert(0);
    }

Since CmiAssert compiles to nothing in production mode, it accomplishes nothing here (or rather, the job hangs and wastes computer time). Assertions are for catching bugs, not runtime failures!
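
To illustrate the distinction, here is a minimal standalone sketch in plain C. It is not the Charm++ code: `fake_wc`, `FAKE_WC_SUCCESS`, `count_failed_completions`, `check_debug_only`, and `enforce_no_failures` are hypothetical names made up for this example, and the standard `assert()` stands in for a debug-only macro such as CmiAssert. The point is only that a check which must fire in production cannot be built on something that `NDEBUG` (or an equivalent build flag) removes.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a work-completion entry (NOT struct ibv_wc). */
typedef struct {
    int status;
    int opcode;
} fake_wc;

/* Hypothetical success code for this sketch. */
enum { FAKE_WC_SUCCESS = 0 };

/* Report every failed completion on stderr and return how many failed. */
int count_failed_completions(const fake_wc *wc, int n, int node)
{
    int failed = 0;
    for (int i = 0; i < n; i++) {
        if (wc[i].status != FAKE_WC_SUCCESS) {
            fprintf(stderr, "[%d] wc[%d] status %d opcode %d\n",
                    node, i, wc[i].status, wc[i].opcode);
            failed++;
        }
    }
    return failed;
}

/* Debug-only check: with -DNDEBUG the assert() expands to nothing,
 * so a failure is printed but execution simply continues. */
void check_debug_only(const fake_wc *wc, int n, int node)
{
    int failed = count_failed_completions(wc, n, node);
    assert(failed == 0);
    (void)failed; /* avoid an unused-variable warning under NDEBUG */
}

/* Always-on check: a plain if + abort() is never compiled out,
 * so a runtime failure terminates the process instead of hanging. */
void enforce_no_failures(const fake_wc *wc, int n, int node)
{
    if (count_failed_completions(wc, n, node) > 0) {
        fprintf(stderr, "[%d] fatal: completion error, aborting\n", node);
        abort();
    }
}
```

Calling `enforce_no_failures` on an array containing a non-success status (for example status 12, as in the log above) terminates the process in any build mode, whereas `check_debug_only` only stops it when assertions are enabled.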

jcphill commented 5 years ago

Original date: 2015-12-17 14:56:24


Build from Dec 11 (v6.7.0-rc2-0-g7008690) works fine.

jcphill commented 5 years ago

Original date: 2015-12-17 16:59:59


The root cause is addressed as "Bug #927: ibverbs broken by undef QLOGIC", so this can wait for 6.7.1.

jcphill commented 5 years ago

Original date: 2016-01-28 20:43:50


This is partially resolved by the CkEnforce commit, but proper error messages are still needed; that is now issue #960.

PhilMiller commented 5 years ago

Original date: 2016-02-25 19:49:27


I think we can close this now, since the hangs themselves have been addressed.

rbuch commented 5 years ago

Original date: 2016-03-08 20:18:19


Bilge, can you confirm that we can close this?

bilgeacun commented 5 years ago

Original date: 2016-03-09 02:01:17


Yes, I think we can close it now too.