charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.
Apache License 2.0

Idle PEs compete with comm thread for node queue lock #1035

Open jcphill opened 8 years ago

jcphill commented 8 years ago

Original issue: https://charm.cs.illinois.edu/redmine/issues/1035


We've observed via Projections that, in SMP mode when most of the PEs on a node are idle, the communication thread is very slow at adding incoming messages to the node queue. I assume this is because there is a single lock on the node queue, which the idle PEs constantly fight over, preventing the comm thread from adding work. I would suggest a two-level locking approach: add a consumer lock that PEs must obtain before taking the queue lock to receive messages. That way, idle PEs compete only with each other (which is fine). PEs should try the consumer lock but not wait for it to become available, so as not to delay checking their private queues. Adding to the node queue would then require only the main lock.

jcphill commented 5 years ago

Original date: 2016-04-14 14:26:08


This appears to be the node queue receive function in src/arch/util/machine-common-core.c:

    void *CmiGetNonLocalNodeQ(void) {
        CmiState cs = CmiGetState();
        char *result = 0;
        CmiIdleLock_checkMessage(&cs->idle);
        if (!CMIQueueEmpty(CsvAccess(NodeState).NodeRecv)) {
            MACHSTATE1(3,"CmiGetNonLocalNodeQ begin %d {", CmiMyPe());
            CmiLock(CsvAccess(NodeState).CmiNodeRecvLock);
            result = (char *) CMIQueuePop(CsvAccess(NodeState).NodeRecv);
            CmiUnlock(CsvAccess(NodeState).CmiNodeRecvLock);
            MACHSTATE1(3,"} CmiGetNonLocalNodeQ end %d ", CmiMyPe());
        }
        return result;
    }

Rather than checking that the queue is non-empty and then calling CmiLock, it may be better to call CmiTryLock, e.g.:

    void *CmiGetNonLocalNodeQ(void) {
        CmiState cs = CmiGetState();
        char *result = 0;
        CmiIdleLock_checkMessage(&cs->idle);
        while (!CMIQueueEmpty(CsvAccess(NodeState).NodeRecv)) {
            if (CmiTryLock(CsvAccess(NodeState).CmiNodeRecvLock)) continue;
            MACHSTATE1(3,"CmiGetNonLocalNodeQ begin %d {", CmiMyPe());
            result = (char *) CMIQueuePop(CsvAccess(NodeState).NodeRecv);
            CmiUnlock(CsvAccess(NodeState).CmiNodeRecvLock);
            MACHSTATE1(3,"} CmiGetNonLocalNodeQ end %d ", CmiMyPe());
            break;
        }
        return result;
    }

What is confusing is that PCQueuePop appears to already be thread safe, in which case the real question is why do CmiPushNode() and CmiSendNodeSelf() need to take CmiNodeRecvLock before calling CMIQueuePush?

harshithamenon commented 5 years ago

Original date: 2016-05-12 19:17:56


Jim, Bilge mentioned that you had implemented this and sent a patch to Antti-Pekka, but it did not show benefit. Can you update this issue with the details?

Thank you

jcphill commented 5 years ago

Original date: 2016-05-13 16:33:21


I tried the above variant, as well as adding a lock to restrict node queue access to a single consumer at a time. Neither showed improvement in the observed performance.

stwhite91 commented 5 years ago

Original date: 2016-12-14 21:42:12


The new lockless PCQueue should alleviate this issue: https://charm.cs.illinois.edu/gerrit/#/c/1302/ https://github.com/UIUC-PPL/charm/commit/ddf5c291d43655cef50bcb50bad0631cd2fcb818

stwhite91 commented 5 years ago

Original date: 2017-02-01 17:38:24


The new lockless queue (gerrit patch linked above) will address this issue, so re-assigning to Bilge as the owner of that patch.

stwhite91 commented 5 years ago

Original date: 2017-04-04 20:52:27


We are not planning on merging the lockless queue before the 6.8.0 release, since it is high risk this close to the release.

stwhite91 commented 5 years ago

Original date: 2018-01-29 12:47:27


This issue needs to be re-evaluated.

stwhite91 commented 5 years ago

Original date: 2018-02-01 14:44:15


One proposed solution: instead of having one multi-producer/single-consumer queue per PE that the comm thread and all worker threads alike push work into, split that queue into two queues: one single-producer/single-consumer queue at each PE for the comm thread only to push messages to, and one multi-producer/single-consumer queue per PE for all the other worker threads in its process to push messages to. This would eliminate the contention between comm thread and worker threads, though it would add an extra queue for the scheduler to poll.

Abhishek and Samarth spent some time benchmarking the different queues we have last semester, so we could reassign this to them? Seonmyeong also has experience with C++11 atomics and with the Csd module.

Anyway, it seems like a stretch for this to be resolved before 6.9.0.

stwhite91 commented 5 years ago

Original date: 2018-02-11 19:02:52


Since there hasn't been any movement on this recently and it doesn't involve any user-visible API changes, retargeting to 6.9.1

stwhite91 commented 5 years ago

Original date: 2018-08-29 20:33:43


Is there a reproducible test program that exhibits this issue, so that we can get someone working on it?