charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.
Apache License 2.0
203 stars 49 forks source link

pxshm shared queue lockless implementation is invalid #571

Open ericjbohm opened 10 years ago

ericjbohm commented 10 years ago

Original issue: https://charm.cs.illinois.edu/redmine/issues/571


In studying the issue of the auto-detection of +nodesize, I found several bugs in the existing implementation.

The correct implementations in pxshm use locking, which requires thread infrastructure from the -pthread set of libraries. Works ok for the SMP context where multiple reader/writer situations arise. It is theoretically possible to do a lockless version of that, but that is outside the scope of the immediate problem.

The non-smp (PXHSM only) build defaults to using memory fences and is erroneous. This can be reproduced in megatest in with --with-production --enable-error-checking build of megatest with TESTOPTS="+nodesize 4", which will abort. "ptr - recvBuf->data == recvBuf->header->bytes" failed in file machine-pxshm.c Other choices of +nodesize may result in hangs.

The code as implemented tries to use volatile variables and memory fences as though they were locks. Neither of those is equivalent to a critical section lock on its own, or in combination, and therefore the implementation is riddled with race condiiions. It needs to be rewritten. I am working on a scheme that uses atomic operations.

ericjbohm commented 5 years ago

Original date: 2015-01-23 22:57:09


Best fix would use C11 atomics, or conversion of machine layer infrastructure to C++ and use C++11 atomics. Need compiler ecosystem to catch up before the former could be applied, and the latter would be a massive undertaking.

stwhite91 commented 5 years ago

Original date: 2016-12-14 20:38:29


Note that the compiler ecosystem has basically all caught up, and so C11 atomics are being used in the new lockless PCQueue: https://charm.cs.illinois.edu/gerrit/#/c/1302/ https://github.com/UIUC-PPL/charm/commit/ddf5c291d43655cef50bcb50bad0631cd2fcb818

stwhite91 commented 5 years ago

Original date: 2017-02-01 18:04:37


pxshm is becoming slightly more important with wider nodes, so bumping up the priority. Also putting under SMP because this issue is more related to SMP ones than other machine layer ones.

Perhaps we should re-assign this if it is not a priority for you?

stwhite91 commented 5 years ago

Original date: 2017-10-31 14:00:30


We are looking to use CMA rather than pxshm where possible, possibly obviating the need for this.