charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.
Apache License 2.0
203 stars 50 forks source link

Array pointers in messages can be overwritten by generated code in the message class constructor (causes crash on GCC 6+) #1045

Closed PhilMiller closed 7 years ago

PhilMiller commented 8 years ago

Original issue: https://charm.cs.illinois.edu/redmine/issues/1045


https://charm.cs.illinois.edu/private/tms/listlog.php?param=1111

Upcoming changes in GCC 6 optimizer capabilities may resurface this old bug even with the workaround implemented https://gcc.gnu.org/gcc-6/changes.html

-flifetime-dse is more aggressive in dead-store elimination in situations where a memory store to a location precedes a constructor to the memory location.

If the various bits of memory allocation and construction get inlined, potentially even across object files with link-time optimization, then the stores to the pointers from the message struct to the associated spots in the memory buffer allocated for it and its payload might get eliminated.

PhilMiller commented 5 years ago

Original date: 2016-08-28 20:59:30


I need to investigate more, but I think this has come to the fore.

PhilMiller commented 5 years ago

Original date: 2016-08-28 21:16:33


This crashes in megatest

./build test netlrts-linux-x86_64 --suffix=gcc6-prod-lifetime-dse --with-production --enable-errorchecking -j4 -g

Whereas this:

./build test netlrts-linux-x86_64 --suffix=gcc6-prod-no-lifetime-dse --with-production --enable-errorchecking -j4 -g -fno-lifetime-dse

Does not crash, successfully runs the full suite of tests and examples.

PhilMiller commented 5 years ago

Original date: 2016-08-28 22:02:03


More details on the optimization here: https://gcc.gnu.org/gcc-6/porting_to.html

trquinn commented 5 years ago

Original date: 2016-10-12 04:39:26


ChaNGa also dies in a CkReduction when compiled with GCC 6.2.1. Recompiling with fno-lifetime-dse fixes the problem.

PhilMiller commented 5 years ago

Original date: 2016-10-19 21:35:55


https://charm.cs.illinois.edu/gerrit/1928 https://github.com/UIUC-PPL/charm/commit/97b11599260b02a85d78318348f1d3064913170f

stwhite91 commented 5 years ago

Original date: 2016-10-29 14:58:01


On net-win64 we get compiler warnings because MSVC doesn't support the -fno-lifetime-dse option:

Ignored Unrecognized argument -fno-lifetime-dse
stwhite91 commented 5 years ago

Original date: 2016-11-04 15:23:53


XLC actually gives an error for this, so the build aborts:

../bin/charmc -host  -I../../src/xlat-i/ -I../../src/xlat-i/sdag/ -I.   -c -o xi-main.o ../../src/xlat-i/xi-main.C
g++: unrecognized option '-qlanglvl=extended0x'
cc1plus: error: unrecognized command line option "-fno-lifetime-dse"
PhilMiller commented 5 years ago

Original date: 2016-11-04 15:26:45


Ugh, the configure test is supposed to catch that. I'll fix.

nikhil-jain commented 5 years ago

Original date: 2016-11-06 01:06:16


Phil Miller wrote:

Ugh, the configure test is supposed to catch that. I'll fix.

Config does catch it, but on BG/Q cross compilation means we use different compilers for SEQ and rest.

PhilMiller commented 5 years ago

Original date: 2016-11-08 00:27:54


On examination, there is no case where the flags we're detecting in configure should actually be passed to the native/host compiler. So, disable passing those flags to the native compiler, and we've fixed the Blue Gene Q issue, and partially mitigated the (less problematic) MSVC issue: https://charm.cs.illinois.edu/gerrit/#/c/1972/ https://github.com/UIUC-PPL/charm/commit/5d10d7ed32baf75c39fb01874d80803978cbf906 Would test on BG/Q, but it's not back online from maintenance yet.

PhilMiller commented 5 years ago

Original date: 2016-11-08 18:27:17


Subsequent fixes pushed, awaiting review.

PhilMiller commented 5 years ago

Original date: 2016-11-09 18:20:08


It'll be nice to see this get through autobuild, but the fix is in.

PhilMiller commented 5 years ago

Original date: 2017-07-11 16:14:17


My original notes from 2011-01-07:

Shortly before Christmas, Akhil and I isolated a bug in Charm++ code that only exhibits in newer compilers. Specifically, when creating messages containing variable length arrays, we rely on the memory allocation returned by our operator new() being passed to the message's constructor unmodified. In some circumstances [1], this is not the case, and the pointers that the allocator initializes as offsets into the allocated buffer get overwritten. Subsequent code dereferencing those pointers goes astray.

Two major things to consider:

  1. We should not make a new (non-bugfix) release with a bug like this looming over our heads, and
  2. I'm pretty sure fixing this will require an API change, though the magnitude of that change spans many possibilities.

The plan for this meeting is to explore what those possibilities are, and what they entail in terms of implementation effort, application adaptation effort, maintainability, and performance. I would appreciate if group members with experience developing against APIs with RDMA-like constructs (ibverbs, LAPI) would attend.

[1] Right now, the one I know of is when the compiler generates a default constructor. There may be others, and that set may vary over time, compiler version, options, etc.

Email exchange with Jim, same day:

Jim sent the following while we met:

This might be a product of new initialization rules for plain old data. Try removing the final () from both of these lines in varsizetest2.C:

varsizetest2_Msg *m = new (sizes,0) varsizetest2_Msg();

m = new (sizes,0) varsizetest2_Msg();

The () after the class name is an explicit initializer, which causes all elements to be default inititialized to zero. If you declare a constructor then it's no longer plain old data, and the default constructor is expected to initialize everything.

See http://www.informit.com/guides/content.aspx?g=cplusplus&seqNum=423

Try making the CMessage_varsizetest2_Msg constructor protected in the .decl.h file, which would actually solve the problem if it works.

If that doesn't work, the appropriate API change would be to move the pointer declarations to the CMessage_varsizetest2_Msg class. The user would have to comment them out (or use a macro) in the varsizetest2_Msg declaration and they would only exist in the .ci file.

Next message:

Maybe also try adding a private copy constructor to CMessage_varsizetest2_Msg in the .decl.h file.

Anything to keep C++0X from making the message plain old data.

http://en.wikipedia.org/wiki/C%2B%2B0x

The insight about POD versus class-with-constructor is one we missed, but it doesn't really matter. The compiler is allowed to do whatever it wants with the block of memory returned by operator new() before calling the constructor. XLC even supports a debugging flag to have such memory initialized to a particular pattern. The compiler could also ask for extra bytes that it will take off the beginning of the allocation for whatever purpose, a possibility that we ignore.

The API change suggested is the same as the minimal-effort change we came up with.

Meeting minutes, same day:

We explored the issue, and concluded that some API change would be necessary.

There are two prime possibilities:

When a message declared in a .ci file mentions variable length arrays, the pointers to those arrays should be declared in the base CMessage_foo class and initialized to the appropriate offset in its constructor, using length data hidden somewhere by the runtime. The requires user code to remove the same declarations, possibly conditional on the version of Charm++ in which this change occurs.

We could transition to an API where the client code provides a list of pieces of memory that should end up the in the outgoing message, with lengths and possibly strides. This matches the underlying APIs available in TCP (sendv()), LAPI, and probably others. This smoothly enables later implementation of optional zero-copy optimization if the caller promises not to modify the buffers in question until the send is complete.

January 18:

After discussions with people, I've started along the following path:

  1. Implement the partial workaround that Gengbin described, by modifying charmxi as follows: a. for each message class, declare an array _ck_sizes of size_t with enough slots for all of the variable length arrays b. in CMessage_foo::operator new(), save the sizes passed in as arguments to _ck_sizes c. Move the setting of pointers from CMessage_foo::operator new() to CMessage_foo::CMessage_foo(), using the saved sizes
  2. Start exploring forward-looking APIs that obviate the still-possible issue of overwriting between CMessage_foo::CMessage_foo() and Message::Message()

Currently experiencing difficulty with 1a, in that CpvStaticDeclare won't work for static class members. Can't just declare outside the class with the name appended, because of templates. Will continue tomorrow.

Next update:

Coded bugfix. Tested net-linux-x86_64{,-smp} through added test in megatest.

Commits 5d1a75ed740aaefe35d2902c7b70276e7abcd5ef and c0776f0edad053d8315e6486f1f9ee4e3b28f7e3.

Final note, January 2012:

To actually make this bug impossible would still require API changes