charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.
Apache License 2.0

Use of function pointers causes CkCallback errors in some ASLR environments #2018

Open evan-charmworks opened 5 years ago

evan-charmworks commented 5 years ago

Original issue: https://charm.cs.illinois.edu/redmine/issues/2018


For example, running examples/charm++/wave2d across two or more hosts on AArch64 results in:

Mismatched callback details: reducers (CkReduction::sum_int, CkReduction::sum_int); callback types (CkCallback::call1Fn, CkCallback::call1Fn)

(This callback originates from the liveViz implementation.)

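If this error check is disabled, the program segfaults instead. Messages containing callbacks of type call1Fn or callCFn carry raw function pointers, and with ASLR each process may load its code at a different address, so a pointer that is valid on the sending host is not guaranteed to be valid on the receiving one.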

stwhite91 commented 5 years ago

Original date: 2018-10-26 21:49:55


The proposed solution is to provide an interface through which users register every function pointer they will use in callbacks. The runtime stores each one in a table, and its index in that table becomes the global ID through which the function pointer is looked up on each host.

A shorter-term goal would be to detect these errors at runtime and print a message suggesting that ASLR be disabled where possible.

evan-charmworks commented 5 years ago

Related to #159

matthiasdiener commented 4 years ago

Is this now fixed by #2731? @evan-charmworks

evan-charmworks commented 4 years ago

#2731 fixes liveViz in particular so that it no longer uses the function-pointer-based CkCallback constructs that fail when used in a reduction across logical nodes with ASLR. The constructs themselves remain and can still be used in ways that trigger the issue. As a test, I tried removing the call1Fn and callCFn callback types entirely, and the build failed in ck-core/, so they are still referenced there. Those uses either do not trigger the issue or require special steps to trigger it that have not been reported. Further work could still be done, for example adding warnings when a case that triggers the problem is detected.

evan-charmworks commented 4 years ago

Unscheduling because ASLR workarounds and proper documentation are both in place.