Open khuck opened 6 years ago
The Release and Debug errors seem to be unrelated. While the release error comes out of an actual action invocation, the Debug error happens in Spirit during parsing (presumably a PhySL expression). Both actually could be stack overflows :/
I think I eliminated the stack overflow issue by doubling the stack size (changing ulimit -s) and getting the same crash, in the same location.
@khuck I don't think ulimit -s has any bearing on the stack size used by HPX for its threads. I wouldn't rule out a stack overflow for this problem.
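To illustrate the distinction: ulimit -s sets the process's RLIMIT_STACK, while a runtime that creates its own threads can choose their stack size itself. The sketch below is an analogy only. It uses Python's threading module on a Unix-like system, not HPX, and the 256 KiB value is an arbitrary figure picked for illustration.

```python
import resource
import threading

def stack_settings(thread_stack_bytes=256 * 1024):
    """Return (process soft stack limit, stack size seen by a new thread)."""
    # What `ulimit -s` controls: the soft RLIMIT_STACK of the process.
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_STACK)

    # Stack size for threads created from now on. The program picks this
    # value explicitly; it is not read from RLIMIT_STACK.
    previous = threading.stack_size(thread_stack_bytes)
    try:
        seen = []
        worker = threading.Thread(
            target=lambda: seen.append(threading.stack_size()))
        worker.start()
        worker.join()
    finally:
        threading.stack_size(previous)  # restore the earlier setting
    return soft_limit, seen[0]

soft_limit, per_thread = stack_settings()
print("RLIMIT_STACK (soft):", soft_limit)
print("stack size for new threads:", per_thread)
```

Changing ulimit -s before running this alters the first number but not the second, which is the same reason doubling the limit would not necessarily rule out an overflow on a runtime-managed stack.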
OK, trying with HPX_WITH_STACKOVERFLOW_DETECTION_DEFAULT=On
Running with HPX_WITH_STACKOVERFLOW_DETECTION_DEFAULT=On didn't change anything - it still crashed in roughly the same location, but slightly different:
#0 0x00003fffaf14cdd0 in hpx::threads::thread_data::set_description(hpx::util::thread_description) ()
from /home/users/khuck/src/phylanx/tools/buildbot/build-centaur-ppc64le-Linux-clang/hpx-Release/lib/libhpx.so.1
#1 0x00003fffaf149d4c in hpx::threads::set_thread_description(hpx::threads::thread_id_type const&, hpx::util::thread_description const&, hpx::error_code&) ()
from /home/users/khuck/src/phylanx/tools/buildbot/build-centaur-ppc64le-Linux-clang/hpx-Release/lib/libhpx.so.1
#2 0x00003fffb027ce40 in hpx::util::annotate_function::annotate_function(char const*) ()
from /home/users/khuck/src/phylanx/tools/buildbot/build-centaur-ppc64le-Linux-clang/phylanx-Release/lib/libhpx_phylanx.so.0
#3 0x00003fffb027b294 in phylanx::execution_tree::primitives::primitive_component_base::do_eval(std::vector<phylanx::execution_tree::primitive_argument_type, std::allocator<phylanx::execution_tree::primitive_argument_type> > const&, phylanx::execution_tree::eval_mode) const ()
from /home/users/khuck/src/phylanx/tools/buildbot/build-centaur-ppc64le-Linux-clang/phylanx-Release/lib/libhpx_phylanx.so.0
#4 0x00003fffb0244a90 in phylanx::execution_tree::primitives::primitive_component::eval(std::vector<phylanx::execution_tree::primitive_argument_type, std::allocator<phylanx::execution_tree::primitive_argument_type> > const&, phylanx::execution_tree::eval_mode) const ()
from /home/users/khuck/src/phylanx/tools/buildbot/build-centaur-ppc64le-Linux-clang/phylanx-Release/lib/libhpx_phylanx.so.0
Could it be that there is an operation that is just missing an annotation, or is getting mis-annotated in some way?
After compiling with Clang 6.0 on an x86_64 machine, I think I confirmed it's a POWER8-specific problem. Is there something specific about this particular primitive that does something unusual?
@hkaiser - another clue... as you pointed out, the crash is in:
#367 0x00003fffb0609664 in phylanx::bindings::expression_evaluator(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, phylanx::bindings::compiler_state&, pybind11::args)::{lambda()#1}::operator()() const (this=0x3fffffffcd88)
at /home/users/khuck/src/phylanx/python/src/bindings/binding_helpers.hpp:181
...but the expression it is parsing is not that crazy:
181 auto xexpr = phylanx::ast::generate_ast(xexpr_str);
(gdb) print xexpr_str
$1 = "\nblock(\n define(fib,n,\n if(n<2,n,\n fib(n-1)+fib(n-2))),\n fib)"
except for the fact that it is a recursive definition.
Also, this didn't crash when I built it on an x86_64 machine (HPX and Phylanx were built with Clang 5.0) that used the ubuntu boost package (built by GCC, I assume). Whereas the machine that is crashing was using a boost built by clang 5.0.
Also, I built the test that crashed with -fstack-protector-all -fstack-protector-strong and didn't see any difference.
@hkaiser The stack stuff might be a red herring. I have another clue. I tried running a RelWithDebInfo build. It crashes, but in a different way. 2 steps up the stack, the program is in the "eval" method of the primitive_component base class. When I dereference the "this" pointer, I get this back:
#2 0x00003fffb0259fa0 in phylanx::execution_tree::primitives::primitive_component::eval (this=0x10fc05a0,
params=..., mode=<optimized out>)
at /home/users/khuck/src/phylanx/src/execution_tree/primitives/primitive_component.cpp:123
123 return primitive_->do_eval(params, mode);
(gdb) print this
$10 = (const phylanx::execution_tree::primitives::primitive_component *) 0x10fc05a0
(gdb) print *this
$11 = {<hpx::components::component_base<phylanx::execution_tree::primitives::primitive_component>> = {<hpx::components::detail::base_component> = {<hpx::traits::detail::component_tag> = {<No data fields>}, gid_ = {
static credit_base_mask = 31, static credit_shift = 24, static credit_mask = 520093696,
static was_split_mask = 2147483648, static has_credits_mask = 1073741824,
static is_locked_mask = 536870912, static locality_id_mask = 18446744069414584320,
static locality_id_shift = 32, static virtual_memory_mask = 4194303,
static dont_cache_mask = 8388608, static is_migratable = 4194304, static dynamically_assigned = 1,
static component_type_base_mask = 1048575, static component_type_shift = 1,
static component_type_mask = 2097150, static credit_bits_mask = 3741319168,
static internal_bits_mask = 4290772992, static special_bits_mask = 18446744073707454462,
id_msb_ = 4294967376, id_lsb_ = 284951968}}, <No data fields>}, primitive_ = warning: RTTI symbol not found for class 'std::_Sp_counted_ptr_inplace<phylanx::execution_tree::primitives::access_function, std::allocator<phylanx::execution_tree::primitives::access_function>, (__gnu_cxx::_Lock_policy)2>'
warning: RTTI symbol not found for class 'std::_Sp_counted_ptr_inplace<phylanx::execution_tree::primitives::access_function, std::allocator<phylanx::execution_tree::primitives::access_function>, (__gnu_cxx::_Lock_policy)2>'
std::shared_ptr (count 1, weak 0) 0x10fb77e0}
...which seems OK, except for the RTTI warning. Then, stepping down the stack things get interesting:
(gdb) down
#1 0x00003fffb028adb4 in phylanx::execution_tree::primitives::primitive_component_base::do_eval (
this=0x10fb77e0, params=std::vector of length 1, capacity 1 = {...},
mode=(phylanx::execution_tree::eval_dont_wrap_functions | phylanx::execution_tree::eval_dont_evaluate_partials | phylanx::execution_tree::eval_dont_evaluate_lambdas))
at /home/users/khuck/src/phylanx/src/execution_tree/primitives/primitive_component_base.cpp:89
89 auto f = this->eval(params, mode);
which also seems OK. But then, taking one more step into the concrete instance of the object:
(gdb) down
#0 0x00003fffb009d230 in phylanx::execution_tree::primitives::access_function::eval (this=0x0,
params=std::vector of length 1, capacity 1 = {...},
mode=(phylanx::execution_tree::eval_dont_wrap_functions | phylanx::execution_tree::eval_dont_evaluate_partials | phylanx::execution_tree::eval_dont_evaluate_lambdas))
at /home/users/khuck/src/phylanx/src/execution_tree/primitives/access_function.cpp:57
57 {
...you'll notice the "this" pointer is null! So for some reason, this object is either corrupted, or...? Is something missing from the implementation of phylanx::execution_tree::primitives::access_function so that it isn't getting handled like the other primitives?
@hkaiser any thoughts on the above? I thought maybe it was because access_function didn't inherit from public std::enable_shared_from_this<access_function> like the other primitives. Could that be the case?
@khuck: I don't think this causes the issue we're seeing. All primitives are kept alive by a shared_ptr in any case. Most of them, however, additionally need to stay alive for 'delayed' operation (requiring the enable_shared_from_this); access_variable is not one of those, iirc.
@hkaiser ok. I started playing with the code in eval.py, and it's crashing on the definition of fib10 (compressed here):
fib10 = et.eval(" block( define(fib,n, if(n<2,n, fib(n-1)+fib(n-2))), fib) ", cs, 10)
BUT if I change it to fib9, it works:
fib9 = et.eval(" block( define(fib,n, if(n<2,n, fib(n-1)+fib(n-2))), fib) ", cs, 9)
...and the same is true of the fib() function defined later, if I call it with fib(9) it's OK, but fib(10) crashes. So it is stack related, but it's the stack of the AST that is the problem. Reminder, this is Clang 5.0 on POWER8, so different beast than GCC on x86_64.
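To make the fib(9) versus fib(10) boundary concrete, here is a minimal sketch in plain Python that mirrors the PhySL definition and counts calls and nesting depth. It says nothing about how many native stack frames Phylanx needs per level; the helper name fib_profile is made up for this illustration.

```python
def fib_profile(n):
    """Return (fib(n), total calls, deepest nesting) for the naive recursion."""
    calls = 0
    deepest = 0

    def fib(k, depth):
        nonlocal calls, deepest
        calls += 1
        deepest = max(deepest, depth)
        if k < 2:
            return k
        # Same shape as the PhySL body: fib(n-1) + fib(n-2)
        return fib(k - 1, depth + 1) + fib(k - 2, depth + 1)

    return fib(n, 1), calls, deepest

for n in (9, 10):
    value, calls, deepest = fib_profile(n)
    print(f"fib({n}) = {value}: {calls} calls, nesting depth {deepest}")
```

The two cases differ by a single level of nesting (9 versus 10), so if one extra level is enough to tip the run over, each level of the recursion presumably costs a large amount of stack in the evaluator.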
Yup, stack related. This issue will stay open, but a work-around for that platform has been committed. See pull request #601
This PR enables stack overflow prevention in HPX on Power platforms: https://github.com/STEllAR-GROUP/hpx/pull/3469. Please verify.
@khuck I believe the calculation of the remaining amount of stack space in my original patch was wrong. Could you try again, please?
@hkaiser nope, same error. I have asked for someone to send me the instructions for getting an account on our system if you want to test it yourself...
@khuck can you try a build with address sanitizer? This is usually very accurate in pinpointing issues.
@sithhell I did. I ran into so many linker issues I couldn't figure out how to fix them. I tried with valgrind, but after 3-4 hours building a suppression file, I was no closer to the cause of the problem.
@khuck for the linker errors, configure your HPX build with -DHPX_WITH_SANITIZERS=On. This should solve most of them.
@sithhell IIRC, building HPX wasn't the problem, but building Phylanx was. The address sanitizer library was supposed to be first in the link order, but it wasn't. Besides, I built Clang 5.0 myself for this machine, and it's possible I didn't configure/build the sanitizer libraries correctly.
@sithhell @khuck it would be great to have a docker image with the address sanitizer enabled and working correctly.
@sithhell yes it would - are you volunteering? :)
OK, I attempted to make a Phylanx docker image that uses the address sanitizer, but it fails at the Phylanx link step. Here's the Dockerfile I attempted to use:
https://gist.github.com/stevenrbrandt/56cc36a9c9cb0375ae264c398d0e3431
Setting -lasan in CMAKE_EXE_LINKER for Phylanx seems to do nothing.
However, setting -lasan in CMAKE_CXX_FLAGS allows Phylanx to link, though it gives a bunch of spurious warning messages about using a link flag while not linking.
Regardless, I can't run bin/physl because I get this error:
build]# bin/physl --doc
==15==Your application is linked against incompatible ASan runtimes.
Not sure how that comes about, since I only have the default Clang / libasan installed.
@sithhell Any idea what I'm doing wrong?
Address Sanitizer is really temperamental sometimes… Instead of adding -lasan, can you add the specific library, i.e. /path/to/compiler/lib/libasan.so, to make sure you get the right one?
Kevin
@khuck I've discovered the -shared-libasan flag. I'm experimenting with that.
@khuck @sithhell Current Dockerfile: https://gist.github.com/stevenrbrandt/27e1d4eb5fd86a4b57697567c3964697
Ok, this uses -shared-libasan and everything compiles, but when I try to run Phylanx Hello World, I get this:
==27==Shadow memory range interleaves with an existing memory mapping. ASan cannot proceed correctly. ABORTING.
==27==ASan shadow was supposed to be located in the [0x00007fff7000-0x10007fff7fff] range.
==27==This might be related to ELF_ET_DYN_BASE change in Linux 4.12.
==27==See https://github.com/google/sanitizers/issues/856 for possible workarounds.
==27==Process memory map follows:
0x000000400000-0x0000007be000 /usr/bin/python3.6
0x0000009bd000-0x0000009be000 /usr/bin/python3.6
0x0000009be000-0x000000a5b000 /usr/bin/python3.6
0x000000a5b000-0x000000a8f000
Not sure what to do at this point.
@stevenrbrandt just curious - are you using the system allocator or tcmalloc/jemalloc?
@stevenrbrandt also, what happens if you run an example without python involved? like lra_csv or something like that?
@khuck I'm using the System Allocator, see the docker file I linked.
You can't even run "physl --doc" without problems:
# bin/physl --doc
terminate called after throwing an instance of 'std::runtime_error'
what(): Cannot instantiate more than one affinity data instance
Aborted
So, a small success (I think). The problem seems to have partly been the 80 core cluster I built it on...
Running on a smaller machine, I get this. You can try out stevenrbrandt/phylanx.sanitized from Docker yourself.
# ./bin/physl --doc
=================================================================
==27==ERROR: AddressSanitizer: odr-violation (0x7fb8b739c940):
[1] size=32 'hpx::util::detail::global_fixture' /hpx/src/util/lightweight_test.cpp:56:13
[2] size=32 'hpx::util::detail::global_fixture' /hpx/src/util/lightweight_test.cpp:56:13
These globals were registered at these points:
[1]:
#0 0x7fb8cd3385c8 (/usr/local/clang_7.0.0/lib/clang/7.0.0/lib/linux/libclang_rt.asan-x86_64.so+0x675c8)
#1 0x7fb8b46582dd in asan.module_ctor (/usr/local/lib/phylanx/libphylanx_solversd.so+0x39b2dd)
[2]:
#0 0x7fb8cd3385c8 (/usr/local/clang_7.0.0/lib/clang/7.0.0/lib/linux/libclang_rt.asan-x86_64.so+0x675c8)
#1 0x7fb8b6e14f7d in asan.module_ctor (/usr/local/lib/phylanx/libphylanx_arithmeticsd.so+0x2503f7d)
==27==HINT: if you don't care about these errors you may set ASAN_OPTIONS=detect_odr_violation=0
SUMMARY: AddressSanitizer: odr-violation: global 'hpx::util::detail::global_fixture' at /hpx/src/util/lightweight_test.cpp:56:13
==27==ABORTING
Did you try export ASAN_OPTIONS=detect_odr_violation=0 before running?
@khuck using that setting, physl --doc does run. I get this at the end:
==96==Could not attach to thread 72 (errno 1).
==96==Failed suspending threads.
==72==LeakSanitizer has encountered a fatal error.
==72==HINT: For debugging, try setting environment variable LSAN_OPTIONS=verbosity=1:log_threads=1
==72==HINT: LeakSanitizer does not work under ptrace (strace, gdb, etc)
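For anyone scripting these runs, the sketch below shows one way to add sanitizer options to the environment of a child process without discarding options that are already set. The helper name with_sanitizer_options is hypothetical, and the sketch assumes ASAN_OPTIONS is written as a colon-separated list of key=value pairs. Whether also passing detect_leaks=0 would avoid the LeakSanitizer failure above has not been tried in this thread.

```python
import os

def with_sanitizer_options(env, **options):
    """Return a copy of env with the given options merged into ASAN_OPTIONS."""
    merged = {}
    # Keep whatever is already set (assumed format: key=value:key=value).
    for item in env.get("ASAN_OPTIONS", "").split(":"):
        if "=" in item:
            key, value = item.split("=", 1)
            merged[key] = value
    # New values override existing ones with the same key.
    merged.update({key: str(value) for key, value in options.items()})
    result = dict(env)
    result["ASAN_OPTIONS"] = ":".join(f"{k}={v}" for k, v in merged.items())
    return result

env = with_sanitizer_options(os.environ, detect_odr_violation=0)
print(env["ASAN_OPTIONS"])
# The environment can then be handed to the binary under test, e.g.
# subprocess.run(["bin/physl", "--doc"], env=env)
```

Passing the environment explicitly keeps the setting scoped to the one command, rather than exporting it for the whole shell session.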
OK, so Python doesn't work, but we can generate PhySL code from Python on another machine and run the PhySL code inside the phylanx.sanitized image.
Trying LRA with the PhySL interpreter (./examples/interpreter/lra.physl), I get this:
==191==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7f4897636550 at pc 0x7f48d0a1017c bp 0x7f4897636540 sp 0x7f4897635cf0
WRITE of size 16 at 0x7f4897636550 thread T15
#0 0x7f48d0a1017b (/usr/local/clang_7.0.0/lib/clang/7.0.0/lib/linux/libclang_rt.asan-x86_64.so+0xd717b)
#1 0x7f48c2deb290 in std::chrono::_V2::steady_clock::now() (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0xb5290)
#2 0x7f48cfa4a8b7 in hpx::util::high_resolution_clock::now() /usr/local/include/hpx/util/high_resolution_clock.hpp:30:17
#3 0x7f48cfa4843a in phylanx::util::scoped_timer<long>::~scoped_timer() /phylanx/phylanx/util/scoped_timer.hpp:37:25
#4 0x7f48cfa42918 in phylanx::execution_tree::primitives::primitive_component_base::do_eval(std::vector<phylanx::execution_tree::primitive_argument_type, std::allocator<phylanx::execution_tree::primitive_argument_type> > const&, phylanx::execution_tree::eval_mode) const /phylanx/src/execution_tree/primitives/primitive_component_base.cpp:103:5
#5 0x7f48cf9687da in phylanx::execution_tree::primitives::primitive_component::eval(std::vector<phylanx::execution_tree::primitive_argument_type, std::allocator<phylanx::execution_tree::primitive_argument_type> > const&, phylanx::execution_tree::eval_mode) const /phylanx/src/execution_tree/primitives/primitive_component.cpp:123:28
OK, using the Address Sanitizer
block(
define(fib,n,
if(n<2,n,
fib(n-1)+fib(n-2))),
cout(fib(16))
)
runs without difficulty. This was the code that originally prompted the ticket (correct me if I'm wrong).
Of course, this was clang 7 not 8
@khuck I've updated stevenrbrandt/phylanx.sanitized so that it's only 6.6GB. Still not tiny.
@khuck should we get this sanitized phylanx image running in your test framework?
@stevenrbrandt yes, if you could send me the cmake configuration steps, I would appreciate it.
@khuck Run docker pull stevenrbrandt/phylanx.sanitized to get the image. The Dockerfile itself is inside the image as /Dockerfile.
The Release build call stack is massive (318 functions deep) and the test fails this way:
The Debug build fails in a different location, but with an equally massive call stack (in the ~436 range). In the Debug build, it appears a boost "unused" type is passed as an attribute/context somewhere deep in boost: