STEllAR-GROUP / phylanx

An Asynchronous Distributed C++ Array Processing Toolkit
Boost Software License 1.0

physl sometimes does not write newick tree or graph #1186

Open stevenrbrandt opened 4 years ago

stevenrbrandt commented 4 years ago

The command

['mpirun', '-np', '1', '-machinefile', 'hosts.txt', '/work/sbrandt/phylanx/build/bin/physl', '--dump-counters=py-csv.txt', '--dump-newick-tree=py-tree.txt', '--dump-dot=py-graph.txt', '--performance', '--print=result.py', 'call_lra_demo.physl']

applied to the physl code generated by this python code

import numpy as np

def lra_demo(x, y, alpha, iterations, enable_output):
    # One weight per feature column of x, initialised to zero.
    weights = np.zeros(np.shape(x)[1])
    transx = np.transpose(x)
    pred = np.zeros(np.shape(x)[0])
    error = np.zeros(np.shape(x)[0])
    gradient = np.zeros(np.shape(x)[1])
    step = 0
    while step < iterations:
        if (enable_output):
            print("step: ", step, ", ", weights)
        # Logistic (sigmoid) prediction for every row of x.
        pred = 1.0 / (1.0 + np.exp(-np.dot(x, weights)))
        error = pred - y
        # Gradient of the logistic loss, followed by a gradient-descent update.
        gradient = np.dot(transx, error)
        weights = weights - (alpha * gradient)
        step += 1
    return weights

and using the breast cancer data https://raw.githubusercontent.com/STEllAR-GROUP/phylanx/master/examples/algorithms/datasets/breast_cancer.csv

sometimes generates the files py-tree.txt and py-graph.txt, and sometimes does not.
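Because the failure is intermittent, it may help to rerun the command several times and record which runs lose their output. The following is only a minimal sketch for that, not part of the original report: the helper names `missing_outputs` and `count_failures` are hypothetical, and it assumes Python 3.8+ and that the output files are written to the working directory under the names passed to `--dump-newick-tree` and `--dump-dot`.

```python
import subprocess
from pathlib import Path

# File names taken from the --dump-newick-tree / --dump-dot flags in the report.
EXPECTED_OUTPUTS = ("py-tree.txt", "py-graph.txt")


def missing_outputs(workdir, expected=EXPECTED_OUTPUTS):
    """Return the expected file names that do not exist in workdir."""
    workdir = Path(workdir)
    return [name for name in expected if not (workdir / name).is_file()]


def count_failures(cmd, runs, workdir="."):
    """Run cmd `runs` times; collect (return code, missing files) for each
    run that did not produce every expected output file."""
    failures = []
    for _ in range(runs):
        # Delete stale outputs so a file left by an earlier run is not counted.
        for name in EXPECTED_OUTPUTS:
            (Path(workdir) / name).unlink(missing_ok=True)
        completed = subprocess.run(cmd, cwd=workdir)
        missing = missing_outputs(workdir)
        if missing:
            failures.append((completed.returncode, missing))
    return failures
```

Passing the argument list from the report as `cmd` with, say, `runs=20` would give a rough failure rate. Note that the recorded return code is that of the launched process (here `mpirun`), so it may not reflect a crash inside `physl` itself.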

stevenrbrandt commented 4 years ago

Note there are no errors, nothing aborts, and result.py is written.

stevenrbrandt commented 4 years ago

It seems that there was, in fact, a segfault:

    at /usr/include/c++/7/ext/atomicity.h:81
81      if (__gthread_active_p())
#0  __gnu_cxx::__exchange_and_add_dispatch (__mem=0x7fff7819b818, __val=-1)
    at /usr/include/c++/7/ext/atomicity.h:81
#1  0x00000000004f2bfd in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7fff7819b810) at /usr/include/c++/7/bits/shared_ptr_base.h:151
#2  0x00000000004ed015 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7fff700c3f08, __in_chrg=<optimized out>)
    at /usr/include/c++/7/bits/shared_ptr_base.h:684
#3  0x00007ffff6c33bfe in std::__shared_ptr<apex::task_wrapper, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7fff700c3f00, __in_chrg=<optimized out>)
    at /usr/include/c++/7/bits/shared_ptr_base.h:1123
#4  0x00007ffff6c33c1a in std::shared_ptr<apex::task_wrapper>::~shared_ptr (
    this=0x7fff700c3f00, __in_chrg=<optimized out>)
    at /usr/include/c++/7/bits/shared_ptr.h:93
#5  0x00007fffef8057c2 in apex::task_wrapper::~task_wrapper (
    this=0x7fff700c3ee0, __in_chrg=<optimized out>)
    at /hpx/apex/src/apex/task_wrapper.hpp:29
#6  0x00007fffef8057e2 in __gnu_cxx::new_allocator<apex::task_wrapper>::destroy<apex::task_wrapper> (this=0x7fff700c3ee0, __p=0x7fff700c3ee0)
    at /usr/include/c++/7/ext/new_allocator.h:140
#7  0x00007fffef805785 in std::allocator_traits<std::allocator<apex::task_wrapper> >::destroy<apex::task_wrapper> (__a=..., __p=0x7fff700c3ee0)
    at /usr/include/c++/7/bits/alloc_traits.h:487
#8  0x00007fffef805625 in std::_Sp_counted_ptr_inplace<apex::task_wrapper, std::allocator<apex::task_wrapper>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (
    this=0x7fff700c3ed0) at /usr/include/c++/7/bits/shared_ptr_base.h:535

stevenrbrandt commented 4 years ago

I'm wondering if this is the bug Kevin recently fixed?

stevenrbrandt commented 4 years ago

#9  0x00000000004f2c1e in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7fff700c3ed0) at /usr/include/c++/7/bits/shared_ptr_base.h:154
#10 0x00000000004ed015 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7fff940ac848, __in_chrg=<optimized out>)
    at /usr/include/c++/7/bits/shared_ptr_base.h:684
#11 0x00007ffff6c33bfe in std::__shared_ptr<apex::task_wrapper, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7fff940ac840, __in_chrg=<optimized out>)
    at /usr/include/c++/7/bits/shared_ptr_base.h:1123
#12 0x00007ffff6c33c1a in std::shared_ptr<apex::task_wrapper>::~shared_ptr (
    this=0x7fff940ac840, __in_chrg=<optimized out>)
    at /usr/include/c++/7/bits/shared_ptr.h:93
#13 0x00007fffef8057c2 in apex::task_wrapper::~task_wrapper (
    this=0x7fff940ac820, __in_chrg=<optimized out>)
    at /hpx/apex/src/apex/task_wrapper.hpp:29
#14 0x00007fffef8057e2 in __gnu_cxx::new_allocator<apex::task_wrapper>::destroy<apex::task_wrapper> (this=0x7fff940ac820, __p=0x7fff940ac820)
    at /usr/include/c++/7/ext/new_allocator.h:140
#15 0x00007fffef805785 in std::allocator_traits<std::allocator<apex::task_wrapper> >::destroy<apex::task_wrapper> (__a=..., __p=0x7fff940ac820)
    at /usr/include/c++/7/bits/alloc_traits.h:487
#16 0x00007fffef805625 in std::_Sp_counted_ptr_inplace<apex::task_wrapper, std::allocator<apex::task_wrapper>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (
    this=0x7fff940ac810) at /usr/include/c++/7/bits/shared_ptr_base.h:535

stevenrbrandt commented 4 years ago

#17 0x00000000004f2c1e in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7fff940ac810) at /usr/include/c++/7/bits/shared_ptr_base.h:154
#18 0x00000000004ed015 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7fff900cefa8, __in_chrg=<optimized out>)
    at /usr/include/c++/7/bits/shared_ptr_base.h:684
#19 0x00007ffff6c33bfe in std::__shared_ptr<apex::task_wrapper, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7fff900cefa0, __in_chrg=<optimized out>)
    at /usr/include/c++/7/bits/shared_ptr_base.h:1123
#20 0x00007ffff6c33c1a in std::shared_ptr<apex::task_wrapper>::~shared_ptr (
    this=0x7fff900cefa0, __in_chrg=<optimized out>)
    at /usr/include/c++/7/bits/shared_ptr.h:93
#21 0x00007fffef8057c2 in apex::task_wrapper::~task_wrapper (
    this=0x7fff900cef80, __in_chrg=<optimized out>)
    at /hpx/apex/src/apex/task_wrapper.hpp:29
#22 0x00007fffef8057e2 in __gnu_cxx::new_allocator<apex::task_wrapper>::destroy<apex::task_wrapper> (this=0x7fff900cef80, __p=0x7fff900cef80)
    at /usr/include/c++/7/ext/new_allocator.h:140
#23 0x00007fffef805785 in std::allocator_traits<std::allocator<apex::task_wrapper> >::destroy<apex::task_wrapper> (__a=..., __p=0x7fff900cef80)
    at /usr/include/c++/7/bits/alloc_traits.h:487
#24 0x00007fffef805625 in std::_Sp_counted_ptr_inplace<apex::task_wrapper, std::allocator<apex::task_wrapper>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (
    this=0x7fff900cef70) at /usr/include/c++/7/bits/shared_ptr_base.h:535

khuck commented 4 years ago

I have seen something transient lately, but only with MPI / distributed runs. I'll see what I can find.