ValeevGroup / tiledarray

A massively-parallel, block-sparse tensor framework written in C++
GNU General Public License v3.0

error with make_replicated() #74

Open pchong90 opened 8 years ago

pchong90 commented 8 years ago

I observed this error when running mp2_f12_expression in MPQC4.

The error doesn't occur every time. Something seems to go wrong in make_replicated() when array_to_eigen is called while computing the inverse of the two-body two-center integral.

Error Message

A madness exception occurred. Place a break point at madness::exception_break to debug.
libc++abi.dylib: terminating with uncaught exception of type madness::MadnessException: MADNESS ASSERTION FAILED: "/Users/ChongPeng/Workspace/Development/source/tiledarray/external/src/madness/src/madness/world/future.h"(966)
[Chong_Computer:36701] *** Process received signal ***
[Chong_Computer:36701] Signal: Abort trap: 6 (6)
[Chong_Computer:36701] Signal code:  (0)
[Chong_Computer:36701] [ 0] 0   libsystem_platform.dylib            0x00007fff9690d52a _sigtramp + 26
[Chong_Computer:36701] [ 1] 0   ???                                 0x000000000358d4bd 0x0 + 56153277
[Chong_Computer:36701] [ 2] 0   libsystem_c.dylib                   0x00007fff91db96e7 abort + 129
[Chong_Computer:36701] [ 3] 0   libc++abi.dylib                     0x00007fff937b7f81 __cxa_bad_cast + 0
[Chong_Computer:36701] [ 4] 0   libc++abi.dylib                     0x00007fff937dda2f _ZL25default_terminate_handlerv + 243
[Chong_Computer:36701] [ 5] 0   libobjc.A.dylib                     0x00007fff91f8f4a6 _ZL15_objc_terminatev + 124
[Chong_Computer:36701] [ 6] 0   libc++abi.dylib                     0x00007fff937db19e _ZSt11__terminatePFvvE + 8
[Chong_Computer:36701] [ 7] 0   libc++abi.dylib                     0x00007fff937dac12 _ZN10__cxxabiv1L22exception_cleanup_funcE19_Unwind_Reason_CodeP17_Unwind_Exception + 0
[Chong_Computer:36701] [ 8] 0   libtiledarray.dylib                 0x0000000102cb1d75 _ZN7madness7archive16ArchiveStoreImplINS0_19BufferOutputArchiveENSt3__16vectorINS_6FutureIN10TiledArray6TensorIdN5Eigen17aligned_allocatorIdEEEEEENS3_9allocatorISC_EEEEE5storeERKS2_RKSF_ + 613
[Chong_Computer:36701] [ 9] 0   libtiledarray.dylib                 0x0000000102cb1af6 _ZN7madness7archive11ArchiveImplINS0_19BufferOutputArchiveENSt3__16vectorINS_6FutureIN10TiledArray6TensorIdN5Eigen17aligned_allocatorIdEEEEEENS3_9allocatorISC_EEEEE10wrap_storeERKS2_RKSF_ + 38
[Chong_Computer:36701] [10] 0   libtiledarray.dylib                 0x0000000102cb1abd _ZN7madness7archiveanINS0_19BufferOutputArchiveENSt3__16vectorINS_6FutureIN10TiledArray6TensorIdN5Eigen17aligned_allocatorIdEEEEEENS3_9allocatorISC_EEEEEENS3_9enable_ifIXsr17is_output_archiveIT_EE5valueERKSH_E4typeESJ_RKT0_ + 29
[Chong_Computer:36701] [11] 0   libtiledarray.dylib                 0x0000000102cb19e8 _ZN7madness17serialize_am_argsIRKNS_7archive19BufferOutputArchiveERKNSt3__16vectorINS_6FutureIN10TiledArray6TensorIdN5Eigen17aligned_allocatorIdEEEEEENS5_9allocatorISE_EEEEJRKNS7_IvEESM_SM_SM_SM_SM_SM_EEEvOT_OT0_DpOT1_ + 72
[Chong_Computer:36701] [12] 0   libtiledarray.dylib                 0x0000000102cb1987 _ZN7madness17serialize_am_argsINS_7archive19BufferOutputArchiveERKNSt3__16vectorImNS3_9allocatorImEEEEJRKNS4_INS_6FutureIN10TiledArray6TensorIdN5Eigen17aligned_allocatorIdEEEEEENS5_ISH_EEEERKNSA_IvEESO_SO_SO_SO_SO_SO_EEEvOT_OT0_DpOT1_ + 231
[Chong_Computer:36701] [13] 0   libtiledarray.dylib                 0x0000000102cb1764 _ZN7madness17serialize_am_argsIRNS_7archive19BufferOutputArchiveERKNS_6detail4infoIMN10TiledArray6detail10ReplicatorINS6_9DistArrayINS6_6TensorIdN5Eigen17aligned_allocatorIdEEEENS6_12SparsePolicyEEEEEFvRKNSt3__16vectorImNSI_9allocatorImEEEERKNSJ_INS_6FutureISE_EENSK_ISQ_EEEEEEEJSO_SU_RKNSP_IvEES12_S12_S12_S12_S12_S12_EEEvOT_OT0_DpOT1_ + 276
[Chong_Computer:36701] [14] 0   libtiledarray.dylib                 0x0000000102caf32b _ZN7madness10new_am_argIJNS_6detail4infoIMN10TiledArray6detail10ReplicatorINS3_9DistArrayINS3_6TensorIdN5Eigen17aligned_allocatorIdEEEENS3_12SparsePolicyEEEEEFvRKNSt3__16vectorImNSF_9allocatorImEEEERKNSG_INS_6FutureISB_EENSH_ISN_EEEEEEESJ_SP_NSM_IvEESV_SV_SV_SV_SV_SV_EEEPNS_5AmArgEDpRKT_ + 155
[Chong_Computer:36701] [15] 0   libtiledarray.dylib                 0x0000000102cf26bd _ZNK7madness11WorldObjectIN10TiledArray6detail10ReplicatorINS1_9DistArrayINS1_6TensorIdN5Eigen17aligned_allocatorIdEEEENS1_12SparsePolicyEEEEEE9send_taskINS_6TaskFnINS_6detail14MemFuncWrapperIPSC_MSC_FvRKNSt3__16vectorImNSJ_9allocatorImEEEERKNSK_INS_6FutureIS9_EENSL_ISR_EEEEEvEESN_ST_vvvvvvvEESX_SN_ST_NSQ_IvEES10_S10_S10_S10_S10_S10_EENT_7futureTEiT0_RKT1_RKT2_RKT3_RKT4_RKT5_RKT6_RKT7_RKT8_RKT9_RKNS_14TaskAttributesE + 461
[Chong_Computer:36701] [16] 0   libtiledarray.dylib                 0x0000000102cf1fa9 _ZNK7madness11WorldObjectIN10TiledArray6detail10ReplicatorINS1_9DistArrayINS1_6TensorIdN5Eigen17aligned_allocatorIdEEEENS1_12SparsePolicyEEEEEE4taskIMSC_FvRKNSt3__16vectorImNSF_9allocatorImEEEERKNSG_INS_6FutureIS9_EENSH_ISN_EEEEESJ_SP_EENS_6detail16task_result_typeIT_E7futureTEiSW_RKT0_RKT1_RKNS_14TaskAttributesE + 409
[Chong_Computer:36701] [17] 0   libtiledarray.dylib                 0x0000000102cf1d94 _ZN10TiledArray6detail10ReplicatorINS_9DistArrayINS_6TensorIdN5Eigen17aligned_allocatorIdEEEENS_12SparsePolicyEEEE4sendEv + 244
[Chong_Computer:36701] [18] 0   libtiledarray.dylib                 0x0000000102cf130f _ZN10TiledArray6detail10ReplicatorINS_9DistArrayINS_6TensorIdN5Eigen17aligned_allocatorIdEEEENS_12SparsePolicyEEEE10delay_sendEv + 47
[Chong_Computer:36701] [19] 0   libtiledarray.dylib                 0x0000000102cf11e0 _ZN10TiledArray6detail10ReplicatorINS_9DistArrayINS_6TensorIdN5Eigen17aligned_allocatorIdEEEENS_12SparsePolicyEEEEC2ERKS9_S9_ + 3968
[Chong_Computer:36701] [20] 0   libtiledarray.dylib                 0x0000000102c8015d _ZN10TiledArray6detail10ReplicatorINS_9DistArrayINS_6TensorIdN5Eigen17aligned_allocatorIdEEEENS_12SparsePolicyEEEEC1ERKS9_S9_ + 29
[Chong_Computer:36701] [21] 0   libtiledarray.dylib                 0x0000000102c7fee3 _ZN10TiledArray9DistArrayINS_6TensorIdN5Eigen17aligned_allocatorIdEEEENS_12SparsePolicyEE15make_replicatedEv + 467
[Chong_Computer:36701] [22] 0   mp2_f12_expression                  0x0000000100028b86 _ZN4mpqc9array_ops14array_to_eigenIdN10TiledArray12SparsePolicyEEEN5Eigen6MatrixIT_Lin1ELin1ELi1ELin1ELin1EEERKNS2_9DistArrayINS2_6TensorIS6_NS4_17aligned_allocatorIS6_EEEET0_EE + 390
[Chong_Computer:36701] [23] 0   mp2_f12_expression                  0x00000001000f4998 _ZN4mpqc9integrals14AtomicIntegralIN10TiledArray6TensorIdN5Eigen17aligned_allocatorIdEEEENS2_12SparsePolicyEE8compute2ERKNS_7FormulaE + 3720
[Chong_Computer:36701] [24] 0   mp2_f12_expression                  0x00000001000f2f91 _ZN4mpqc9integrals14AtomicIntegralIN10TiledArray6TensorIdN5Eigen17aligned_allocatorIdEEEENS2_12SparsePolicyEE7computeERKNS_7FormulaE + 657
[Chong_Computer:36701] [25] 0   mp2_f12_expression                  0x0000000100027c0a _ZN4mpqc9integrals14AtomicIntegralIN10TiledArray6TensorIdN5Eigen17aligned_allocatorIdEEEENS2_12SparsePolicyEE7computeERKNSt3__112basic_stringIwNSA_11char_traitsIwEENSA_9allocatorIwEEEE + 122
[Chong_Computer:36701] [26] 0   mp2_f12_expression                  0x000000010025e336 _ZN4mpqc9integrals17MolecularIntegralIN10TiledArray6TensorIdN5Eigen17aligned_allocatorIdEEEENS2_12SparsePolicyEE8compute4ERKNS_7FormulaE + 486
[Chong_Computer:36701] [27] 0   mp2_f12_expression                  0x000000010025bd61 _ZN4mpqc9integrals17MolecularIntegralIN10TiledArray6TensorIdN5Eigen17aligned_allocatorIdEEEENS2_12SparsePolicyEE7computeERKNS_7FormulaE + 993
[Chong_Computer:36701] [28] 0   mp2_f12_expression                  0x000000010002af6a _ZN4mpqc9integrals17MolecularIntegralIN10TiledArray6TensorIdN5Eigen17aligned_allocatorIdEEEENS2_12SparsePolicyEEclERKNSt3__112basic_stringIwNSA_11char_traitsIwEENSA_9allocatorIwEEEE + 122
[Chong_Computer:36701] [29] 0   mp2_f12_expression                  0x0000000100007537 main + 18231
[Chong_Computer:36701] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 36701 on node Chong_Computer exited on signal 6 (Abort trap: 6).
--------------------------------------------------------------------------

function array_to_eigen

template <typename T, typename Policy>
    Matrix<T> array_to_eigen(TA::DistArray<TA::Tensor<T>, Policy> const &A) {

        TA_ASSERT(A.range().rank() == 2);

        auto const &mat_extent = A.trange().elements().extent();
        Matrix<T> out_mat = Matrix<T>::Zero(mat_extent[0], mat_extent[1]);

        // Copy A and make it replicated.  Making A replicated is a mutating op.
        auto repl_A = A;
        repl_A.make_replicated();

        // Loop over the array and assign the tiles to blocks of the Eigen Mat.
        auto pmap = repl_A.get_pmap();
        const auto end = pmap->end();
        for (auto it = pmap->begin(); it != end; ++it) {
            if (!repl_A.is_zero(*it)) {
                auto tile = repl_A.find(*it).get();
                A.get_world().taskq.add(write_to_eigen_task<TA::Tensor<T>>, tile, &out_mat);
            }
        }
        A.get_world().gop.fence(); // Wait for the write tasks; out_mat must not go out of scope before they finish

        return out_mat;
    }

After adding fence() before make_replicated(), the error seems to disappear:

template <typename T, typename Policy>
    Matrix<T> array_to_eigen(TA::DistArray<TA::Tensor<T>, Policy> const &A) {

        TA_ASSERT(A.range().rank() == 2);

        auto const &mat_extent = A.trange().elements().extent();
        Matrix<T> out_mat = Matrix<T>::Zero(mat_extent[0], mat_extent[1]);

        // Copy A and make it replicated.  Making A replicated is a mutating op.
        auto repl_A = A;
        A.get_world().gop.fence();
        repl_A.make_replicated();

        // Loop over the array and assign the tiles to blocks of the Eigen Mat.
        auto pmap = repl_A.get_pmap();
        const auto end = pmap->end();
        for (auto it = pmap->begin(); it != end; ++it) {
            if (!repl_A.is_zero(*it)) {
                auto tile = repl_A.find(*it).get();
                A.get_world().taskq.add(write_to_eigen_task<TA::Tensor<T>>, tile, &out_mat);
            }
        }
        A.get_world().gop.fence(); // Wait for the write tasks; out_mat must not go out of scope before they finish

        return out_mat;
    }
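
To picture why waiting before make_replicated() could matter, here is a toy sketch of the failure mode described in this thread: packing tiles into a single message while one of them has not been produced yet. It does not use TiledArray or MADNESS. `ToyFuture`, `count_pending`, and `pack_tiles` are hypothetical names invented for this illustration, and the readiness check is only an assumption about what the failed assertion guards against.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a future that will eventually hold one tile.
struct ToyFuture {
    bool ready = false;  // has the producing task finished?
    double value = 0.0;  // tile payload (a single number in this toy)
    void set(double v) { value = v; ready = true; }
};

// Count the tiles whose producing task has not finished yet.
inline std::size_t count_pending(const std::vector<ToyFuture>& tiles) {
    std::size_t pending = 0;
    for (const ToyFuture& t : tiles) {
        if (!t.ready) ++pending;
    }
    return pending;
}

// Pack every tile into one message. Packing refuses to proceed if any
// tile is still unset, which mimics an assertion on an unset future.
inline std::vector<double> pack_tiles(const std::vector<ToyFuture>& tiles) {
    std::vector<double> message;
    message.reserve(tiles.size());
    for (const ToyFuture& t : tiles) {
        if (!t.ready) throw std::logic_error("tile future not set");
        message.push_back(t.value);
    }
    assert(message.size() == tiles.size());
    return message;
}
```

In this toy, calling pack_tiles() while count_pending() is nonzero throws, and calling it after every tile has been set succeeds. The extra fence() in the workaround above may play a similar role by letting outstanding work finish first, but that is a guess rather than a confirmed diagnosis.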

pchong90 commented 8 years ago

By the way, I am using TA at commit 253ea2e and MADNESS at commit 20f2f61.

justusc commented 8 years ago

I am not sure how DistArray::make_replicated() is failing due to an unset future. The algorithm is designed to handle such a situation.

@pchong90 You are likely seeing hangs with this algorithm because the MADNESS receive buffer is too small and the huge message protocol is being used. DistArray::make_replicated() sends all local tiles in one message. I will need to address this behavior as well.
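
As a rough illustration of the alternative to sending all local tiles in one message, the sketch below greedily groups tiles into batches that each stay under a byte limit. It is not TiledArray code and not necessarily the fix that will be adopted. `batch_tiles`, `tile_bytes`, and `max_bytes` are hypothetical names, and the limit is just a stand-in for whatever the receive buffer can hold.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Group tile indices into batches whose total size stays within max_bytes.
// A tile that is larger than max_bytes on its own is placed alone in a
// batch, so it would still need a large-message path.
inline std::vector<std::vector<std::size_t>> batch_tiles(
    const std::vector<std::size_t>& tile_bytes, std::size_t max_bytes) {
    assert(max_bytes > 0);
    std::vector<std::vector<std::size_t>> batches;
    std::vector<std::size_t> current;  // tile indices in the open batch
    std::size_t current_bytes = 0;     // bytes accumulated in the open batch
    for (std::size_t i = 0; i < tile_bytes.size(); ++i) {
        // Close the open batch if adding this tile would exceed the limit.
        if (!current.empty() && current_bytes + tile_bytes[i] > max_bytes) {
            batches.push_back(current);
            current.clear();
            current_bytes = 0;
        }
        current.push_back(i);
        current_bytes += tile_bytes[i];
    }
    if (!current.empty()) batches.push_back(current);
    return batches;
}
```

With tile sizes {400, 400, 400, 1200, 100} and a limit of 1000 bytes, this yields four batches: tiles 0 and 1 together, then tiles 2, 3, and 4 each on their own. Only the oversized tile 3 would exceed the limit.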