dash-project / dash

DASH, the C++ Template Library for Distributed Data Structures with Support for Hierarchical Locality for HPC and Data-Driven Science
http://www.dash-project.org/

dash::copy not working between containers in different teams #449

Open knuedd opened 6 years ago

knuedd commented 6 years ago

dash::copy segfaults, in both global-to-global and global-to-local mode, when copying between containers that are associated with different teams.

An example that reproduces this can be found in dash-apps under multigrid/multigrid3d_elastic.cpp. It currently still requires the feat-halo branch.

We talked about this at the project meeting last week. If you need more details, I'll be happy to provide them.

Thanks, Andreas

devreal commented 6 years ago

Andreas,

Thanks for opening a ticket, that helps to track the issue. It's still not clear what is going wrong here. Before I start debugging this, do you happen to have a stack trace at hand?

knuedd commented 6 years ago

==== backtrace ====
 2 0x00000000000575cc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u7-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.8.0-gcc-OFED-3.18-redhat6.7-x86_64/mxm-v3.6/src/mxm/util/debug/debug.c:641
 3 0x000000000005773c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u7-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.8.0-gcc-OFED-3.18-redhat6.7-x86_64/mxm-v3.6/src/mxm/util/debug/debug.c:616
 4 0x0000003afca32510 killpg()  ??:0
 5 0x0000003afca89782 memcpy()  ??:0
 6 0x000000000041262f _ZN4dash4copyIdNS_8GlobIterIdNS_12BlockPatternILi3ELNS_10MemArrangeE1ElEENS_13GlobStaticMemIdNS_9allocator18SymmetricAllocatorIdEEEENS_7GlobPtrIdS9_EENS_7GlobRefIdEEEEEEPT_T0_SH_SG_()  /sw/taurus/libraries/dash/dash-feat-halo_14-09-2017/include/dash/algorithm/Copy.h:878
 7 0x000000000040af83 _Z15transfertofewerR5LevelS0_()  /home/knuepfe/prog/dash-apps/multigrid/multigrid3d_elastic.cpp:611
 8 0x000000000040c358 _Z7v_cycleN9__gnu_cxx17__normal_iteratorIPKP5LevelSt6vectorIS2_SaIS2_EEEES8_jd()  /home/knuepfe/prog/dash-apps/multigrid/multigrid3d_elastic.cpp:852
 9 0x000000000040c79b _Z7v_cycleN9__gnu_cxx17__normal_iteratorIPKP5LevelSt6vectorIS2_SaIS2_EEEES8_jd()  /home/knuepfe/prog/dash-apps/multigrid/multigrid3d_elastic.cpp:903
10 0x000000000040c79b _Z7v_cycleN9__gnu_cxx17__normal_iteratorIPKP5LevelSt6vectorIS2_SaIS2_EEEES8_jd()  /home/knuepfe/prog/dash-apps/multigrid/multigrid3d_elastic.cpp:903
11 0x000000000040c79b _Z7v_cycleN9__gnu_cxx17__normal_iteratorIPKP5LevelSt6vectorIS2_SaIS2_EEEES8_jd()  /home/knuepfe/prog/dash-apps/multigrid/multigrid3d_elastic.cpp:903
12 0x000000000040da6b main()  /home/knuepfe/prog/dash-apps/multigrid/multigrid3d_elastic.cpp:1158
13 0x0000003afca1ed1d __libc_start_main()  ??:0
14 0x0000000000407771 _start()  ??:0
===================

devreal commented 6 years ago

This appears to be a bug somewhere in the pattern code. Here is what I have so far:

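dash::copy first assumes that the whole copy is local, because the range returned by dash::local_index_range(in_first, in_last) has the same length as total_copy_elem. However, the call to in_first.local() returns nullptr, because _pattern->local(idx) claims that the values are located on another unit. The two locality queries therefore disagree, and the local copy path ends up passing a null source pointer to memcpy, which matches the memcpy frame in the backtrace.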
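
To make the mismatch concrete, here is a minimal, DASH-free sketch of a guard that takes the local memcpy path only when both locality queries agree. This is an illustration under stated assumptions, not the code in Copy.h and not a proposed patch: IndexRange, CopyPath and guarded_copy are hypothetical names, and the remote branch only signals where a real implementation would issue a remote get.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the result of dash::local_index_range(...).
struct IndexRange {
    std::size_t begin;
    std::size_t end;
};

// Which path the copy took (hypothetical, for illustration only).
enum class CopyPath { Local, Remote };

// Hypothetical guard, not part of DASH: use the local memcpy path only if the
// index range AND the resolved local pointer agree that the data is local.
CopyPath guarded_copy(const double* local_src,       // stand-in for in_first.local()
                      IndexRange    local_range,     // stand-in for the local index range
                      std::size_t   total_copy_elem, // number of elements to copy
                      double*       dest)
{
    assert(dest != nullptr);

    const bool range_is_local =
        (local_range.end - local_range.begin) == total_copy_elem;

    if (range_is_local && local_src != nullptr) {
        // Both queries agree: a plain local copy is safe.
        std::memcpy(dest, local_src, total_copy_elem * sizeof(double));
        return CopyPath::Local;
    }

    // The queries disagree, or the range is not fully local. Do not touch
    // local_src; a real implementation would perform a remote get here.
    return CopyPath::Remote;
}
```

With a valid source pointer, guarded_copy copies the data and returns CopyPath::Local. With local_src == nullptr and a range that still claims the data is fully local (the situation described above), it returns CopyPath::Remote and leaves dest untouched instead of crashing in memcpy. Such a guard would only hide the symptom; the inconsistency in the pattern code would still need to be fixed.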

I'm afraid that unless I spend a significant amount of time digging through the pattern code, I won't be of much help. I think this is a job for @fuchsto

fuchsto commented 6 years ago

@devreal Aye!