LLNL / MACSio

A Multi-purpose, Application-Centric, Scalable I/O Proxy Application
https://computing.llnl.gov/projects/co-design/macsio

Segfault and core dump on Summit #31

Open williamfgc opened 3 years ago

williamfgc commented 3 years ago

I need some guidance on how to run MACSio on Summit. To reproduce: I was able to successfully build the MACSio binary with the following dependencies:

ldd ~/opt/macsio/macsio 
    linux-vdso64.so.1 =>  (0x00007fffb6120000)
    libjson-cwx.so.2 => /ccs/home/wgodoy/opt/json-cwx/lib/libjson-cwx.so.2 (0x00007fffb60e0000)
    libmpiprofilesupport.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/lib/libmpiprofilesupport.so.3 (0x00007fffb60b0000)
    libmpi_ibm.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/lib/libmpi_ibm.so.3 (0x00007fffb5f30000)
    libstdc++.so.6 => /sw/summit/gcc/6.4.0/lib64/libstdc++.so.6 (0x00007fffb5d20000)
    libm.so.6 => /lib64/libm.so.6 (0x00007fffb5c10000)
    libgcc_s.so.1 => /sw/summit/gcc/6.4.0/lib64/libgcc_s.so.1 (0x00007fffb5bd0000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fffb59e0000)
    librt.so.1 => /lib64/librt.so.1 (0x00007fffb59b0000)
    libutil.so.1 => /lib64/libutil.so.1 (0x00007fffb5980000)
    libhwloc_ompi.so.15 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/lib/libhwloc_ompi.so.15 (0x00007fffb5910000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007fffb58e0000)
    libevent-2.1.so.6 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/lib/libevent-2.1.so.6 (0x00007fffb5860000)
    libevent_pthreads-2.1.so.6 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/lib/libevent_pthreads-2.1.so.6 (0x00007fffb5830000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fffb57f0000)
    libopen-rte.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/lib/libopen-rte.so.3 (0x00007fffb56e0000)
    libopen-pal.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/lib/libopen-pal.so.3 (0x00007fffb55f0000)
    /lib64/ld64.so.2 (0x00007fffb6140000)

Unfortunately, every combination of input parameters I have tried results in a segfault and a core dump. For example:

jsrun -n 2 ${MACSIO_EXEC} --interface hdf5 --parallel_file_mode MIF 2 --part_size 1M

Results:

cat output.490227 
[b28n03:147697] *** Process received signal ***
[b28n03:147697] Signal: Segmentation fault (11)
[b28n03:147697] Signal code: Address not mapped (1)
[b28n03:147697] Failing at address: 0x40
[b28n03:147697] [ 0] [0x2000000504d8]
[b28n03:147697] [ 1] [0x20000004d6b0]
[b28n03:147697] [ 2] /ccs/home/wgodoy/opt/json-cwx/lib/libjson-cwx.so.2(json_object_set_string+0x58)[0x2000000f9838]
[b28n03:147697] [ 3] /ccs/home/wgodoy/opt/json-cwx/lib/libjson-cwx.so.2(json_object_path_set_string+0x34)[0x2000000fa764]
[b28n03:147697] [ 4] /ccs/home/wgodoy/opt/macsio/macsio[0x10018cc0]
[b28n03:147697] [ 5] /ccs/home/wgodoy/opt/macsio/macsio(main+0x900)[0x10005980]
[b28n03:147697] [ 6] /lib64/libc.so.6(+0x25200)[0x200000655200]
[b28n03:147697] [ 7] /lib64/libc.so.6(__libc_start_main+0xc4)[0x2000006553f4]
[b28n03:147697] *** End of error message ***
[b28n03:147696] *** Process received signal ***
[b28n03:147696] Signal: Segmentation fault (11)
[b28n03:147696] Signal code: Address not mapped (1)
[b28n03:147696] Failing at address: 0x40
[b28n03:147696] [ 0] [0x2000000504d8]
[b28n03:147696] [ 1] [0x20000004d6b0]
[b28n03:147696] [ 2] /ccs/home/wgodoy/opt/json-cwx/lib/libjson-cwx.so.2(json_object_set_string+0x58)[0x2000000f9838]
[b28n03:147696] [ 3] /ccs/home/wgodoy/opt/json-cwx/lib/libjson-cwx.so.2(json_object_path_set_string+0x34)[0x2000000fa764]
[b28n03:147696] [ 4] /ccs/home/wgodoy/opt/macsio/macsio[0x10018cc0]
[b28n03:147696] [ 5] /ccs/home/wgodoy/opt/macsio/macsio(main+0x900)[0x10005980]
[b28n03:147696] [ 6] /lib64/libc.so.6(+0x25200)[0x200000655200]
[b28n03:147696] [ 7] /lib64/libc.so.6(__libc_start_main+0xc4)[0x2000006553f4]
[b28n03:147696] *** End of error message ***
ERROR:  One or more process (first noticed rank 0) terminated with signal 11 (core dumped)
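
Frame 4 in the backtrace above is reported only as a raw address inside the macsio binary, so the MACSio function that calls json_object_path_set_string is not named. As a minimal sketch (not part of MACSio; the function name `unresolved_frames` and the regular expression are illustrative assumptions based on the frame format shown above), the following Python collects such unsymbolized addresses so they can be passed to a symbolizer such as binutils' addr2line:

```python
import re

# Assumed frame format, taken from the backtrace above:
#   [host:pid] [ 4] /path/to/macsio[0x10018cc0]
#   [host:pid] [ 2] /path/to/lib.so(json_object_set_string+0x58)[0x2000000f9838]
FRAME_RE = re.compile(
    r"\[\s*(?P<index>\d+)\]\s+"          # frame index, e.g. "[ 4]"
    r"(?P<module>/[^\s(\[]+)"            # absolute path of the binary or library
    r"(?:\((?P<symbol>[^)]*)\))?"        # optional "(symbol+offset)"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]"   # raw address
)


def unresolved_frames(log_text, binary_name):
    """Return the sorted, unique addresses of frames that belong to
    `binary_name` and carry no symbol name in the backtrace."""
    addresses = set()
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match is None:
            continue  # not a frame line, or a frame with no module path
        if match.group("module").rsplit("/", 1)[-1] != binary_name:
            continue  # frame belongs to another library
        if match.group("symbol"):
            continue  # already symbolized, e.g. "(main+0x900)"
        addresses.add(match.group("address"))
    return sorted(addresses)


if __name__ == "__main__":
    import sys

    # Usage: python frames.py output.490227
    with open(sys.argv[1]) as handle:
        print(" ".join(unresolved_frames(handle.read(), "macsio")))
```

For the output above this yields the single address 0x10018cc0. It could then be resolved with something like `addr2line -f -C -e ~/opt/macsio/macsio 0x10018cc0`, assuming the binary was built with debug information (`-g`) and is not position-independent.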

macsio-log.log:

--------------------------------------------------------Processor 000000-------------------------------------------------------

Any help would be appreciated!

markcmiller86 commented 3 years ago

Which version of MACSio are you running? And can you possibly attach your CMakeCache.txt file here?

williamfgc commented 3 years ago

@markcmiller86 thanks for the quick response. I'm building the current master branch using gcc 6.4.0. Please find attached the CMakeCache.txt file.

markcmiller86 commented 3 years ago

Do you think I should be able to duplicate this behavior on LLNL's own Lassen system?

williamfgc commented 3 years ago

@markcmiller86 that's a good way to make sure it's not a Summit-specific problem. I'll try to build locally as well.

williamfgc commented 3 years ago

@markcmiller86 just following up on this after the break. I built version 1.1 on Summit, and it does not hit the segfault. Hope that helps.

markcmiller86 commented 3 years ago

Sorry for the delay. That does help. I'm just not confident I can replicate the issue on LLNL's Lassen system and am, at the moment, up to my ears in other tasks. Please ping me again in a week.