NanoComp / mpb

MIT Photonic-Bands: computation of photonic band structures in periodic media
GNU General Public License v2.0

MPI build segfaults in `make check` #8

Open ikirker opened 9 years ago

ikirker commented 9 years ago

I'm trying to build MPB 1.5 with:

The ordinary serial build and the version configured with --with-inv-symmetry both build fine and pass the make check tests, but make check for the ordinary MPI build fails with the output shown below:

make[3]: Entering directory `/tmp/tmp.SoK8cAONll/mpb-1.5/mpb'
./mpb_mpi ../examples/check.ctl
**************************************************************************
 Test case: Square lattice of dielectric rods in air.
**************************************************************************
init-params: initializing eigensolver data
Computing 8 bands with 1.000000e-09 tolerance.
Working in 2 dimensions.
Grid size is 32 x 32 x 1.
Solving for 8 bands at a time.
Creating Maxwell data...
Allocating fields...
Mesh size is 3.
Lattice vectors:
     (1, 0, 0)
     (0, 1, 0)
     (0, 0, 1)
Cell volume = 1
Reciprocal lattice vectors (/ 2 pi):
     (1, 0, 0)
     (-0, 1, -0)
     (0, -0, 1)
Geometric objects:
     cylinder, center = (0,0,0)
          radius 0.2, height 1e+20, axis (0, 0, 1)
          dielectric constant epsilon = 11.56
Geometric object tree has depth 1 and 1 object nodes (vs. 1 actual objects)
Initializing dielectric function...
16 k-points:
     (0,0,0)
     (0.1,0,0)
     (0.2,0,0)
     (0.3,0,0)
     (0.4,0,0)
     (0.5,0,0)
     (0.5,0.1,0)
     (0.5,0.2,0)
     (0.5,0.3,0)
     (0.5,0.4,0)
     (0.5,0.5,0)
     (0.4,0.4,0)
     (0.3,0.3,0)
     (0.2,0.2,0)
     (0.1,0.1,0)
     (0,0,0)
Solving for band polarization: te.
Initializing fields to random numbers...
elapsed time for initialization: 0 seconds.
epsilon: 1-11.56, mean 2.327, harm. mean 1.1441, 14.5508% > 1, 12.5663% "fill"
Outputting check-epsilon...
solve_kpoint (0,0,0):
tefreqs:, k index, k1, k2, k3, kmag/2pi, te band 1, te band 2, te band 3, te band 4, te band 5, te band 6, te band 7, te band 8
Solving for bands 2 to 8...
make[3]: *** [check-local] Segmentation fault (core dumped)
make[3]: Leaving directory `/tmp/tmp.SoK8cAONll/mpb-1.5/mpb'
make[2]: *** [check-am] Error 2

Since I figured it might be useful, I hooked up gdb and took a look at the stack trace in the core dump, and got the following:

Core was generated by `./mpb_mpi ../examples/check.ctl'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000023f8870 in ?? ()
Missing separate debuginfos, use: debuginfo-install glibc-2.17-78.el7.x86_64 gmp-6.0.0-11.el7.x86_64 libffi-3.0.13-11.el7.x86_64 libunistring-0.9.3-9.el7.x86_64 ncurses-libs-5.9-13.20130511.el7.x86_64 nss-softokn-freebl-3.16.2.3-9.el7.x86_64 readline-6.2-9.el7.x86_64 zlib-1.2.7-13.el7.x86_64
(gdb) bt
#0  0x00000000023f8870 in ?? ()
#1  0x00007fe093ad65bc in fftw_execute_r2r () from /shared/ucl/apps/intel/2015/composer_xe_2015.2.164/mkl/lib/intel64/libmkl_intel_lp64.so
#2  0x00000000004637b7 in maxwell_compute_fft (dir=<optimized out>, d=<optimized out>, array_in=<optimized out>, array_out=<optimized out>, 
    howmany=<optimized out>, stride=<optimized out>, dist=<optimized out>) at maxwell_op.c:242
#3  maxwell_compute_d_from_H (d=0x7fffd2a4dfb0, Hin=..., dfield=0x7fe093f6e040, cur_band_start=-1812537280, cur_num_bands=37610592) at maxwell_op.c:417
#4  0x0000000000462ce4 in maxwell_operator (Xin=..., Xout=..., data=0x7fffd2a4dfb0, is_current_eigenvector=-1812537280, Work=...) at maxwell_op.c:1145
#5  0x0000000000451fc5 in eigensolver_lagrange (Y=..., eigenvals=<optimized out>, A=<optimized out>, Adata=<optimized out>, K=<optimized out>, 
    Kdata=<optimized out>, constraint=<optimized out>, constraint_data=<optimized out>, L=<optimized out>, Ldata=<optimized out>, lag=<optimized out>, 
    Work=<optimized out>, nWork=<optimized out>, tolerance=<optimized out>, num_iterations=<optimized out>, flags=68) at eigensolver.c:348
#6  eigensolver (Y=..., eigenvals=0x7fffd2a4dfb0, A=0x7fe093f6e040, Adata=0x7fe093f6e040, K=0x23de460, Kdata=0x2, constraint=0x2, 
    constraint_data=0x7fe000000000, Work=0x3f1b400000000000, nWork=0, tolerance=<unavailable>, num_iterations=0x40855d9f4b2bdd17, flags=1) at eigensolver.c:780
#7  0x000000000041db57 in solve_kpoint (kvector=...) at mpb.c:691
#8  0x00000000004474bb in solve_kpoint_aux (arg_scm_0=0x7fffd2a4dfb0) at ctl-io.c:2511
#9  0x00007fe0900b9bba in vm_regular_engine (vm=0x7fffd2a4dfb0, program=0x7fe093f6e040, argv=0x1a94678, nargs=4486272) at vm-i-system.c:855
#10 0x00007fe0900295b3 in scm_primitive_eval (exp=0x2111e30) at eval.c:692
#11 0x00007fe09004b3cb in scm_primitive_load (filename=<optimized out>) at load.c:124
#12 0x00007fe0900b9bba in vm_regular_engine (vm=0x7fffd2a4dfb0, program=0x7fe093f6e040, argv=0x1a941f0, nargs=-1878740240) at vm-i-system.c:855
#13 0x00007fe090028fd7 in scm_call_1 (proc=0x1cfccf0, arg1=0x21a38e0) at eval.c:486
#14 0x0000000000421c2a in main_entry (main_entry_data=0x7fffd2a4dfb0, argc=-1812537280, argv=0x7fe093f6e040) at main.c:292
#15 0x00007fe090045fad in invoke_main_func (body_data=0x7fffd2a4eed0) at init.c:336
#16 0x00007fe09001f71a in c_body (d=0x7fffd2a4ee20) at continuations.c:517
#17 0x00007fe0900b9baa in vm_regular_engine (vm=0x7fffd2a4dfb0, program=0x7fe093f6e040, argv=0x1a940b8, nargs=-1878540448) at vm-i-system.c:858
#18 0x00007fe0900290f3 in scm_call_4 (proc=0x1b70c30, arg1=arg1@entry=0x404, arg2=<optimized out>, arg3=<optimized out>, arg4=<optimized out>) at eval.c:507
#19 0x00007fe09009f1c9 in scm_catch_with_pre_unwind_handler (key=key@entry=0x404, thunk=<optimized out>, handler=<optimized out>, 
    pre_unwind_handler=<optimized out>) at throw.c:73
#20 0x00007fe09009f2cf in scm_c_catch (tag=tag@entry=0x404, body=body@entry=0x7fe09001f710 <c_body>, body_data=body_data@entry=0x7fffd2a4ee20, 
    handler=handler@entry=0x7fe09001faf0 <c_handler>, handler_data=handler_data@entry=0x7fffd2a4ee20, 
    pre_unwind_handler=pre_unwind_handler@entry=0x7fe09001f8a0 <pre_unwind_handler>, pre_unwind_handler_data=0x1b16ff0) at throw.c:207
#21 0x00007fe09001fe91 in scm_i_with_continuation_barrier (body=body@entry=0x7fe09001f710 <c_body>, body_data=body_data@entry=0x7fffd2a4ee20, 
    handler=handler@entry=0x7fe09001faf0 <c_handler>, handler_data=handler_data@entry=0x7fffd2a4ee20, 
    pre_unwind_handler=pre_unwind_handler@entry=0x7fe09001f8a0 <pre_unwind_handler>, pre_unwind_handler_data=0x1b16ff0) at continuations.c:455
#22 0x00007fe09001ff25 in scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized out>) at continuations.c:551
#23 0x00007fe09009cb2c in with_guile_and_parent (base=0x7fffd2a4ee80, data=0x7fffd2a4eea0) at threads.c:906
#24 0x00007fe08f192851 in GC_call_with_stack_base (fn=0x7fffd2a4dfb0, fn@entry=0x7fe09009cae0 <with_guile_and_parent>, arg=0x7fe093f6e040, 
    arg@entry=0x7fffd2a4eea0) at misc.c:1840
#25 0x00007fe09009cf18 in scm_i_with_guile_and_parent (parent=<optimized out>, data=data@entry=0x7fffd2a4eea0, 
    func=func@entry=0x7fe090045f90 <invoke_main_func>) at threads.c:949
#26 scm_with_guile (func=func@entry=0x7fe090045f90 <invoke_main_func>, data=data@entry=0x7fffd2a4eed0) at threads.c:955
#27 0x00007fe090046155 in scm_boot_guile (argc=<optimized out>, argv=<optimized out>, main_func=<optimized out>, closure=<optimized out>) at init.c:319
#28 0x0000000000421036 in main (argc=2, argv=0x7fffd2a4f068) at main.c:319
(gdb) 

And the ldd output looks like this:

$ ldd mpb_mpi
    linux-vdso.so.1 =>  (0x00007fff11744000)
    libmkl_intel_lp64.so => /shared/ucl/apps/intel/2015/composer_xe_2015.2.164/mkl/lib/intel64/libmkl_intel_lp64.so (0x00007f940ecba000)
    libmkl_intel_thread.so => /shared/ucl/apps/intel/2015/composer_xe_2015.2.164/mkl/lib/intel64/libmkl_intel_thread.so (0x00007f940d899000)
    libmkl_core.so => /shared/ucl/apps/intel/2015/composer_xe_2015.2.164/mkl/lib/intel64/libmkl_core.so (0x00007f940bd3b000)
    libiomp5.so => /shared/ucl/apps/intel/2015/composer_xe_2015.2.164/compiler/lib/intel64/libiomp5.so (0x00007f940b9fe000)
    libguile-2.0.so.22 => /home/uccaiki/mpb_buildscript/test_build/lib/libguile-2.0.so.22 (0x00007f940b66f000)
    libffi.so.6 => /lib64/libffi.so.6 (0x00007f940b456000)
    libunistring.so.0 => /lib64/libunistring.so.0 (0x00007f940b13e000)
    libgmp.so.10 => /lib64/libgmp.so.10 (0x00007f940aec7000)
    libltdl.so.7 => /shared/ucl/apps/libtool/2.4.6/lib/libltdl.so.7 (0x00007f940acbd000)
    libcrypt.so.1 => /lib64/libcrypt.so.1 (0x00007f940aa85000)
    libgc.so.1 => /home/uccaiki/mpb_buildscript/test_build/lib/libgc.so.1 (0x00007f940a80e000)
    libhdf5.so.10 => /home/uccaiki/mpb_buildscript/test_build/lib/libhdf5.so.10 (0x00007f940a240000)
    libz.so.1 => /lib64/libz.so.1 (0x00007f940a029000)
    libfftw3_mpi.so.3 => /home/uccaiki/mpb_buildscript/test_build/lib/libfftw3_mpi.so.3 (0x00007f9409e11000)
    libfftw3.so.3 => /home/uccaiki/mpb_buildscript/test_build/lib/libfftw3.so.3 (0x00007f9409a85000)
    libimf.so => /shared/ucl/apps/intel/2015/composer_xe_2015.2.164/compiler/lib/intel64/libimf.so (0x00007f94095c9000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f94092c7000)
    libifport.so.5 => /shared/ucl/apps/intel/2015/composer_xe_2015.2.164/compiler/lib/intel64/libifport.so.5 (0x00007f940909a000)
    libifcore.so.5 => /shared/ucl/apps/intel/2015/composer_xe_2015.2.164/compiler/lib/intel64/libifcore.so.5 (0x00007f9408d67000)
    libsvml.so => /shared/ucl/apps/intel/2015/composer_xe_2015.2.164/compiler/lib/intel64/libsvml.so (0x00007f9407e94000)
    libirc.so => /shared/ucl/apps/intel/2015/composer_xe_2015.2.164/compiler/lib/intel64/libirc.so (0x00007f9407c39000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f9407a1c000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f9407818000)
    libmpifort.so.12 => /shared/ucl/apps/intel/2015/impi/5.0.3.048/intel64/lib/libmpifort.so.12 (0x00007f940758f000)
    libmpi.so.12 => /shared/ucl/apps/intel/2015/impi/5.0.3.048/intel64/lib/libmpi.so.12 (0x00007f9406e03000)
    librt.so.1 => /lib64/librt.so.1 (0x00007f9406bfb000)
    libgcc_s.so.1 => /shared/ucl/apps/gcc/4.9.2/lib64/libgcc_s.so.1 (0x00007f94069e4000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f9406622000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f940f5cf000)
    libfreebl3.so => /lib64/libfreebl3.so (0x00007f940641f000)
    libirng.so => /shared/ucl/apps/intel/2015/composer_xe_2015.2.164/compiler/lib/intel64/libirng.so (0x00007f9406217000)
    libintlc.so.5 => /shared/ucl/apps/intel/2015/composer_xe_2015.2.164/compiler/lib/intel64/libintlc.so.5 (0x00007f9405fbc000)

Annoyingly, it looks like the MKL library contains symbols with the same names as the FFTW ones (frame #1 in the backtrace is fftw_execute_r2r from libmkl_intel_lp64.so), and these are getting in the way. Maybe they're wrappers? I'm not using the actual FFTW interface wrapper libraries, which are in a different object file that isn't linked here. I'm not sure this is what's causing the problem, since I can't tell whether the top frame (#0) is actually an FFTW object, but has anyone seen this before, or does anyone have any suggestions?

I guess I could just rebuild everything with the GNU compilers and OpenBLAS, but I'd rather not if possible...
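
For anyone wanting to check the same thing on their own build, here is a minimal Python sketch that parses `ldd` output and reports which of the MKL or FFTW libraries is listed first. The helper names (`parse_ldd`, `first_match`) are made up for illustration, and the sketch assumes that the order `ldd` prints roughly reflects the order in which the dynamic loader searches for symbols, which is worth confirming on your own system.

```python
import re

# Matches ldd lines of the form "libfoo.so.1 => /path/libfoo.so.1 (0x...)".
# Lines without a resolved path (e.g. linux-vdso) are deliberately skipped.
_LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(\S+)\s+\(0x[0-9a-fA-F]+\)\s*$")


def parse_ldd(ldd_text):
    """Return [(soname, path), ...] in the order ldd printed them."""
    entries = []
    for line in ldd_text.splitlines():
        match = _LDD_LINE.match(line)
        if match:
            entries.append((match.group(1), match.group(2)))
    return entries


def first_match(entries, prefixes):
    """Return the first soname starting with any of the prefixes, or None."""
    for soname, _path in entries:
        if soname.startswith(tuple(prefixes)):
            return soname
    return None


# Three lines copied from the ldd output in this report.
SAMPLE = """\
    libmkl_intel_lp64.so => /shared/ucl/apps/intel/2015/composer_xe_2015.2.164/mkl/lib/intel64/libmkl_intel_lp64.so (0x00007f940ecba000)
    libfftw3_mpi.so.3 => /home/uccaiki/mpb_buildscript/test_build/lib/libfftw3_mpi.so.3 (0x00007f9409e11000)
    libfftw3.so.3 => /home/uccaiki/mpb_buildscript/test_build/lib/libfftw3.so.3 (0x00007f9409a85000)
"""

entries = parse_ldd(SAMPLE)
print(first_match(entries, ["libmkl_", "libfftw3"]))
```

In the listing above the MKL library appears before both FFTW libraries, which would be consistent with the backtrace showing fftw_execute_r2r coming from libmkl_intel_lp64.so, though it does not prove that is the cause of the crash.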

stevengj commented 9 years ago

Yes, we need the actual FFTW libraries, not the MKL ones, in order for the MPI stuff to work; there may be some conflict here.

I would recommend just using OpenBLAS. And the choice of compiler should make almost no difference here, since nearly all of the time in MPB is spent in FFTW and BLAS.
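
To see whether a particular BLAS/LAPACK library would introduce this kind of conflict, one rough check is to ask whether an FFTW-named symbol such as fftw_execute_r2r can be looked up through it. The sketch below uses Python's `ctypes` on a POSIX system; `exports_symbol` is a hypothetical helper name, and note that the lookup also searches the library's own dependencies, so a positive result only says the symbol is reachable through that library.

```python
import ctypes


def exports_symbol(library, symbol):
    """Return True if `symbol` can be looked up through `library`.

    `library` is a path to a shared object, or None to search the symbols
    already loaded into the current process (POSIX only).
    """
    try:
        handle = ctypes.CDLL(library)
    except OSError:
        return False  # the library could not be loaded at all
    try:
        getattr(handle, symbol)  # raises AttributeError if lookup fails
    except AttributeError:
        return False
    return True


# Sanity check against the running process: the C library's malloc is there.
print(exports_symbol(None, "malloc"))

# Example call for this thread (path taken from the ldd output; adjust locally):
# exports_symbol(".../mkl/lib/intel64/libmkl_intel_lp64.so", "fftw_execute_r2r")
```

If the call returns True for the BLAS library and MPB is also linked against the real FFTW, the two libraries are candidates for the conflict described here.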

ikirker commented 9 years ago

@stevengj: Thanks, I've rebuilt with OpenBLAS, FFTW 3.3.4, OpenMPI, and GNU compilers, and it all seems to be working fine.

vanzod commented 6 years ago

@stevengj I have been hitting the same segmentation fault when building v1.6.2. Is there any update on this issue or any plan to find a solution in future releases?