SebWouters / CheMPS2

CheMPS2: a spin-adapted implementation of DMRG for ab initio quantum chemistry
GNU General Public License v2.0

OpenMP instability #50

Closed loriab closed 7 years ago

loriab commented 7 years ago

So the CheMPS2 tests have never run for me with Intel compilers/MKL math. Usually 7 & 13 pass and all the rest segfault. Vexing, but normal. (Note that I've never had these problems with CheMPS2 w/i Psi4, just in the standalone testing and binary.)

Then, as I was adding the static-CheMPS2-lib-with-fpic capability, I found a configuration that actually ran and passed all the test cases. Seizing this, I investigated further.

STATIC/SHARED_ONLY  BUILD_FPIC  ENABLE_OPENMP  Misc.  Result
static              off         off                   all tests pass
static              off         on                    all tests pass
static              on          off                   all tests pass
static              on          on                    all segfault except 7 & 13 pass
static              on          on             fn1    all tests pass
shared              --          off                   all tests pass
shared              --          on                    all segfault except 7 & 13 pass
shared              --          on             fn1    all tests pass

fn1: set_source_files_properties(Sobject.cpp PROPERTIES COMPILE_FLAGS "-qno-openmp")

So, several dozen compilations later, I think there's something wrong with threading in your Sobject.cpp file. The below is a temporary solution to get around the problem. My compiler is icpc (ICC) 16.0.2 20160204.

if (BUILD_FPIC OR NOT STATIC_ONLY)
    set_target_properties (chemps2-base PROPERTIES POSITION_INDEPENDENT_CODE 1)
    if(CMAKE_CXX_COMPILER_ID MATCHES Intel)
        set_source_files_properties(Sobject.cpp PROPERTIES COMPILE_FLAGS "-qno-openmp")
    endif()
endif()

Being able to actually use the chemps2 executable is going to be great for comparing CheMPS2 and Psi4+CheMPS2 results.

SebWouters commented 7 years ago

@loriab

Which version of the intel compiler and MKL are you using? Normally it is only tested with the intel compiler and MKL :-). So I find it a bit weird that I never caught it.

Did you compile on the machine on which you ran? Did you put -DENABLE_XHOST=OFF if they are different in hardware somehow?

What does gdb say?

-- Seb

SebWouters commented 7 years ago

@loriab

Can you also provide your cmake setup command (the options)?

It will be most likely due to https://github.com/SebWouters/CheMPS2/blob/master/CheMPS2/Sobject.cpp#L415, which should be perfectly OK.

My setup:

seba@latitude-7350:~/Desktop/CheMPS2_github/build$ CXX=icpc CC=icc cmake .. -DMKL=ON -DCMAKE_INSTALL_PREFIX=/usr 
-- The C compiler identification is Intel 16.0.3.20160415
-- The CXX compiler identification is Intel 16.0.3.20160415
-- Check for working C compiler: /opt/intel/compilers_and_libraries_2016.3.210/linux/bin/intel64/icc
-- Check for working C compiler: /opt/intel/compilers_and_libraries_2016.3.210/linux/bin/intel64/icc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler: /opt/intel/compilers_and_libraries_2016.3.210/linux/bin/intel64/icpc
-- Check for working CXX compiler: /opt/intel/compilers_and_libraries_2016.3.210/linux/bin/intel64/icpc -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Try OpenMP C flag = [-qopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Try OpenMP CXX flag = [-qopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Found OpenMP: -qopenmp  
-- Performing Test HAS_XHOST
-- Performing Test HAS_XHOST - Success
-- Performing Test HAS_MARCH_NATIVE
-- Performing Test HAS_MARCH_NATIVE - Success
-- Performing Test HAS_IPO
-- Performing Test HAS_IPO - Success
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Failed
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - found
-- Found Threads: TRUE  
-- Looking for sgemm_
-- Looking for sgemm_ - found
-- A library with BLAS API found.
-- Looking for cheev_
-- Looking for cheev_ - found
-- A library with LAPACK API found.
-- Found HDF5: /usr/lib/x86_64-linux-gnu/hdf5/serial/lib/libhdf5.so;/usr/lib/x86_64-linux-gnu/libpthread.so;/usr/lib/x86_64-linux-gnu/libsz.so;/usr/lib/x86_64-linux-gnu/libz.so;/usr/lib/x86_64-linux-gnu/libdl.so;/usr/lib/x86_64-linux-gnu/libm.so (found version "1.8.16") 
-- Configuring done
-- Generating done
-- Build files have been written to: /home/seba/Desktop/CheMPS2_github/build

Libraries used by test1:

seba@latitude-7350:~/Desktop/CheMPS2_github/build/tests$ ldd test1
linux-vdso.so.1 =>  (0x00007ffec2964000)
libchemps2.so.2 => /home/seba/Desktop/CheMPS2_github/build/CheMPS2/libchemps2.so.2 (0x00007f950f6ac000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f950f310000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f950f006000)
libiomp5.so => /opt/intel/compilers_and_libraries_2016.3.210/linux/compiler/lib/intel64/libiomp5.so (0x00007f950ecc2000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f950eaac000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f950e88e000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f950e4c5000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f950e2c1000)
libmkl_intel_lp64.so => /opt/intel/compilers_and_libraries_2016.3.210/linux/mkl/lib/intel64/libmkl_intel_lp64.so (0x00007f950d7b0000)
libmkl_intel_thread.so => /opt/intel/compilers_and_libraries_2016.3.210/linux/mkl/lib/intel64/libmkl_intel_thread.so (0x00007f950be83000)
libmkl_core.so => /opt/intel/compilers_and_libraries_2016.3.210/linux/mkl/lib/intel64/libmkl_core.so (0x00007f950a472000)
libhdf5_serial.so.10 => /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10 (0x00007f9509fd5000)
libsz.so.2 => /usr/lib/x86_64-linux-gnu/libsz.so.2 (0x00007f9509dd2000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f9509bb7000)
libimf.so => /opt/intel/compilers_and_libraries_2016.3.210/linux/compiler/lib/intel64/libimf.so (0x00007f95096b9000)
libsvml.so => /opt/intel/compilers_and_libraries_2016.3.210/linux/compiler/lib/intel64/libsvml.so (0x00007f95087ad000)
libirng.so => /opt/intel/compilers_and_libraries_2016.3.210/linux/compiler/lib/intel64/libirng.so (0x00007f950843a000)
libintlc.so.5 => /opt/intel/compilers_and_libraries_2016.3.210/linux/compiler/lib/intel64/libintlc.so.5 (0x00007f95081ce000)
/lib64/ld-linux-x86-64.so.2 (0x000055dccf1ff000)
libaec.so.0 => /usr/lib/x86_64-linux-gnu/libaec.so.0 (0x00007f9507fc5000)

and test1 runs perfectly fine with several threads...

I'm on Ubuntu 16.04. Are you on a Mac perhaps?

-- Seb

loriab commented 7 years ago

Yes, I'm surprised this hasn't been seen before either, since, as you say, Intel/MKL is your primary workflow. I always assumed the tests failed because I was doing something stupid, so I was surprised to find compile conditions that isolate the problem. Below are some answers and profiles. Hopefully something strikes you as relevant.

Questions

This is Linux, RHEL7, Linux psinet 3.10.0-327.4.5.el7.x86_64 #1 SMP Thu Jan 21 04:10:29 EST 2016 x86_64 x86_64 x86_64 GNU/Linux. No cross-compilation going on; that is, same machine for building and running. ENABLE_XHOST setting doesn't matter. Intel compiler version icpc (ICC) 16.0.2 20160204 and MKL presumably 11.3.2.

setup commands

source /path/to/intel2016/bin/compilervars.sh intel64

cmake -H. -Bobjdir1 \
 -DCMAKE_C_COMPILER=icc \
 -DCMAKE_CXX_COMPILER=icpc \
 -DSHARED_ONLY=ON \
 -DENABLE_XHOST=OFF \
 -DMKL=ON \
 -DCMAKE_INSTALL_PREFIX=/path/to/install-chemps2

CMake output

-- The C compiler identification is Intel 16.0.2.20160204
-- The CXX compiler identification is Intel 16.0.2.20160204
-- Check for working C compiler: /path/to/intel2016/compilers_and_libraries_2016.2.181/linux/bin/intel64/icc
-- Check for working C compiler: /path/to/intel2016/compilers_and_libraries_2016.2.181/linux/bin/intel64/icc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler: /path/to/intel2016/compilers_and_libraries_2016.2.181/linux/bin/intel64/icpc
-- Check for working CXX compiler: /path/to/intel2016/compilers_and_libraries_2016.2.181/linux/bin/intel64/icpc -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Try OpenMP C flag = [-qopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Try OpenMP CXX flag = [-qopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Found OpenMP: -qopenmp  
-- Performing Test HAS_IPO
-- Performing Test HAS_IPO - Success
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Failed
-- Looking for include file pthread.h
-- Looking for include file pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - found
-- Found Threads: TRUE  
-- Looking for sgemm_
-- Looking for sgemm_ - found
-- A library with BLAS API found.
-- Looking for cheev_
-- Looking for cheev_ - found
-- A library with LAPACK API found.
-- Found HDF5: /path/to/miniconda/envs/py2basics/lib/libhdf5.so;/usr/lib64/librt.so;/path/to/miniconda/envs/py2basics/lib/libz.so;/usr/lib64/libdl.so;/path/to/miniconda/lib/libm.so (found version "1.8.17") 
-- Configuring done
-- Generating done
-- Build files have been written to: /path/to/CheMPS2/objdir1
...
Test project /theoryfs2/ds/cdsgroup/psi4-compile/externals/CheMPS2/objdir1
      Start  1: test1
 1/14 Test  #1: test1 ............................***Exception: SegFault  0.11 sec
      Start  2: test2
 2/14 Test  #2: test2 ............................***Exception: SegFault  0.27 sec
      Start  3: test3
 3/14 Test  #3: test3 ............................***Exception: SegFault  0.12 sec
      Start  4: test4
 4/14 Test  #4: test4 ............................***Exception: SegFault  0.13 sec
      Start  5: test5
 5/14 Test  #5: test5 ............................***Exception: SegFault  0.12 sec
      Start  6: test6
 6/14 Test  #6: test6 ............................***Exception: SegFault  0.27 sec
      Start  7: test7
 7/14 Test  #7: test7 ............................   Passed    0.05 sec
      Start  8: test8
 8/14 Test  #8: test8 ............................***Exception: SegFault  0.09 sec
      Start  9: test9
 9/14 Test  #9: test9 ............................***Exception: SegFault  0.23 sec
      Start 10: test10
10/14 Test #10: test10 ...........................***Exception: SegFault  0.12 sec
      Start 11: test11
11/14 Test #11: test11 ...........................***Exception: SegFault  0.13 sec
      Start 12: test12
12/14 Test #12: test12 ...........................***Exception: SegFault  0.14 sec
      Start 13: test13
13/14 Test #13: test13 ...........................   Passed    0.83 sec
      Start 14: test14
14/14 Test #14: test14 ...........................***Exception: SegFault  0.26 sec

Libraries used by Test

ldd tests/test1
    linux-vdso.so.1 =>  (0x00007ffe247fc000)
    libchemps2.so.2 => /theoryfs2/ds/cdsgroup/psi4-compile/externals/CheMPS2/objdir1/CheMPS2/libchemps2.so.2 (0x00007f7ca02d5000)
    libmkl_intel_lp64.so => /theoryfs2/common/software/intel2016/compilers_and_libraries_2016.2.181/linux/mkl/lib/intel64/libmkl_intel_lp64.so (0x00007f7c9f96e000)
    libmkl_intel_thread.so => /theoryfs2/common/software/intel2016/compilers_and_libraries_2016.2.181/linux/mkl/lib/intel64/libmkl_intel_thread.so (0x00007f7c9e319000)
    libmkl_core.so => /theoryfs2/common/software/intel2016/compilers_and_libraries_2016.2.181/linux/mkl/lib/intel64/libmkl_core.so (0x00007f7c9c991000)
    libiomp5.so => /theoryfs2/common/software/intel2016/compilers_and_libraries_2016.2.181/linux/compiler/lib/intel64/libiomp5.so (0x00007f7c9c650000)
    libm.so.6 => /theoryfs2/ds/cdsgroup/miniconda/lib/libm.so.6 (0x00000032b9600000)
    libhdf5.so.10 => /theoryfs2/ds/cdsgroup/miniconda/envs/py2basics/lib/libhdf5.so.10 (0x00007f7c9c16f000)
    librt.so.1 => /lib64/librt.so.1 (0x00007f7c9bf41000)
    libz.so.1 => /theoryfs2/ds/cdsgroup/miniconda/envs/py2basics/lib/libz.so.1 (0x00007f7c9bd2b000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f7c9bb27000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f7c9b81d000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f7c9b607000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f7c9b3eb000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f7c9b029000)
    libimf.so => /theoryfs2/common/software/intel2016/compilers_and_libraries_2016.2.181/linux/compiler/lib/intel64/libimf.so (0x00007f7c9ab2c000)
    libsvml.so => /theoryfs2/common/software/intel2016/compilers_and_libraries_2016.2.181/linux/compiler/lib/intel64/libsvml.so (0x00007f7c99c6f000)
    libirng.so => /theoryfs2/common/software/intel2016/compilers_and_libraries_2016.2.181/linux/compiler/lib/intel64/libirng.so (0x00007f7c9990f000)
    libintlc.so.5 => /theoryfs2/common/software/intel2016/compilers_and_libraries_2016.2.181/linux/compiler/lib/intel64/libintlc.so.5 (0x00007f7c996a2000)

Comparing the ldd profiles, the only things I notice are that (1) you additionally link to libsz and libaec, while I link to librt and (2) you are linking to libhdf5_serial while I link to libhdf5. This latter point is perhaps a lead. I don't see the hdf5_serial in the Found HDF5 bit of your cmake output.

gdb

Using the ./chemps2 --file=test14.input test, since that's easily accessible as a command for gdb.

cdsgroup@bash:psinet:/theoryfs2/ds/cdsgroup/psi4-compile/externals/CheMPS2/objdir1/CheMPS2: (cmtarget) gdb ./chemps2 
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /theoryfs2/ds/cdsgroup/psi4-compile/externals/CheMPS2/objdir1/CheMPS2/chemps2...(no debugging symbols found)...done.
(gdb) run --file=test14.input
Starting program: /theoryfs2/ds/cdsgroup/psi4-compile/externals/CheMPS2/objdir1/CheMPS2/./chemps2 --file=test14.input
Missing separate debuginfo for /theoryfs2/ds/cdsgroup/miniconda/lib/libm.so.6
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Running chemps2 version 1.8-1 (2016-08-24) with the following options:

   FCIDUMP            = /theoryfs2/ds/cdsgroup/psi4-compile/externals/CheMPS2/tests/matrixelements/N2.CCPVDZ.FCIDUMP
   GROUP              = d2h
   MULTIPLICITY       = 1
   NELECTRONS         = 14
   IRREP              = Ag
   EXCITATION         = 0
   SWEEP_STATES       = [ 500 ; 1000 ]
   SWEEP_ENERGY_CONV  = [ 1e-10 ; 1e-10 ]
   SWEEP_MAX_SWEEPS   = [ 3 ; 10 ]
   SWEEP_NOISE_PREFAC = [ 0.1 ; 0 ]
   SWEEP_DVDSON_RTOL  = [ 1e-05 ; 1e-10 ]
   NOCC               = [ 1 ; 0 ; 0 ; 0 ; 0 ; 1 ; 0 ; 0 ]
   NACT               = [ 4 ; 0 ; 1 ; 1 ; 0 ; 4 ; 1 ; 1 ]
   NVIR               = [ 2 ; 1 ; 2 ; 2 ; 1 ; 2 ; 2 ; 2 ]
   SCF_STATE_AVG      = FALSE
   SCF_DIIS_THR       = 0.01
   SCF_GRAD_THR       = 1e-07
   SCF_MAX_ITER       = 50
   SCF_ACTIVE_SPACE   = L : localized and ordered orbitals
   SCF_MOLDEN         = 
   CASPT2_CALC        = TRUE
   CASPT2_ORBS        = A : as specified in SCF_ACTIVE_SPACE
   CASPT2_IPEA        = 0
   CASPT2_IMAG        = 0
   CASPT2_CHECKPT     = TRUE
   CASPT2_CUMUL       = FALSE
   PRINT_CORR         = FALSE
   TMP_FOLDER         = /tmp

CheMPS2::Hamiltonian : Reading FCIDUMP file /theoryfs2/ds/cdsgroup/psi4-compile/externals/CheMPS2/tests/matrixelements/N2.CCPVDZ.FCIDUMP
CheMPS2::Hamiltonian : Finished reading FCIDUMP file /theoryfs2/ds/cdsgroup/psi4-compile/externals/CheMPS2/tests/matrixelements/N2.CCPVDZ.FCIDUMP
NORB  = [ 7 , 1 , 3 , 3 , 1 , 7 , 3 , 3 ]
NOCC  = [ 1 , 0 , 0 , 0 , 0 , 1 , 0 , 0 ]
NDMRG = [ 4 , 0 , 1 , 1 , 0 , 4 , 1 , 1 ]
NVIRT = [ 2 , 1 , 2 , 2 , 1 , 2 , 2 , 2 ]
DMRGSCF::setupStart : Number of variables in the x-matrix = 36
[New Thread 0x7ffff7f4a700 (LWP 31761)]
[New Thread 0x7fffeeb71780 (LWP 31762)]
[New Thread 0x7fffee770800 (LWP 31763)]
[New Thread 0x7fffee36f880 (LWP 31764)]
[New Thread 0x7fffedf6e900 (LWP 31765)]
[New Thread 0x7fffedb6d980 (LWP 31766)]
[New Thread 0x7fffed76ca00 (LWP 31767)]
[New Thread 0x7fffed36ba80 (LWP 31768)]
   EdmistonRuedenberg::Optimize : Cost function at start = 6.1701539185301
                                  Cost function at stop  = 6.99966515471458
                                  Gradient norm = 2.26215573786153e-09 after 6 iterations.
   EdmistonRuedenberg::FiedlerExchange : Cost function at start = 0.317120286043243
   EdmistonRuedenberg::FiedlerExchange : Cost function at end   = 0.23855195861958
DMRGSCF::solve : Rotated the active space to localized orbitals, sorted according to the exchange matrix.

   CheMPS2: a spin-adapted implementation of DMRG for ab initio quantum chemistry
   Copyright (C) 2013-2016 Sebastian Wouters

   This program is free software; you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation; either version 2 of the License, or
   (at your option) any later version.

   This program is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License along
   with this program; if not, write to the Free Software Foundation, Inc.,
   51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

   Stats: nIt(DAVIDSON) = 8

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffed76ca00 (LWP 31767)]
__kmpc_critical_with_hint (loc=0x7ffff7dcd9d0 <.2.3499_2_kmpc_loc_struct_pack.318>, global_tid=6, crit=0x7ffff929f7c0, hint=0) at ../../src/kmp_csupport.c:1222
1222    ../../src/kmp_csupport.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install glibc-2.17-106.el7_2.1.x86_64 libgcc-4.8.5-4.el7.x86_64 libstdc++-4.8.5-4.el7.x86_64
(gdb) bt
#0  __kmpc_critical_with_hint (loc=0x7ffff7dcd9d0 <.2.3499_2_kmpc_loc_struct_pack.318>, global_tid=6, crit=0x7ffff929f7c0, hint=0) at ../../src/kmp_csupport.c:1222
#1  0x00007ffff7afbbd6 in CheMPS2::Sobject::Split(CheMPS2::TensorT*, CheMPS2::TensorT*, int, bool, bool) () from /theoryfs2/ds/cdsgroup/psi4-compile/externals/CheMPS2/objdir1/CheMPS2/libchemps2.so.2
#2  0x00007ffff3d949b3 in __kmp_invoke_microtask () from /theoryfs2/common/software/intel2016/compilers_and_libraries_2016.2.181/linux/compiler/lib/intel64/libiomp5.so
#3  0x00007ffff3d639a7 in __kmp_invoke_task_func (gtid=-136521264) at ../../src/kmp_runtime.c:7098
#4  0x00007ffff3d63095 in __kmp_launch_thread (this_thr=0x7ffff7dcd9d0 <.2.3499_2_kmpc_loc_struct_pack.318>) at ../../src/kmp_runtime.c:5715
#5  0x00007ffff3d94d33 in __kmp_launch_worker (thr=0x7ffff7dcd9d0 <.2.3499_2_kmpc_loc_struct_pack.318>) at ../../src/z_Linux_util.c:769
#6  0x00007ffff2abcdc5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007ffff27ea21d in clone () from /lib64/libc.so.6
(gdb)

loriab commented 7 years ago

Also, I'm not explicitly running any of these in parallel, just the normal make test or ctest setup. There are no thread-related environment variables set. And an explicitly serial run, OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 ./chemps2 --file=test14.input, segfaults in the same way.

SebWouters commented 7 years ago

Hi @loriab

My kernel version is Linux 4.4.0-34-generic #53-Ubuntu SMP Wed Jul 27 16:06:39 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux, which is a bit newer, but which shouldn't matter at all.

Same thing with the intel compiler version: 16.0.3.20160415. I never ran 16.0.2, only 16.0.1 and 16.0.3, so perhaps there is a bug inside 16.0.2. Again something extremely unlikely.

In CheMPS2::Sobject::Split, there are only two omp directives: https://github.com/SebWouters/CheMPS2/blob/e63f11c6cf53a6434ae9a1f5278ea218517e2f2f/CheMPS2/Sobject.cpp#L318 https://github.com/SebWouters/CheMPS2/blob/e63f11c6cf53a6434ae9a1f5278ea218517e2f2f/CheMPS2/Sobject.cpp#L415

namely

#pragma omp parallel for schedule(dynamic)
#pragma omp critical

which are as plain vanilla as you can imagine. So if @wpoely86 doesn't have a clue either, I suggest you file a bug report against the intel compiler, as versions pre-16, 16.0.1, and 16.0.3 work perfectly fine, as do gcc4.8, gcc5, and clang3.8... They are usually quite responsive.
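For readers who don't have Sobject.cpp open, here is a minimal sketch of how those two directives are typically combined. It is not the CheMPS2 code: `not_thread_safe_kernel` and `process_blocks` are hypothetical names, and the kernel is just a stand-in for whatever library call the critical section protects.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a library call that may not be thread-safe.
double not_thread_safe_kernel(double x) { return 2.0 * x; }

// Apply the kernel to every block. Iterations are handed out dynamically,
// and the kernel call itself is serialized by the critical section.
std::vector<double> process_blocks(const std::vector<double>& blocks) {
    assert(!blocks.empty());  // illustrative precondition only
    std::vector<double> result(blocks.size(), 0.0);
    const int n = static_cast<int>(blocks.size());

    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; ++i) {
        double value = 0.0;
        #pragma omp critical
        {
            value = not_thread_safe_kernel(blocks[i]);
        }
        result[i] = value;  // each iteration writes only its own slot
    }
    return result;
}
```

Without an OpenMP flag the pragmas are ignored and the loop runs serially, so the sketch builds either way.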

Best wishes, Sebastian

SebWouters commented 7 years ago

Check footnote 6 on page 3 of http://graal.ens-lyon.fr/~desprez/WS/PBIO/Euro-Par_2016_WS_paper_79.pdf.

loriab commented 7 years ago

Just acquired and tried the newer Intel 16.0.3. Fault still present.

I'm still wondering if the hdf5 differences could have anything to do with it. From my libhdf5.settings file:

Features:
---------
                  Parallel HDF5: no
             High Level library: yes
                   Threadsafety: no

What does yours say? Are you building it threadsafe or using the Ubuntu repo?

SebWouters commented 7 years ago

Hi @loriab,

On my laptop, I'm just using the ubuntu repo (serial version). I have made explicitly sure that HDF5 is called only once per process (i.e. in non-threaded parts of the code). And when multiple processes run, either different files are used for the processes, or the master reads it in and then broadcasts.

On the UGent HPC, it's a custom build of HDF5 and CheMPS2. The easybuild files can be found here: https://github.com/hpcugent/easybuild-easyconfigs/tree/master/easybuild/easyconfigs/h/HDF5 and https://github.com/hpcugent/easybuild-easyconfigs/blob/master/easybuild/easyconfigs/c/CheMPS2. In the beginning, I always used the serial versions. Recently, the HPC team has started using the regular (parallel) builds, and everything seems to work fine as well.

In the Sobject class, no use of HDF5 is made anyway.

It seems to be a nasty bug to catch... If you compile with low optimization flags, and debugging symbols on, does the bug persist? Which line does it crash on?

Best wishes, Sebastian

SebWouters commented 7 years ago

Hi @loriab,

gdb starts its output with

#0  __kmpc_critical_with_hint

I guess that it crashes exactly during the lines 414-417 of https://github.com/SebWouters/CheMPS2/blob/07fc50846f1d5943b9a1de40dcc9951573205cf3/CheMPS2/Sobject.cpp#L414.

Does the error persist if you just comment out the "#pragma omp critical" line? For MKL that's fine. It's there for ATLAS.

Best wishes, Sebastian

loriab commented 7 years ago

Thanks, commenting out that #pragma omp critical did fix the problem, so I set up that line not to compile for MKL via https://github.com/loriab/CheMPS2/commit/ed4470f07df4eebc534528c645b76721f3177efa . That commit will close the ticket.

SebWouters commented 7 years ago

Hi @loriab,

Looks good! Perhaps change the flag to -DCHEMPS2_MKL, as -DMKL might interfere with other code's options at some point?

If you make a PR to the current head, I'll accept it :-).

Best wishes, Sebastian

loriab commented 7 years ago

Thanks. I went ahead and prepared a separate PR. The -DMKL could have avoided interfering with other code's options by setting it PRIVATE through target_compile_definitions, I think. But I switched it to -DCHEMPS2_MKL anyways, so all should be in order.
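For anyone landing here later, a rough sketch of what "not compiling that line for MKL" can look like. This is an illustration, not the contents of the linked commit: it assumes the build defines `CHEMPS2_MKL` for MKL builds (e.g. via -DCHEMPS2_MKL), and `kernel` and `process_blocks_guarded` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a math-library call. With MKL it is assumed to
// be safe to call from several threads; with other libraries it may not be.
double kernel(double x) { return x * x; }

std::vector<double> process_blocks_guarded(const std::vector<double>& blocks) {
    assert(!blocks.empty());  // illustrative precondition only
    std::vector<double> result(blocks.size(), 0.0);
    const int n = static_cast<int>(blocks.size());

    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; ++i) {
        double value = 0.0;
        // The critical section is only compiled in when CHEMPS2_MKL is NOT
        // defined; under MKL the braces below are just an ordinary block.
#ifndef CHEMPS2_MKL
        #pragma omp critical
#endif
        {
            value = kernel(blocks[i]);
        }
        result[i] = value;  // each iteration writes only its own slot
    }
    return result;
}
```

The result is the same with or without the macro; only the serialization around the kernel call changes.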