Open tkoskela opened 2 months ago
On my Ubuntu-22.04 laptop with gcc version 11.4.0, on the develop branch at commit 78325c916b68649b46524dfee7fadf1127c4299c, I get
test_001_bulk_Si_1proc_Diag/Conquest_out: |* Harris-Foulkes energy = -33.679289916138416 Ha
test_002_bulk_Si_1proc_OrderN/Conquest_out: |* Harris-Foulkes energy = -33.569389500697085 Ha
test_003_bulk_BTO_polarisation/Conquest_out: |* Harris-Foulkes energy = -136.657600397396351 Ha
test_001_bulk_Si_1proc_Diag/Conquest_out: |* Harris-Foulkes energy = -33.679289916138359 Ha
test_002_bulk_Si_1proc_OrderN/Conquest_out: |* Harris-Foulkes energy = -33.569389501217962 Ha
test_003_bulk_BTO_polarisation/Conquest_out: |* Harris-Foulkes energy = -136.657600376430253 Ha
test_001_bulk_Si_1proc_Diag/Conquest_out: |* Harris-Foulkes energy = -33.679289916138323 Ha
test_002_bulk_Si_1proc_OrderN/Conquest_out: |* Harris-Foulkes energy = -33.569389497534459 Ha
test_003_bulk_BTO_polarisation/Conquest_out: |* Harris-Foulkes energy = -136.657601071024885 Ha
The largest relative differences between these are on the order of 1e-9, so the develop branch seems to be OK.
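As a quick check of that estimate, here is a minimal Python sketch that computes the relative spread per test from the three sets of values listed above. The helper name max_relative_spread is illustrative, and the printed energies carry more digits than a double holds, so the last digits are rounded on input.

```python
# Harris-Foulkes energies (Ha) copied from the three sets of output above.
energies = {
    "test_001_bulk_Si_1proc_Diag": [
        -33.679289916138416, -33.679289916138359, -33.679289916138323],
    "test_002_bulk_Si_1proc_OrderN": [
        -33.569389500697085, -33.569389501217962, -33.569389497534459],
    "test_003_bulk_BTO_polarisation": [
        -136.657600397396351, -136.657600376430253, -136.657601071024885],
}


def max_relative_spread(values):
    """Largest pairwise difference divided by the largest magnitude."""
    return (max(values) - min(values)) / max(abs(v) for v in values)


spreads = {name: max_relative_spread(vals) for name, vals in energies.items()}
for name, spread in spreads.items():
    print(f"{name}: {spread:.2e}")
```

The largest spread comes from test_003; test_001 agrees to roughly machine precision.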
On Myriad, with the current develop branch, I'm getting a segfault in test_001 with 1 MPI process; with 2 processes it runs fine. It also seems related to MPI_alltoallv.
$ mpirun -np 1 ../../bin/Conquest
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
Conquest 000000000080EC7A for__signal_handl Unknown Unknown
libpthread-2.17.s 00002B9D1547B630 Unknown Unknown Unknown
libmpi.so.12.0.0 00002B9D17CB2A28 Unknown Unknown Unknown
libmpi.so.12.0.0 00002B9D17A60A8B Unknown Unknown Unknown
libmpi.so.12.0.0 00002B9D17AAD56B Unknown Unknown Unknown
libmpi.so.12.0.0 00002B9D17A8C780 Unknown Unknown Unknown
libmpi.so.12.0.0 00002B9D17B8E36F Unknown Unknown Unknown
libmpi.so.12.0.0 00002B9D17A61A55 PMPI_Alltoallv Unknown Unknown
libmpifort.so.12. 00002B9D17650DDA pmpi_alltoallv__ Unknown Unknown
Conquest 00000000004CAEC3 Unknown Unknown Unknown
Conquest 00000000004E53C0 Unknown Unknown Unknown
Conquest 00000000004E4B80 Unknown Unknown Unknown
Conquest 00000000004CFF85 Unknown Unknown Unknown
Conquest 00000000004F9A45 Unknown Unknown Unknown
Conquest 00000000004F7FC0 Unknown Unknown Unknown
Conquest 00000000005F30AE Unknown Unknown Unknown
Conquest 00000000005F8952 Unknown Unknown Unknown
Conquest 0000000000411522 Unknown Unknown Unknown
Conquest 0000000000411492 Unknown Unknown Unknown
libc-2.17.so 00002B9D1969B555 __libc_start_main Unknown Unknown
Conquest 00000000004113A9 Unknown Unknown Unknown
[cceaosk@node-d97a-005 test_001_bulk_Si_1proc_Diag]$ mpirun -np 2 ../../bin/Conquest
[cceaosk@node-d97a-005 test_001_bulk_Si_1proc_Diag]$
Built with
[cceaosk@node-d97a-005 src]$ module list
Currently Loaded Modulefiles:
1) beta-modules 5) libxc/6.2.2/intel-2022 9) userscripts/1.4.0 13) python3/3.11
2) gcc-libs/10.2.0 6) gerun 10) openssl/1.1.1u
3) compilers/intel/2022.2 7) git/2.41.0-lfs-3.3.0 11) python/3.11.4
4) mpi/intel/2021.6.0/intel 8) emacs/28.1 12) openblas/0.3.7-serial/gnu-4.9.2
and system.myriad.make:
# Set compilers
FC=mpif90
F77=mpif77
# OpenMP flags
# Set this to "OMPFLAGS= " if compiling without openmp
# Set this to "OMPFLAGS= -fopenmp" if compiling with openmp
OMPFLAGS= -fopenmp
# Set BLAS and LAPACK libraries
# MacOS X
# BLAS= -lvecLibFort
# Intel MKL use the Intel tool
# Generic
#BLAS= -llapack -lblas
# LibXC: choose between LibXC compatibility below or Conquest XC library
# Conquest XC library
#XC_LIBRARY = CQ
#XC_LIB =
#XC_COMPFLAGS =
# LibXC compatibility
# Choose LibXC version: v4 (deprecated) or v5/6 (v5 and v6 have the same interface)
# XC_LIBRARY = LibXC_v4
#XC_LIB = -L/shared/ucl/apps/libxc/4.2.3/intel-2018/lib -lxcf90 -lxc
#XC_COMPFLAGS = -I/shared/ucl/apps/libxc/4.2.3/intel-2018/include
XC_LIBRARY = LibXC_v5
XC_LIB = -lxcf90 -lxc
XC_COMPFLAGS = -I/usr/local/include
# Compilation flags
# NB for gcc10 you need to add -fallow-argument-mismatch
COMPFLAGS= -O3 -g $(OMPFLAGS) $(XC_COMPFLAGS) -I"${MKLROOT}/include"
COMPFLAGS_F77= $(COMPFLAGS)
# Set FFT library
FFT_LIB=-lmkl_rt
FFT_OBJ=fft_fftw3.o
# Full library call; remove scalapack if using dummy diag module
# If using OpenMPI, use -lscalapack-openmpi instead.
#LIBS= $(FFT_LIB) $(XC_LIB) -lscalapack $(BLAS)
LIBS= $(FFT_LIB) $(XC_LIB)
# Linking flags
LINKFLAGS= -L${MKLROOT}/lib/intel64 -lmkl_scalapack_lp64 -lmkl_cdft_core -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -liomp5 -lpthread -ldl $(OMPFLAGS) $(XC_LIB)
ARFLAGS=
# Matrix multiplication kernel type
MULT_KERN = ompGemm_m
# Use dummy DiagModule or not
DIAG_DUMMY =
# Use dummy omp_module or not.
# Set this to "OMP_DUMMY = DUMMY" if compiling without openmp
# Set this to "OMP_DUMMY = " if compiling with openmp
OMP_DUMMY =
TODO: @tkoskela to test if this happens on ARCHER2
Ran benchmarks/matrix_multiply and tests 001 and 002 on ARCHER2 with 1 and 2 MPI ranks. No segfaults.
On ARCHER2 I build using cray-mpich, so possibly this is an Intel MPI related bug?
I activated the CI for f-exx-opt, and in test_004 there is a 1e-5 relative difference in the output when running on 1 MPI process, which is causing the test to fail.
https://github.com/OrderN/CONQUEST-release/actions/runs/8815200505/job/24208939487
There may be a bug in the MPI communication that appears when running on one process. Collecting hints in this issue.
In test_004 of f-exx-opt we notice a difference on the order of 1e-5 in the Harris-Foulkes energy when running on one MPI process, compared to running on multiple processes. In conversation with @lionelalexandre it came up that he has been aware of this for some time. Other tests in the testsuite have a tolerance of 1e-4, so they might be missing this.
When running the code in the DDT debugger on Myriad with one MPI process, we get a segfault in https://github.com/OrderN/CONQUEST-release/blob/6bf8f4a8c20fd4fa8f1c7baeb8a6b1f23a6d2408/src/generic_comms.f90#L1780-L1782 I haven't yet found an obvious reason why.
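To illustrate why a 1e-4 tolerance could hide this, here is a small Python sketch. It assumes the testsuite compares values relatively, which I have not checked; the function within_tolerance and the numbers are hypothetical, not taken from the testsuite.

```python
def within_tolerance(value, reference, rel_tol):
    """True when |value - reference| <= rel_tol * |reference|."""
    return abs(value - reference) <= rel_tol * abs(reference)


# Hypothetical numbers: a reference energy and a result that differs
# from it by a relative 1e-5, the size of the discrepancy seen in test_004.
reference = -100.0
result = reference * (1.0 + 1e-5)

loose = within_tolerance(result, reference, rel_tol=1e-4)  # looser tolerance
tight = within_tolerance(result, reference, rel_tol=1e-6)  # tighter tolerance
print(loose, tight)
```

With these numbers the check passes at 1e-4 and fails at 1e-6, so a discrepancy of this size would only show up in tests with a tighter tolerance.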
MPI_alltoallv is complicated. Obviously on 1 process it should be doing nothing.
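For reference while collecting hints, here is a toy pure-Python model of what MPI_Alltoallv is specified to do. It is only a sketch of the documented semantics, not the code in generic_comms.f90; the function name, the list-of-lists argument layout and the recv_sizes argument are illustrative assumptions. It shows that on one rank the call reduces to a local copy of the rank's own segment, and that counts and displacements still have to fit inside both buffers.

```python
def alltoallv(sendbufs, sendcounts, sdispls, recvcounts, rdispls, recv_sizes):
    """Pure-Python model of MPI_Alltoallv over len(sendbufs) ranks.

    Rank i sends sendcounts[i][j] items starting at sdispls[i][j] to rank j,
    which stores them starting at rdispls[j][i] in its receive buffer.
    """
    nranks = len(sendbufs)
    recvbufs = [[None] * recv_sizes[j] for j in range(nranks)]
    for i in range(nranks):
        for j in range(nranks):
            count = sendcounts[i][j]
            if count != recvcounts[j][i]:
                raise ValueError(f"count mismatch between rank {i} and rank {j}")
            s0, r0 = sdispls[i][j], rdispls[j][i]
            # An MPI library generally cannot check these bounds, because it
            # does not know the buffer sizes; an overrun there can corrupt
            # memory or segfault instead of raising an error.
            if s0 < 0 or s0 + count > len(sendbufs[i]):
                raise IndexError(f"rank {i} send segment for rank {j} out of bounds")
            if r0 < 0 or r0 + count > len(recvbufs[j]):
                raise IndexError(f"rank {j} receive segment from rank {i} out of bounds")
            recvbufs[j][r0:r0 + count] = sendbufs[i][s0:s0 + count]
    return recvbufs


# One rank: the call is just a copy of the rank's own segment.
single = alltoallv([[1.0, 2.0, 3.0]], [[3]], [[0]], [[3]], [[0]], [3])
print(single)
```

One thing this suggests checking on the single-process path is whether the counts and displacements passed for rank 0 are consistent with the allocated sizes of the send and receive buffers.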