OrderN / CONQUEST-release

Full public release of the large-scale and linear-scaling DFT code CONQUEST
http://www.order-n.org/
MIT License

Possible bug when running on one MPI process #321

Open · tkoskela opened this issue 2 months ago

tkoskela commented 2 months ago

There is possibly a bug in the MPI communication that appears when running on one MPI process. I'm collecting hints in this issue.

In test_004 of f-exx-opt we notice a difference on the order of 1e-5 in the Harris-Foulkes energy when running on one MPI process compared to running on multiple processes. In conversation with @lionelalexandre it came up that he has been aware of this for some time. Other tests in the testsuite have a tolerance of 1e-4, so they might be missing this.
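
A hedged sketch of why such a discrepancy could slip through (the actual testsuite comparison logic may differ, and the values here are illustrative):

program tol_sketch
  implicit none
  real(8), parameter :: e_ref  = -33.569389500697085d0 ! illustrative reference energy
  real(8), parameter :: e_test = e_ref + 1.0d-5        ! a 1e-5 discrepancy
  ! A 1e-4 tolerance, as used by the other tests, accepts the difference
  print *, 'within 1e-4 tolerance?', abs(e_test - e_ref) < 1.0d-4 ! prints T
end program tol_sketch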

When running the code in the DDT debugger on Myriad with one MPI process, we get a segfault at https://github.com/OrderN/CONQUEST-release/blob/6bf8f4a8c20fd4fa8f1c7baeb8a6b1f23a6d2408/src/generic_comms.f90#L1780-L1782. I haven't yet found an obvious reason why; MPI_alltoallv is complicated, but on one process it should effectively be doing nothing.
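
For reference, a minimal sketch of the single-process guard that expectation implies. This is not the actual generic_comms.f90 code; all names, counts and buffer layout are illustrative:

program guard_sketch
  use mpi
  implicit none
  integer :: ierr, nproc, i
  integer, allocatable :: counts(:), displs(:)
  real(8), allocatable :: sendbuf(:), recvbuf(:)

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
  allocate(counts(nproc), displs(nproc))
  counts = 2                             ! two doubles exchanged per rank pair
  displs = (/ (2*(i-1), i = 1, nproc) /) ! contiguous, zero-based offsets
  allocate(sendbuf(2*nproc), recvbuf(2*nproc))
  sendbuf = 1.0d0

  if (nproc == 1) then
     recvbuf = sendbuf ! one process: the all-to-all is just a self-copy
  else
     call MPI_Alltoallv(sendbuf, counts, displs, MPI_DOUBLE_PRECISION, &
                        recvbuf, counts, displs, MPI_DOUBLE_PRECISION, &
                        MPI_COMM_WORLD, ierr)
  end if
  call MPI_Finalize(ierr)
end program guard_sketch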

tkoskela commented 2 months ago

On my Ubuntu 22.04 laptop with gcc 11.4.0, on the develop branch at commit 78325c916b68649b46524dfee7fadf1127c4299c, I get

One MPI process

test_001_bulk_Si_1proc_Diag/Conquest_out:      |* Harris-Foulkes energy   =       -33.679289916138416 Ha
test_002_bulk_Si_1proc_OrderN/Conquest_out:      |* Harris-Foulkes energy   =       -33.569389500697085 Ha
test_003_bulk_BTO_polarisation/Conquest_out:      |* Harris-Foulkes energy   =      -136.657600397396351 Ha

Two MPI processes

test_001_bulk_Si_1proc_Diag/Conquest_out:      |* Harris-Foulkes energy   =       -33.679289916138359 Ha
test_002_bulk_Si_1proc_OrderN/Conquest_out:      |* Harris-Foulkes energy   =       -33.569389501217962 Ha
test_003_bulk_BTO_polarisation/Conquest_out:      |* Harris-Foulkes energy   =      -136.657600376430253 Ha

Four MPI processes

test_001_bulk_Si_1proc_Diag/Conquest_out:      |* Harris-Foulkes energy   =       -33.679289916138323 Ha
test_002_bulk_Si_1proc_OrderN/Conquest_out:      |* Harris-Foulkes energy   =       -33.569389497534459 Ha
test_003_bulk_BTO_polarisation/Conquest_out:      |* Harris-Foulkes energy   =      -136.657601071024885 Ha

The largest relative differences between these are on the order of 1e-9 (e.g. for test_003, |(-136.657601071024885) - (-136.657600397396351)| / 136.66 ≈ 6.7e-7 / 136.66 ≈ 5e-9), so the develop branch seems to be OK.

tkoskela commented 1 month ago

On Myriad, with the current develop branch, I'm getting a segfault in test_001 with 1 MPI process; with 2 processes it runs fine.

It also seems related to MPI_alltoallv:

$ mpirun -np 1 ../../bin/Conquest
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
Conquest           000000000080EC7A  for__signal_handl     Unknown  Unknown
libpthread-2.17.s  00002B9D1547B630  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002B9D17CB2A28  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002B9D17A60A8B  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002B9D17AAD56B  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002B9D17A8C780  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002B9D17B8E36F  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002B9D17A61A55  PMPI_Alltoallv        Unknown  Unknown
libmpifort.so.12.  00002B9D17650DDA  pmpi_alltoallv__      Unknown  Unknown
Conquest           00000000004CAEC3  Unknown               Unknown  Unknown
Conquest           00000000004E53C0  Unknown               Unknown  Unknown
Conquest           00000000004E4B80  Unknown               Unknown  Unknown
Conquest           00000000004CFF85  Unknown               Unknown  Unknown
Conquest           00000000004F9A45  Unknown               Unknown  Unknown
Conquest           00000000004F7FC0  Unknown               Unknown  Unknown
Conquest           00000000005F30AE  Unknown               Unknown  Unknown
Conquest           00000000005F8952  Unknown               Unknown  Unknown
Conquest           0000000000411522  Unknown               Unknown  Unknown
Conquest           0000000000411492  Unknown               Unknown  Unknown
libc-2.17.so       00002B9D1969B555  __libc_start_main     Unknown  Unknown
Conquest           00000000004113A9  Unknown               Unknown  Unknown
[cceaosk@node-d97a-005 test_001_bulk_Si_1proc_Diag]$ mpirun -np 2 ../../bin/Conquest
[cceaosk@node-d97a-005 test_001_bulk_Si_1proc_Diag]$ 
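
All the Conquest frames resolve to Unknown even though COMPFLAGS (below) include -g; the program counters in the trace can usually be mapped back to source lines with binutils' addr2line (assuming a non-PIE executable, so the traceback addresses match the symbols in the file), e.g. with the first Conquest frame above:

$ addr2line -f -e ../../bin/Conquest 0x4CAEC3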

Built with:

[cceaosk@node-d97a-005 src]$ module list
Currently Loaded Modulefiles:
  1) beta-modules                      5) libxc/6.2.2/intel-2022            9) userscripts/1.4.0                13) python3/3.11
  2) gcc-libs/10.2.0                   6) gerun                            10) openssl/1.1.1u
  3) compilers/intel/2022.2            7) git/2.41.0-lfs-3.3.0             11) python/3.11.4
  4) mpi/intel/2021.6.0/intel          8) emacs/28.1                       12) openblas/0.3.7-serial/gnu-4.9.2

and system.myriad.make:

# Set compilers
FC=mpif90
F77=mpif77

# OpenMP flags
# Set this to "OMPFLAGS= " if compiling without openmp
# Set this to "OMPFLAGS= -fopenmp" if compiling with openmp
OMPFLAGS= -fopenmp

# Set BLAS and LAPACK libraries
# MacOS X
# BLAS= -lvecLibFort
# Intel MKL use the Intel tool
# Generic
#BLAS= -llapack -lblas

# LibXC: choose between LibXC compatibility below or Conquest XC library

# Conquest XC library
#XC_LIBRARY = CQ
#XC_LIB =
#XC_COMPFLAGS =

# LibXC compatibility
# Choose LibXC version: v4 (deprecated) or v5/6 (v5 and v6 have the same interface)
# XC_LIBRARY = LibXC_v4
#XC_LIB = -L/shared/ucl/apps/libxc/4.2.3/intel-2018/lib -lxcf90 -lxc
#XC_COMPFLAGS = -I/shared/ucl/apps/libxc/4.2.3/intel-2018/include
XC_LIBRARY = LibXC_v5
XC_LIB = -lxcf90 -lxc
XC_COMPFLAGS = -I/usr/local/include

# Compilation flags
# NB for gcc10 you need to add -fallow-argument-mismatch
COMPFLAGS= -O3 -g $(OMPFLAGS) $(XC_COMPFLAGS) -I"${MKLROOT}/include"
COMPFLAGS_F77= $(COMPFLAGS)

# Set FFT library
FFT_LIB=-lmkl_rt
FFT_OBJ=fft_fftw3.o

# Full library call; remove scalapack if using dummy diag module
# If using OpenMPI, use -lscalapack-openmpi instead.
#LIBS= $(FFT_LIB) $(XC_LIB) -lscalapack $(BLAS)
LIBS= $(FFT_LIB) $(XC_LIB)

# Linking flags
LINKFLAGS= -L${MKLROOT}/lib/intel64 -lmkl_scalapack_lp64 -lmkl_cdft_core -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -liomp5 -lpthread -ldl $(OMPFLAGS) $(XC_LIB)
ARFLAGS=

# Matrix multiplication kernel type
MULT_KERN = ompGemm_m
# Use dummy DiagModule or not
DIAG_DUMMY =
# Use dummy omp_module or not.
# Set this to "OMP_DUMMY = DUMMY" if compiling without openmp
# Set this to "OMP_DUMMY = " if compiling with openmp
OMP_DUMMY = 

tkoskela commented 1 week ago

TODO: @tkoskela to test if this happens on ARCHER2

tkoskela commented 1 week ago

I ran benchmarks/matrix_multiply and tests 001 and 002 on ARCHER2 with 1 and 2 MPI ranks. No segfaults.

On ARCHER2 I build with cray-mpich, so possibly this is an Intel MPI-related bug?
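
To isolate that, a minimal standalone reproducer could be built against each MPI and run on a single rank. This is a sketch with arbitrary buffer sizes, not CONQUEST code; if the bare collective segfaults under Intel MPI but not under cray-mpich, the problem is in the library rather than in CONQUEST's call:

program alltoallv_repro
  use mpi
  implicit none
  integer :: ierr, nproc, i
  integer, allocatable :: counts(:), displs(:)
  real(8), allocatable :: sendbuf(:), recvbuf(:)

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
  allocate(counts(nproc), displs(nproc))
  counts = 2                             ! two doubles per rank pair
  displs = (/ (2*(i-1), i = 1, nproc) /)
  allocate(sendbuf(2*nproc), recvbuf(2*nproc))
  sendbuf = 1.0d0

  ! The bare call the traceback points at; on one rank this should be
  ! a harmless self-exchange
  call MPI_Alltoallv(sendbuf, counts, displs, MPI_DOUBLE_PRECISION, &
                     recvbuf, counts, displs, MPI_DOUBLE_PRECISION, &
                     MPI_COMM_WORLD, ierr)
  print *, 'MPI_alltoallv returned on', nproc, 'rank(s)'
  call MPI_Finalize(ierr)
end program alltoallv_repro

Compile and run with, e.g., $ mpif90 alltoallv_repro.f90 -o repro && mpirun -np 1 ./repro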

tkoskela commented 1 week ago

I activated the CI for f-exx-opt, and in test_004 there is a 1e-5 relative difference in the output when running on 1 MPI process, which is causing the test to fail.

https://github.com/OrderN/CONQUEST-release/actions/runs/8815200505/job/24208939487