OrderN / CONQUEST-release

Full public release of large scale and linear scaling DFT code CONQUEST
http://www.order-n.org/
MIT License

Possible issue with gcc@13 #301

Closed connoraird closed 3 months ago

connoraird commented 3 months ago

There may be other issues with my setup. However, we thought it best to raise an issue anyway.

After compiling with GCC 13 on a Mac, I get the following error when running test_001:

Error in process    1
  make_halo: no. of atoms in halo must be .ge. 1
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
  Proc: [[59652,0],0]
  Errorcode: 1

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
tsuyoshi38 commented 3 months ago

If you are compiling with "-O3" (set in system/system.*.make), could you try "-g" instead?
In my case, I also hit an error, though it looks different, and it goes away when I compile with -g.
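
For anyone wanting to try this quickly, here is a minimal sketch of swapping -O3 for -g on the COMPFLAGS line. It assumes the flags line has the stock form "COMPFLAGS= -O3 ..." as in the system makefiles; the function name swap_opt_for_debug is made up for illustration. It only reads stdin and prints the result, so nothing is modified in place.

```shell
#!/bin/sh
# Sketch only: replace -O3 with -g on the COMPFLAGS line of a system makefile.
# Assumption: the line starts with "COMPFLAGS=" and contains a single -O3.
# swap_opt_for_debug is a hypothetical helper name, not part of CONQUEST.
swap_opt_for_debug() {
    # The anchored pattern leaves COMPFLAGS_F77 and every other line untouched.
    sed 's/^\(COMPFLAGS=.*\)-O3/\1-g/'
}

# Demo on two lines in the stock format; prints the rewritten lines.
swap_opt_for_debug <<'EOF'
COMPFLAGS= -O3 $(OMPFLAGS) $(XC_COMPFLAGS) -fallow-argument-mismatch
COMPFLAGS_F77= $(COMPFLAGS)
EOF
```

To apply it to a real file, pipe the makefile through the function into a new file, check the result, and then do a clean rebuild so that every object is recompiled with the new flags.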

davidbowler commented 3 months ago

Can you give us a little more information about the setup, compilation and output, please? For reference, I compiled with GCC 13.2 and OpenMPI 4.1.6 on a Mac (running Ventura 13.6.3) with compilers installed via Homebrew, using -O3 and linking against FFTW v3.3.10 and LibXC v6.2.2 (also from Homebrew). I used the current version of Conquest on the develop branch (output gives Version comment: Git Branch: develop; tag, hash: v1.2-156-ge1759e68). I ran on one process (mpirun -np 1) and found no problems running.

If you want more help, can you attach your input files, system.make and output file, and give details of how you run the code?

connoraird commented 3 months ago

Thanks for the help. I've tried replicating your setup, @davidbowler. I have checked out the most recent develop branch (commit e1759e68c19649eccc8fe1098ee0714f9c628347) and am using GCC 13.2 installed via Homebrew. I was originally using OpenMPI 5.0.1 installed via Homebrew, and I've now also tried OpenMPI 4.1.6 installed with MacPorts; both give the same error as above. I am running the command mpirun -np 1 ../../bin/Conquest from within testsuite/test_001_bulk_Si_1proc_Diag. My system.*.make file is printed below.

# This is a system-specific makefile for my local system. You will need to adjust
# it for the actual system you are running on.

# Set compilers
FC=mpif90
F77=mpif77

# OpenMP flags
# Set this to "OMPFLAGS= " if compiling without openmp
# Set this to "OMPFLAGS= -fopenmp" if compiling with openmp
OMPFLAGS= -fopenmp

# Compilation flags
# NB for gcc10 you need to add -fallow-argument-mismatch
COMPFLAGS= -O3 $(OMPFLAGS) $(XC_COMPFLAGS) -fallow-argument-mismatch
COMPFLAGS_F77= $(COMPFLAGS)

# Set BLAS and LAPACK libraries
# MacOS X
BLAS= -lvecLibFort
# Intel MKL use the Intel tool
# Generic
# BLAS= -llapack -lblas

# LibXC: choose between LibXC compatibility below or Conquest XC library

# Conquest XC library
#XC_LIBRARY = CQ
#XC_LIB =
#XC_COMPFLAGS =

# LibXC compatibility
# Choose LibXC version: v4 (deprecated) or v5/6 (v5 and v6 have the same interface)
#XC_LIBRARY = LibXC_v4
XC_LIBRARY = LibXC_v5
XC_LIB = -lxcf90 -lxc
XC_COMPFLAGS = -I/usr/local/include -I/opt/local/include -I/opt/homebrew/include -I/opt/homebrew/Cellar/libxc/6.2.2/include -I/opt/homebrew/Cellar/fftw/3.3.10_1/include

# Set FFT library
FFT_LIB=-lfftw3
FFT_OBJ=fft_fftw3.o

# Full library call; remove -lscalapack if using dummy diag module.
# If using OpenMPI, use -lscalapack-openmpi instead.
LIBS= $(FFT_LIB) $(XC_LIB) -lscalapack $(BLAS)

# Linking flags
LINKFLAGS= -L/usr/local/lib -L/opt/local/lib -L/opt/homebrew/lib -L/opt/homebrew/Cellar/libxc/6.2.2/lib -L/opt/homebrew/Cellar/fftw/3.3.10_1/lib $(OMPFLAGS)
ARFLAGS=

# Matrix multiplication kernel type
MULT_KERN = default
# Use dummy DiagModule or not
DIAG_DUMMY = 
# Use dummy omp_module or not.
# Set this to "OMP_DUMMY = DUMMY" if compiling without openmp
# Set this to "OMP_DUMMY = " if compiling with openmp
OMP_DUMMY = 
davidbowler commented 3 months ago

That's very odd, @connoraird! Can you upload the output file (if it is produced)?

connoraird commented 3 months ago

Sure, the Conquest_out file is as follows:

    ________________________________________________________________________

                                    CONQUEST                                

                Concurrent Order N QUantum Electronic STructure             
    ________________________________________________________________________

     Conquest lead developers:                                              
      D.R.Bowler (UCL, NIMS), T.Miyazaki (NIMS), A.Nakata (NIMS),           
      L. Truflandier (U. Bordeaux)                                          

     Developers:                                                            
      M.Arita (NIMS), J.S.Baker (UCL), V.Brazdova (UCL), R.Choudhury (UCL), 
      S.Y.Mujahed (UCL), J.T.Poulton (UCL), Z.Raza (NIMS), A.Sena (UCL),    
      U.Terranova (UCL), L.Tong (UCL), A.Torralba (NIMS)                    

     Early development:                                                     
      I.J.Bush (STFC), C.M.Goringe (Keele), E.H.Hernandez (Keele)           

     Original inspiration and project oversight:                            
      M.J.Gillan (Keele, UCL)                                               
    ________________________________________________________________________

      Simulation cell dimensions:    10.3600 a0 x    10.3600 a0 x    10.3600 a0

      Atomic coordinates (a0)
         Atom         X         Y         Z  Species
            1    0.0104    0.0207    0.0311        1
            2    5.1800    5.1800    0.0000        1
            3    5.1800    0.0000    5.1800        1
            4    0.0000    5.1800    5.1800        1
            5    2.5900    2.5900    2.5900        1
            6    7.7700    7.7700    2.5900        1
            7    2.5900    7.7700    7.7700        1
            8    7.7700    2.5900    7.7700        1
    Using a MP mesh for k-points:   2 x   2 x   2 G

    This job was run on  2024/01/11 at 10:46 +0000
    Code was compiled on 2024/01/11 at 10:34 +0000
    Version comment: Git Branch: develop; tag, hash: v1.2-156-ge1759e68

    Job title:                                                                                 
    Job to be run: static calculation

    Ground state search:
      Support functions represented with PAO basis
      1:1 PAO to SF mapping
      Non-spin-polarised electrons
      Solving for the K matrix using diagonalisation 

    Integration grid spacing:  0.288 a0 x 0.288 a0 x 0.288 a0

    Number of species:  1
    --------------------------------------------------------
    |   #  mass (au)  Charge (e)  SF Rad (a0)  NSF  Label  |
    --------------------------------------------------------
    |   1     28.086       4.000        0.576    9  Si     |
    --------------------------------------------------------

    The calculation will be performed on     1 process

    The calculation will be performed on     8 threads

    Using the default matrix multiplication kernel

    The functional used will be GGA PBE96      

  Error in process    1
  make_halo: no. of atoms in halo must be .ge. 1
davidbowler commented 3 months ago

There is a subtle issue with GCC13 we have been chasing intermittently which might cause this. Can you try compiling pseudo_tm_info.f90 without any optimisation: mpif90 -g -c pseudo_tm_info.f90 and then remake the rest of the code and see if that helps? Also worth trying with one thread only (or even without OpenMP).
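
The steps above can be sketched as a small shell script. This is a dry run by default: it prints the commands rather than executing them. The src directory, the helper names (run, rebuild_unoptimised, run_single_thread) and the use of OMP_NUM_THREADS to force one thread are assumptions for illustration; the mpif90 and mpirun commands themselves are the ones quoted in this thread.

```shell
#!/bin/sh
# Sketch only: rebuild one file without optimisation, remake, then run on one thread.
# Assumptions: sources live in ./src, and make rebuilds only out-of-date objects.
# All helper names here are hypothetical.

DRY_RUN=1        # set to 0 to execute the commands instead of printing them
SRC_DIR=src      # assumed location of the Fortran sources

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"        # dry run: show the command
    else
        "$@"             # real run: execute it
    fi
}

rebuild_unoptimised() {
    # -g with no -O flag leaves gfortran at its default optimisation level (-O0).
    # The subshell keeps any directory change local to this function.
    ( run cd "$SRC_DIR" && run mpif90 -g -c "$1" && run make )
}

run_single_thread() {
    # Meant to be run from inside a test directory, e.g.
    # testsuite/test_001_bulk_Si_1proc_Diag, so the relative path resolves.
    run env OMP_NUM_THREADS=1 mpirun -np 1 ../../bin/Conquest
}

rebuild_unoptimised pseudo_tm_info.f90
run_single_thread
```

Because the object file produced by hand is newer than its source, make should leave it alone and rebuild only what depends on it. A later clean build would recompile the file with the usual flags again.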

connoraird commented 3 months ago

After compiling pseudo_tm_info.f90 and remaking as you suggested, I ran ./run_conquest_tests.sh and got the following:

Running tests on 1 processes and 1 threads
Building on SYSTEM local, using makefile system/system.local.make
make: Nothing to be done for `default'.
  Error in process    1
  We need Conquest_input to run !
  Error in process    1
  We need Conquest_input to run !
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
  Proc: [[40839,0],0]
  Errorcode: 1

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
  Error in process    1
  We need Conquest_input to run !
  Error in process    1
  We need Conquest_input to run !
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
  Proc: [[31203,0],0]
  Errorcode: 1

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
============================================================================================================================= test session starts ==============================================================================================================================
platform darwin -- Python 3.11.6, pytest-7.4.3, pluggy-1.3.0
rootdir: /Users/connoraird/work/conquest/CONQUEST-release/testsuite
plugins: anyio-4.0.0
collected 12 items

test_check_output.py ............                                                                                                                                                                                                                                        [100%]

============================================================================================================================== 12 passed in 0.12s ==============================================================================================================================
connoraird commented 3 months ago

My issue has been resolved through @davidbowler's suggestion of compiling pseudo_tm_info.f90 without optimisations, before remaking.

The other errors displayed above were due to directories I had failed to clean up before running the tests.

tkoskela commented 3 months ago

Related to #289

davidbowler commented 3 months ago

Fixed by #302