cp2k / cp2k

Quantum chemistry and solid state physics software package
https://www.cp2k.org
GNU General Public License v2.0

dbcsr crashes with too many communicators in nmr calculation #2423

Open recohen opened 1 year ago

recohen commented 1 year ago

For the latest development code I get:


I have tried different numbers of processors, but it usually fails with this error in DBCSR.

Ron Cohen

alazzaro commented 1 year ago

Which MPI is this? Does it fail at the start-up of the application or after a while? Can you share an input we can try?

alazzaro commented 1 year ago

Duplicate of https://github.com/cp2k/cp2k/issues/546

recohen commented 1 year ago

Thanks! I don't know if this is a duplicate of #546, because that was fixed and closed. This is with Intel MPI. The code was compiled with 20.1 but is being run under oneAPI. I wonder if that is the problem.

recohen commented 1 year ago

To clarify, I was loading the correct 20.1 module, but I still had an environment setting for oneAPI. I have removed the oneAPI line now and will see if that helps.

module load intel/20.1
source /central/software/Intel/oneapi/setvars.sh intel64

recohen commented 1 year ago

OK--it seems on this machine I have to load the oneapi runtime for the job to start even though compiled with intel/20.1.

recohen commented 1 year ago

This seems to be a serious problem: almost none of my jobs complete, even when I don't use oneAPI at all. I get:

INRES| Writing response functions to the restart file
LINRES| Writing response functions to the restart file

               *** Start NMR Chemical Shift Calculation ***

     Inizialization of the NMR environment

NMR| Shift gapw radius (a.u.) 1.133836E+02
NMR| Shift factor (ppm) 1.366192E-02
NMR| Shift factor gapw (ppm) 5.325134E+01
NMR| Chi factor (SI) 1.972757E+01
NMR| Conversion Chi (ppm/cgs) 6.022045E-02
NMR| Conversion Chi to Shift 1.450429E-03
NMR| Shielding tensor computed for 162 atoms
Integrated j_xx(r): G-space= -0.9726573865204186E+03 R-space= -0.9726573865204186E+03
Integrated j_yx(r): G-space= -0.5214930267259292E+03 R-space= -0.5214930267259292E+03
Integrated j_zx(r): G-space= -0.8036280255394096E+03 R-space= -0.8036280255394095E+03
calculate_jrho_atom_coeff: nbr_dbl=0.33E+06
calculate_jrho_atom_coeff: nbr_dbl=0.33E+06
calculate_jrho_atom_coeff: nbr_dbl=0.33E+06
CheckSum R-integrated j= 0.1365223561146258E+04


This is using: source /central/software/Intel/2020.1/compilers_and_libraries_2020/linux/mpi/intel64/bin/mpivars.sh

Ron

alazzaro commented 1 year ago

I have no clue... I suggest re-reading my suggestions at https://github.com/cp2k/cp2k/issues/546#issuecomment-612525643 , especially the one about printing how many sub-communicators we are creating. Another option is to try another MPI implementation and see if it works.

recohen commented 1 year ago

It is very strange. I first just added a print statement as you recommended, but the number of communicators seemed to be only 4 or 5, and it produced too much output. So instead I added the following to src/mpi/dbcsr_mpiwrap.F:

diff --git a/src/mpi/dbcsr_mpiwrap.F b/src/mpi/dbcsr_mpiwrap.F
index 9785d41..cd43c7d 100644
--- a/src/mpi/dbcsr_mpiwrap.F
+++ b/src/mpi/dbcsr_mpiwrap.F
@@ -1030,6 +1030,9 @@ CONTAINS
       INTEGER, INTENT(IN) :: comm
       LOGICAL, DIMENSION(:), CONTIGUOUS, INTENT(IN) :: rdim
       INTEGER, INTENT(OUT) :: sub_comm
+#if defined(__parallel)

@@ -1040,6 +1043,10 @@ CONTAINS

       sub_comm = 0

 #if defined(__parallel)

It seems this should print if the number of communicators goes over 100. Nevertheless, the job fails saying there are 0 communicators left:

  189        PCG       F         0.16E-01      0.0000000830        8.76
  190        PCG       F         0.11E-02      0.0000000268        8.80

Ron
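The observation above (only 4 or 5 communicators in use, yet the run eventually reports none left) depends on the difference between communicators that are live at one moment and communicators created over the whole run. The following is a minimal sketch of that bookkeeping in plain Python, with no MPI calls. The names `CommTracker` and `simulate` are hypothetical and not part of DBCSR or CP2K, and the threshold of 100 is taken from the comment above. It only illustrates why a threshold-triggered print can stay silent while the cumulative count keeps growing; it does not show what Intel MPI does internally.

```python
class CommTracker:
    """Bookkeeping for communicator handles (hypothetical helper, no MPI calls)."""

    def __init__(self, warn_above=100):
        self.warn_above = warn_above  # report only when the live count exceeds this
        self.live = 0                 # created and not yet freed
        self.created_total = 0        # cumulative creations over the whole run
        self.peak = 0                 # highest live count seen so far
        self.warnings = []            # messages that a print statement would emit

    def on_create(self):
        self.live += 1
        self.created_total += 1
        self.peak = max(self.peak, self.live)
        if self.live > self.warn_above:
            self.warnings.append(
                f"live communicators: {self.live} "
                f"(created so far: {self.created_total})"
            )

    def on_free(self):
        if self.live == 0:
            raise RuntimeError("free without a matching create")
        self.live -= 1


def simulate(iterations, subs_per_iteration, leak_every=0, warn_above=100):
    """Create and free sub-communicators each iteration.

    If leak_every > 0, one handle is left unfreed every leak_every iterations.
    """
    tracker = CommTracker(warn_above)
    for i in range(1, iterations + 1):
        for _ in range(subs_per_iteration):
            tracker.on_create()
        to_free = subs_per_iteration
        if leak_every and i % leak_every == 0:
            to_free -= 1  # simulate one missing free
        for _ in range(to_free):
            tracker.on_free()
    return tracker
```

With balanced create/free calls the live count never exceeds the per-iteration number, so the threshold message never fires even after thousands of creations. If the run still exhausts communicators in that situation, the cumulative count (`created_total` here) is the more informative number to print.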

alazzaro commented 1 year ago

The error is at line 775 when it is freeing the communicators? I wonder if there is a cumulative limit on that... On the other hand, maybe we can think of a way to cache those subcommunicators rather than creating/destroying them every time. Is there any way you can share the CP2K input for reproducing the problem?
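The caching idea could look roughly like the sketch below, written in plain Python so that it runs without MPI. Everything here is an assumption for illustration: `SubCommCache`, `create_fn` and `free_fn` are hypothetical names, and the cache key (parent communicator handle plus the `rdim` mask) is simply modelled on the `comm` and `rdim` arguments visible in the diff earlier in the thread. It is not how DBCSR is implemented.

```python
class SubCommCache:
    """Reuse sub-communicators instead of creating and freeing them each time.

    create_fn(comm, rdim) and free_fn(sub_comm) are stand-ins (assumptions)
    for whatever routines actually create and free a sub-communicator.
    """

    def __init__(self, create_fn, free_fn):
        self._create = create_fn
        self._free = free_fn
        self._cache = {}

    def get(self, comm, rdim):
        # Normalise the mask to a hashable tuple so it can be a dict key.
        key = (comm, tuple(bool(x) for x in rdim))
        if key not in self._cache:
            self._cache[key] = self._create(comm, key[1])
        return self._cache[key]

    def clear(self):
        # Free every cached handle once; call this before the parent is freed.
        for sub_comm in self._cache.values():
            self._free(sub_comm)
        self._cache.clear()


def demo(iterations=190):
    """Count how many creations and frees a cached run needs."""
    created, freed = [], []

    def fake_create(comm, rdim):
        created.append((comm, rdim))
        return len(created)  # fake integer handle

    cache = SubCommCache(fake_create, freed.append)
    for _ in range(iterations):
        row = cache.get(0, [True, False])
        col = cache.get(0, [False, True])
    cache.clear()
    return len(created), len(freed), row != col
```

In this sketch the number of creations depends only on how many distinct (parent, mask) pairs are requested, not on the iteration count. The trade-off is lifetime management: cached handles stay valid only while the parent communicator exists, so the cache has to be cleared before the parent is freed.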

recohen commented 1 year ago

Here is the input. Note that the failure occurs with more processors, such as 256 or 128. With 64 it might work OK. Thank you!

Sincerely,

Ron

toomanycomm2.tar.gz