recohen commented 1 year ago

For the latest development code I get:

MPI error 672736783 in mpi_cart_sub @ mp_cart_sub : Other MPI error, error stack:
PMPI_Cart_sub(224)..................: MPI_Cart_sub(comm=0xc404000d, remain_dims=0x7f200d285080, comm_new=0x2acb5cba9ba4) failed
PMPI_Cart_sub(162)..................:
MPIR_Comm_split_impl(246)...........:
MPIR_Get_contextid_sparse_group(602): Too many communicators (0/32768 free on this process; ignore_id=0)

dbcsr_mpiwrap.F:775

===== Routine Calling Stack =====

   14 mp_cart_sub
   13 dbcsr_complete_redistribute
   12 copy_dbcsr_to_fm
   11 cp_dbcsr_sm_fm_multiply
   10 apply_op_1
    9 apply_op
    8 linres_solver
    7 current_response
    6 linres_calculation_low
    5 qs_energies_properties
    4 qs_energies
    3 velocity_verlet
    2 qs_mol_dyn_low
    1 CP2K

Abort(1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

I have tried different numbers of processors, but it usually fails with this error in DBCSR.

Ron Cohen

alazzaro commented 1 year ago

Which MPI is this? Does it fail at the start-up of the application or after a while? Can you share an input we can try?

alazzaro commented 1 year ago

Duplicate of https://github.com/cp2k/cp2k/issues/546

recohen commented 1 year ago

Thanks! I don't know if this is a duplicate of #546 because that was fixed and closed. This is with Intel MPI. The code was compiled with 20.1, but is being run under oneAPI. I wonder if that is the problem.

recohen commented 1 year ago

To clarify, I was loading the correct 20.1 module, but I still had an environment setting for oneAPI. I have removed the oneAPI line now and will see if that helps.

    module load intel/20.1
    source /central/software/Intel/oneapi/setvars.sh intel64

recohen commented 1 year ago

OK--it seems on this machine I have to load the oneapi runtime for the job to start even though compiled with intel/20.1.

recohen commented 1 year ago

This seems to be a serious problem--almost none of my jobs complete even when I don't use oneAPI at all. I get:

LINRES| Writing response functions to the restart file
LINRES| Writing response functions to the restart file

               *** Start NMR Chemical Shift Calculation ***

     Inizialization of the NMR environment

NMR| Shift gapw radius (a.u.) 1.133836E+02
NMR| Shift factor (ppm) 1.366192E-02
NMR| Shift factor gapw (ppm) 5.325134E+01
NMR| Chi factor (SI) 1.972757E+01
NMR| Conversion Chi (ppm/cgs) 6.022045E-02
NMR| Conversion Chi to Shift 1.450429E-03
NMR| Shielding tensor computed for 162 atoms
Integrated j_xx(r): G-space= -0.9726573865204186E+03 R-space= -0.9726573865204186E+03
Integrated j_yx(r): G-space= -0.5214930267259292E+03 R-space= -0.5214930267259292E+03
Integrated j_zx(r): G-space= -0.8036280255394096E+03 R-space= -0.8036280255394095E+03
calculate_jrho_atom_coeff: nbr_dbl=0.33E+06
calculate_jrho_atom_coeff: nbr_dbl=0.33E+06
calculate_jrho_atom_coeff: nbr_dbl=0.33E+06
CheckSum R-integrated j= 0.1365223561146258E+04

MPI error 471410959 in mpi_cart_sub @ mp_cart_sub : Other MPI error, error stack:
PMPI_Cart_sub(222)..................: MPI_Cart_sub(comm=0xc403f3f4, remain_dims=0x7f200b9f2d80, comm_new=0x2ac7ed877684) failed
PMPI_Cart_sub(161)..................:
MPIR_Comm_split_impl(246)...........:
MPIR_Get_contextid_sparse_group(602): Too many communicators (0/32768 free on this process; ignore_id=0)

dbcsr_mpiwrap.F:775

===== Routine Calling Stack =====
```
   11 mp_cart_sub
   10 dbcsr_complete_redistribute
    9 copy_dbcsr_to_fm
    8 cp_dbcsr_sm_fm_multiply
    7 current_build_chi_many_centers
    6 linres_calculation_low
    5 qs_energies_properties
    4 qs_energies
    3 velocity_verlet
    2 qs_mol_dyn_low
    1 CP2K
```
Abort(1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
slurmstepd: error: STEP 30605317.0 ON hpc-21-14 CANCELLED AT 2022-12-01T22:56:42

This is using: source /central/software/Intel/2020.1/compilers_and_libraries_2020/linux/mpi/intel64/bin/mpivars.sh

Ron

alazzaro commented 1 year ago

I have no clue... I suggest re-reading my suggestions at https://github.com/cp2k/cp2k/issues/546#issuecomment-612525643 , especially the one about printing how many sub-communicators we are creating. Another option is to try a different MPI implementation and see if it works.
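As a rough illustration of that kind of bookkeeping, here is a minimal sketch written in plain Python rather than Fortran, with no MPI calls at all. It only models the counting: how many sub-communicators were created, how many were freed, and how many are live at the peak. The names `CommTracker` and `simulate` are hypothetical and are not part of DBCSR or CP2K.

```python
from dataclasses import dataclass


@dataclass
class CommTracker:
    """Hypothetical bookkeeping for communicator creation and release."""

    created: int = 0
    freed: int = 0
    peak_live: int = 0

    @property
    def live(self) -> int:
        # Communicators created but not yet freed.
        return self.created - self.freed

    def on_create(self) -> None:
        # Call right after a successful sub-communicator creation.
        self.created += 1
        self.peak_live = max(self.peak_live, self.live)

    def on_free(self) -> None:
        # Call right after a communicator has been freed.
        if self.live == 0:
            raise RuntimeError("free without matching create")
        self.freed += 1

    def report(self) -> str:
        return (
            f"created={self.created} freed={self.freed} "
            f"live={self.live} peak_live={self.peak_live}"
        )


def simulate(n_calls: int, free_every_call: bool) -> CommTracker:
    """Model n_calls create operations, optionally freeing after each one."""
    tracker = CommTracker()
    for _ in range(n_calls):
        tracker.on_create()
        if free_every_call:
            tracker.on_free()
    return tracker
```

Printing `report()` once at the point of failure, rather than on every call, keeps the output small. If `live` stays low while `created` is large, the application side is balanced and the question moves to how the MPI library recycles its internal IDs.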

recohen commented 1 year ago

It is very strange. I first just added a print statement as you recommended, but the number of communicators seemed to be only 4 or 5 and it made too much output. So instead I added the following to src/mpi/dbcsr_mpiwrap.F:

    diff --git a/src/mpi/dbcsr_mpiwrap.F b/src/mpi/dbcsr_mpiwrap.F
    index 9785d41..cd43c7d 100644
    --- a/src/mpi/dbcsr_mpiwrap.F
    +++ b/src/mpi/dbcsr_mpiwrap.F
    @@ -1030,6 +1030,9 @@ CONTAINS
           INTEGER, INTENT(IN) :: comm
           LOGICAL, DIMENSION(:), CONTIGUOUS, INTENT(IN) :: rdim
           INTEGER, INTENT(OUT) :: sub_comm
    +#if defined(__parallel)
    +      INTEGER :: taskid,gid
    +#endif

           CHARACTER(LEN=*), PARAMETER :: routineN = 'mp_cart_sub'

    @@ -1040,6 +1043,10 @@ CONTAINS

           sub_comm = 0
     #if defined(__parallel)
    +      if(debug_comm_count.gt.100)then
    +         CALL mpi_comm_rank(gid, taskid, ierr)
    +         write(*,*)'taskid,debug_comm_count:',taskid,debug_comm_count
    +      endif
           CALL mpi_cart_sub(comm, rdim, sub_comm, ierr)
           IF (ierr /= 0) CALL mp_stop(ierr, "mpi_cart_sub @ "//routineN)
           debug_comm_count = debug_comm_count + 1

It seems this should print if the number of communicators goes over 100. Nevertheless, the job fails saying there are 0 communicators left:

189        PCG       F         0.16E-01      0.0000000830        8.76
  190        PCG       F         0.11E-02      0.0000000268        8.80

MPI error 672737551 in mpi_cart_sub @ mp_cart_sub : Other MPI error, error stack:
PMPI_Cart_sub(222)..................: MPI_Cart_sub(comm=0xc403f3f2, remain_dims=0x7f200b884580, comm_new=0x2b573521fb84) failed
PMPI_Cart_sub(161)..................:
MPIR_Comm_split_impl(246)...........:
MPIR_Get_contextid_sparse_group(602): Too many communicators (0/32768 free on this process; ignore_id=0)

dbcsr_mpiwrap.F:775

===== Routine Calling Stack =====

   14 mp_cart_sub
   13 dbcsr_complete_redistribute
   12 copy_dbcsr_to_fm
   11 cp_dbcsr_sm_fm_multiply
   10 apply_op_1
    9 apply_op
    8 linres_solver


Ron

alazzaro commented 1 year ago

The error is at line 775 when it is freeing the communicators? I wonder if there is a cumulative limit on that... On the other hand, maybe we can think of a way to cache those subcommunicators rather than creating and destroying them every time. Is there any way you can share the CP2K input for reproducing the problem?
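To make the caching idea concrete, here is a minimal sketch in plain Python. It is a toy model, not MPI and not DBCSR code. `ContextIdPool` is a hypothetical stand-in for the finite pool of IDs that the error message reports (32768 per process), and `SubCommCache` is a hypothetical cache keyed by the parent communicator handle and the `remain_dims` mask. How a real MPI library allocates and recycles these IDs is not modelled here.

```python
from typing import Dict, Tuple


class ContextIdPool:
    """Toy stand-in for a finite pool of communicator context IDs."""

    def __init__(self, capacity: int = 32768) -> None:
        self.capacity = capacity
        self.in_use = 0
        self.total_allocations = 0

    def allocate(self) -> int:
        # Fail the same way the log does once the pool is exhausted.
        if self.in_use >= self.capacity:
            raise RuntimeError("Too many communicators")
        self.in_use += 1
        self.total_allocations += 1
        return self.total_allocations  # opaque handle for the new communicator

    def release(self) -> None:
        self.in_use -= 1


class SubCommCache:
    """Hypothetical cache: one sub-communicator per (parent, remain_dims)."""

    def __init__(self, pool: ContextIdPool) -> None:
        self.pool = pool
        self._cache: Dict[Tuple[int, Tuple[bool, ...]], int] = {}

    def get(self, parent: int, remain_dims: Tuple[bool, ...]) -> int:
        key = (parent, tuple(remain_dims))
        if key not in self._cache:
            # Only the first request for this key allocates anything.
            self._cache[key] = self.pool.allocate()
        return self._cache[key]

    def clear(self) -> None:
        # Release every cached sub-communicator, e.g. at finalization.
        for _ in self._cache:
            self.pool.release()
        self._cache.clear()


pool = ContextIdPool()
cache = SubCommCache(pool)
for _ in range(100_000):
    cache.get(parent=1, remain_dims=(True, False))
    cache.get(parent=1, remain_dims=(False, True))
```

In this model, 200,000 requests lead to only two allocations. A real implementation would also need to drop cache entries when the parent communicator is freed, since a stale entry would otherwise point at a communicator that no longer matches its key.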

recohen commented 1 year ago

Here is the input. Note that the failure occurs with more processors--like 256 or 128. With 64 it might work OK. Thank you!

Sincerely,

Ron

toomanycomm2.tar.gz


dbcsr crashes with too many communicators in nmr calculation #2423
