
DBCSR: Distributed Block Compressed Sparse Row matrix library
https://cp2k.github.io/dbcsr/
GNU General Public License v2.0

mpich test failure on s390x #703

Open opoplawski opened 11 months ago

opoplawski commented 11 months ago

Describe the bug

I'm working on a Fedora package for dbcsr. I'm getting test failures with mpich on s390x.

To Reproduce

/usr/bin/ctest --test-dir redhat-linux-build-mpich --output-on-failure --force-new-ctest-process -j3
Internal ctest changing into directory: /builddir/build/BUILD/dbcsr-2.6.0/redhat-linux-build-mpich
Test project /builddir/build/BUILD/dbcsr-2.6.0/redhat-linux-build-mpich
      Start  1: dbcsr_perf:inputs/test_H2O.perf
      Start  2: dbcsr_perf:inputs/test_rect1_dense.perf
      Start  3: dbcsr_perf:inputs/test_rect1_sparse.perf
 1/19 Test  #3: dbcsr_perf:inputs/test_rect1_sparse.perf ..............***Failed    2.10 sec
 DBCSR| CPU Multiplication driver                                           BLAS (D)
 DBCSR| Multrec recursion limit                                              512 (D)
 DBCSR| Multiplication stack size                                           1000 (D)
 DBCSR| Maximum elements for images                                    UNLIMITED (D)
 DBCSR| Multiplicative factor virtual images                                   1 (D)
 DBCSR| Use multiplication densification                                       T (D)
 DBCSR| Multiplication size stacks                                             3 (D)
 DBCSR| Use memory pool for CPU allocation                                     F (D)
 DBCSR| Number of 3D layers                                               SINGLE (D)
 DBCSR| Use MPI memory allocation                                              F (D)
 DBCSR| Use RMA algorithm                                                      F (U)
 DBCSR| Use Communication thread                                               T (D)
 DBCSR| Communication thread load                                            100 (D)
 DBCSR| MPI: My process id                                                     0
 DBCSR| MPI: Number of processes                                               2
 DBCSR| OMP: Current number of threads                                         2
 DBCSR| OMP: Max number of threads                                             2
 DBCSR| Split modifier for TAS multiplication algorithm                  1.0E+00 (D)
 numthreads           2
 numnodes           2
 matrix_sizes        5000        1000        1000
 sparsities  0.90000000000000002       0.90000000000000002       0.90000000000000002     
 trans NN
 symmetries NNN
 type            3
 alpha_in   1.0000000000000000        0.0000000000000000     
 beta_in   1.0000000000000000        0.0000000000000000     
 limits           1        5000           1        1000           1        1000
 retain_sparsity F
 nrep          10
 bs_m           1           5
 bs_n           1           5
 bs_k           1           5
 *******************************************************************************
 * MPI error 5843983 in mpi_barrier @ mp_sync : Other MPI error, error stack: *
 * internal_Barrier(84).......................: MPI_Barrier(comm=0x84000001)  *
 *                                              failed                        *
 * MPID_Barrier(167)..........................:                               *
 * MPIDI_Barrier_allcomm_composition_json(132):                               *
 * MPIDI_POSIX_mpi_bcast(219).................:                               *
 * MPIDI_POSIX_mpi_bcast_release_gather(132)..:                               *
 * MPIDI_POSIX_mpi_release_gather_release(218): message sizes do not match    *
 * across processes in the collective routine: Received 0 but expected 1      *
 *                                                    dbcsr_mpiwrap.F:1186    *
 *******************************************************************************
 ===== Routine Calling Stack ===== 
            4 mp_sync
            3 perf_multiply
            2 dbcsr_perf_multiply_low
            1 dbcsr_performance_driver
Abort(1) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
STOP 1

I don't see test failures with OpenMPI. One difference is that mpich is being built with -DUSE_MPI_F08=ON.
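For anyone trying to reproduce this, here is a hedged sketch of the configure-and-rerun cycle: the build-directory name is taken from the log above, -DUSE_MPI_F08=ON is the flag discussed in this thread, and the actual Fedora RPM macros add many more flags than shown here.

```shell
# Configure an out-of-tree mpich build with the F08 bindings enabled
# (run from the dbcsr-2.6.0 source directory, with the mpich module loaded).
cmake -S . -B redhat-linux-build-mpich -DUSE_MPI_F08=ON
cmake --build redhat-linux-build-mpich -j"$(nproc)"

# Rerun only the failing perf test, printing full output on failure.
ctest --test-dir redhat-linux-build-mpich \
      --output-on-failure \
      -R 'test_rect1_sparse'
```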


alazzaro commented 8 months ago

I've realized that we are not testing with MPI_F08 in our CI; however, we did run a test here https://github.com/cp2k/dbcsr/issues/661#issuecomment-1621787249 and it worked. The only difference was GCC 13.1. I will add the test to the CI. In the meantime, I see some actions here:

  1. Could you build with F08 and OpenMPI?
  2. Any chance you can use GCC 13.1 and mpich with F08 in DBCSR?

opoplawski commented 8 months ago

I've enabled -DUSE_MPI_F08=ON for the OpenMPI builds as well. Scratch builds are here (for a week or two):

F40 - gcc 13.2.1 mpich 4.1.2 - https://koji.fedoraproject.org/koji/taskinfo?taskID=110306721

Tests are still failing.

We are stuck with the version of the compiler in the distribution, which is 13.2.1 in all current Fedora releases.

Interestingly, though, the tests are succeeding on F38:

https://koji.fedoraproject.org/koji/taskinfo?taskID=110306885

which is with mpich 4.0.3. So maybe it's more of an mpich issue than a DBCSR one, though mpich's own basic test suite is passing.

Also different: openblas 0.3.21 -> 0.3.25
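Since the suspicion is an mpich regression rather than a DBCSR bug, a standalone reproducer might help narrow it down. A hedged sketch, outside any DBCSR code: the file and binary names are made up, it assumes mpicc/mpiexec from the mpich build under test are on PATH, and the compile/run step is skipped when they are not.

```shell
# Write a minimal barrier stress test; repeated MPI_Barrier calls exercise
# mpich's POSIX release_gather path, which the error stack above points at.
cat > barrier_check.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < 1000; ++i)
        MPI_Barrier(MPI_COMM_WORLD);
    printf("rank %d: 1000 barriers completed\n", rank);
    MPI_Finalize();
    return 0;
}
EOF

# Compile and run on two ranks, mirroring the failing ctest configuration.
if command -v mpicc >/dev/null 2>&1; then
    mpicc barrier_check.c -o barrier_check
    mpiexec -n 2 ./barrier_check
else
    echo "mpicc not found; skipping compile/run"
fi
```

If this loop fails the same way under mpich 4.1.2 on s390x but passes under 4.0.3, that would point squarely at mpich and give its developers a small test case to work with.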