eth-cscs / COSMA

Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm
BSD 3-Clause "New" or "Revised" License

COSMA crash on Perlmutter when dealing with complex values #112

Closed · yaoyi92 closed 2 years ago

yaoyi92 commented 2 years ago

Dear COSMA developers,

I am able to install COSMA on the Perlmutter computer and it works fine with float/double numbers. However, when I try to use complex numbers (zfloat/zdouble), the code crashes in some MPI calls; it seems MPI cannot recognize the complex datatype.

I am able to reproduce the crash with cosma_miniapp. The error output, run script, and compilation commands are listed below.

Best wishes, Yi

The error

Warning: Failed writing log files to directory [/var/log/nvidia-mps]. No logs will be available.
An instance of this daemon is already running
Strategy = Matrix dimensions (m, n, k) = (746, 746, 746)
Number of processors: 27
Overlap of communication and computation: OFF.
Divisions strategy: 
parallel (m / 3)
parallel (n / 3)
parallel (k / 3)
Required memory per rank (in #elements): 104964
Available memory per rank (in #elements): not specified (assumed: infinite)

MPICH ERROR [Rank 1] [job id 2624725.5] [Fri Jul  8 07:31:35 2022] [nid001516] - Abort(873010691) (rank 1 in comm 0): Fatal error in PMPI_Allgatherv: Invalid datatype, error stack:
PMPI_Allgatherv(495): MPI_Allgatherv(sbuf=0x24f6f680, scount=20418, MPI_DATATYPE_NULL, rbuf=0x25242180, rcounts=0x255e71a0, displs=0x255e71c0, datatype=MPI_DATATYPE_NULL, comm=comm=0x84000003) failed
PMPI_Allgatherv(421): Datatype for argument sendtype is a null datatype

aborting job:
Fatal error in PMPI_Allgatherv: Invalid datatype, error stack:
PMPI_Allgatherv(495): MPI_Allgatherv(sbuf=0x24f6f680, scount=20418, MPI_DATATYPE_NULL, rbuf=0x25242180, rcounts=0x255e71a0, displs=0x255e71c0, datatype=MPI_DATATYPE_NULL, comm=comm=0x84000003) failed
PMPI_Allgatherv(421): Datatype for argument sendtype is a null datatype
MPICH ERROR [Rank 2] [job id 2624725.5] [Fri Jul  8 07:31:35 2022] [nid001516] - Abort(403248643) (rank 2 in comm 0): Fatal error in PMPI_Allgatherv: Invalid datatype, error stack:
PMPI_Allgatherv(495): MPI_Allgatherv(sbuf=0x24f6f770, scount=20418, MPI_DATATYPE_NULL, rbuf=0x25242270, rcounts=0x255e7290, displs=0x255e72b0, datatype=MPI_DATATYPE_NULL, comm=comm=0x84000003) failed
PMPI_Allgatherv(421): Datatype for argument sendtype is a null datatype

aborting job:
Fatal error in PMPI_Allgatherv: Invalid datatype, error stack:
PMPI_Allgatherv(495): MPI_Allgatherv(sbuf=0x24f6f770, scount=20418, MPI_DATATYPE_NULL, rbuf=0x25242270, rcounts=0x255e7290, displs=0x255e72b0, datatype=MPI_DATATYPE_NULL, comm=comm=0x84000003) failed
PMPI_Allgatherv(421): Datatype for argument sendtype is a null datatype
MPICH ERROR [Rank 3] [job id 2624725.5] [Fri Jul  8 07:31:35 2022] [nid001516] - Abort(470357251) (rank 3 in comm 0): Fatal error in PMPI_Allgather: Invalid datatype, error stack:
PMPI_Allgather(425): MPI_Allgather(sbuf=0x24f6e8d0, scount=20584, MPI_DATATYPE_NULL, rbuf=0x252413d0, rcount=20584, datatype=MPI_DATATYPE_NULL, comm=comm=0x84000003) failed
PMPI_Allgather(375): Datatype for argument sendtype is a null datatype

aborting job:
Fatal error in PMPI_Allgather: Invalid datatype, error stack:
PMPI_Allgather(425): MPI_Allgather(sbuf=0x24f6e8d0, scount=20584, MPI_DATATYPE_NULL, rbuf=0x252413d0, rcount=20584, datatype=MPI_DATATYPE_NULL, comm=comm=0x84000003) failed
PMPI_Allgather(375): Datatype for argument sendtype is a null datatype
MPICH ERROR [Rank 5] [job id 2624725.5] [Fri Jul  8 07:31:35 2022] [nid001516] - Abort(201921795) (rank 5 in comm 0): Fatal error in PMPI_Allgather: Invalid datatype, error stack:
PMPI_Allgather(425): MPI_Allgather(sbuf=0x24f71780, scount=20667, MPI_DATATYPE_NULL, rbuf=0x25246190, rcount=20667, datatype=MPI_DATATYPE_NULL, comm=comm=0x84000003) failed
PMPI_Allgather(375): Datatype for argument sendtype is a null datatype

aborting job:
Fatal error in PMPI_Allgather: Invalid datatype, error stack:
PMPI_Allgather(425): MPI_Allgather(sbuf=0x24f71780, scount=20667, MPI_DATATYPE_NULL, rbuf=0x25246190, rcount=20667, datatype=MPI_DATATYPE_NULL, comm=comm=0x84000003) failed
PMPI_Allgather(375): Datatype for argument sendtype is a null datatype
MPICH ERROR [Rank 10] [job id 2624725.5] [Fri Jul  8 07:31:35 2022] [nid001516] - Abort(67704323) (rank 10 in comm 0): Fatal error in PMPI_Allgatherv: Invalid datatype, error stack:
PMPI_Allgatherv(495): MPI_Allgatherv(sbuf=0x24f727f0, scount=20667, MPI_DATATYPE_NULL, rbuf=0x25247200, rcounts=0x255edeb0, displs=0x255eded0, datatype=MPI_DATATYPE_NULL, comm=comm=0x84000003) failed
PMPI_Allgatherv(421): Datatype for argument sendtype is a null datatype

aborting job:
Fatal error in PMPI_Allgatherv: Invalid datatype, error stack:
PMPI_Allgatherv(495): MPI_Allgatherv(sbuf=0x24f727f0, scount=20667, MPI_DATATYPE_NULL, rbuf=0x25247200, rcounts=0x255edeb0, displs=0x255eded0, datatype=MPI_DATATYPE_NULL, comm=comm=0x84000003) failed
PMPI_Allgatherv(421): Datatype for argument sendtype is a null datatype
MPICH ERROR [Rank 14] [job id 2624725.5] [Fri Jul  8 07:31:35 2022] [nid001516] - Abort(336139523) (rank 14 in comm 0): Fatal error in PMPI_Allgather: Invalid datatype, error stack:
PMPI_Allgather(425): MPI_Allgather(sbuf=0x24f744d0, scount=20667, MPI_DATATYPE_NULL, rbuf=0x2524ae00, rcount=20667, datatype=MPI_DATATYPE_NULL, comm=comm=0x84000003) failed
PMPI_Allgather(375): Datatype for argument sendtype is a null datatype

aborting job:
Fatal error in PMPI_Allgather: Invalid datatype, error stack:
PMPI_Allgather(425): MPI_Allgather(sbuf=0x24f744d0, scount=20667, MPI_DATATYPE_NULL, rbuf=0x2524ae00, rcount=20667, datatype=MPI_DATATYPE_NULL, comm=comm=0x84000003) failed
PMPI_Allgather(375): Datatype for argument sendtype is a null datatype
MPICH ERROR [Rank 15] [job id 2624725.5] [Fri Jul  8 07:31:35 2022] [nid001516] - Abort(201921795) (rank 15 in comm 0): Fatal error in PMPI_Allgather: Invalid datatype, error stack:
PMPI_Allgather(425): MPI_Allgather(sbuf=0x24f713c0, scount=20584, MPI_DATATYPE_NULL, rbuf=0x25245dd0, rcount=20584, datatype=MPI_DATATYPE_NULL, comm=comm=0x84000003) failed
PMPI_Allgather(375): Datatype for argument sendtype is a null datatype

aborting job:
Fatal error in PMPI_Allgather: Invalid datatype, error stack:
PMPI_Allgather(425): MPI_Allgather(sbuf=0x24f713c0, scount=20584, MPI_DATATYPE_NULL, rbuf=0x25245dd0, rcount=20584, datatype=MPI_DATATYPE_NULL, comm=comm=0x84000003) failed
PMPI_Allgather(375): Datatype for argument sendtype is a null datatype
MPICH ERROR [Rank 17] [job id 2624725.5] [Fri Jul  8 07:31:35 2022] [nid001516] - Abort(738792707) (rank 17 in comm 0): Fatal error in PMPI_Allgather: Invalid datatype, error stack:
PMPI_Allgather(425): MPI_Allgather(sbuf=0x24f745c0, scount=20667, MPI_DATATYPE_NULL, rbuf=0x2524aef0, rcount=20667, datatype=MPI_DATATYPE_NULL, comm=comm=0x84000003) failed
PMPI_Allgather(375): Datatype for argument sendtype is a null datatype

aborting job:
Fatal error in PMPI_Allgather: Invalid datatype, error stack:
PMPI_Allgather(425): MPI_Allgather(sbuf=0x24f745c0, scount=20667, MPI_DATATYPE_NULL, rbuf=0x2524aef0, rcount=20667, datatype=MPI_DATATYPE_NULL, comm=comm=0x84000003) failed
PMPI_Allgather(375): Datatype for argument sendtype is a null datatype
MPICH ERROR [Rank 18] [job id 2624725.5] [Fri Jul  8 07:31:35 2022] [nid001516] - Abort(873010691) (rank 18 in comm 0): Fatal error in PMPI_Allgatherv: Invalid datatype, error stack:
PMPI_Allgatherv(495): MPI_Allgatherv(sbuf=0x24f6f2d0, scount=20584, MPI_DATATYPE_NULL, rbuf=0x25241dd0, rcounts=0x255e82b0, displs=0x255e82d0, datatype=MPI_DATATYPE_NULL, comm=comm=0x84000003) failed
PMPI_Allgatherv(421): Datatype for argument sendtype is a null datatype

aborting job:
Fatal error in PMPI_Allgatherv: Invalid datatype, error stack:
PMPI_Allgatherv(495): MPI_Allgatherv(sbuf=0x24f6f2d0, scount=20584, MPI_DATATYPE_NULL, rbuf=0x25241dd0, rcounts=0x255e82b0, displs=0x255e82d0, datatype=MPI_DATATYPE_NULL, comm=comm=0x84000003) failed
PMPI_Allgatherv(421): Datatype for argument sendtype is a null datatype
MPICH ERROR [Rank 19] [job id 2624725.5] [Fri Jul  8 07:31:35 2022] [nid001516] - Abort(269030915) (rank 19 in comm 0): Fatal error in PMPI_Allgatherv: Invalid datatype, error stack:
PMPI_Allgatherv(495): MPI_Allgatherv(sbuf=0x24f727b0, scount=20667, MPI_DATATYPE_NULL, rbuf=0x252471c0, rcounts=0x255edf80, displs=0x255edfa0, datatype=MPI_DATATYPE_NULL, comm=comm=0x84000003) failed
PMPI_Allgatherv(421): Datatype for argument sendtype is a null datatype

The script to run cosma_miniapp

module load PrgEnv-gnu >/dev/null
#module load gcc/10.3.0 >/dev/null
module load cudatoolkit/11.2 >/dev/null
module load craype-accel-nvidia80 >/dev/null

# newer
module load craype/2.7.15 >/dev/null
module load cray-mpich/8.1.15 >/dev/null

# CUDA
export CRAY_ACCEL_TARGET=nvidia80
export LIBRARY_PATH="${CUDATOOLKIT_HOME}/../../math_libs/lib64/:$LIBRARY_PATH"
export LD_LIBRARY_PATH="${CUDATOOLKIT_HOME}/../../math_libs/lib64/:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="${CUDATOOLKIT_HOME}/lib64/:$LD_LIBRARY_PATH"
#export LD_LIBRARY_PATH=/global/homes/y/yyao_unc/software/lib/scalapack-2.2.0:$LD_LIBRARY_PATH
#export LD_LIBRARY_PATH="${LIBSCI_BASE_DIR}/cray/90/x86_64/lib" #‐l sci_cray_mpi ‐l sci_cray
export CPATH="${CUDATOOLKIT_HOME}/../../math_libs/include:$CPATH"
export CUDA_PATH="${CUDATOOLKIT_HOME}/../../math_libs/lib64/:$CUDA_PATH"

export COSMA_GPU_MEMORY_PINNING=ON
export COSMA_GPU_STREAMS=1
export COSMA_GPU_MAX_TILE_M=500
export COSMA_GPU_MAX_TILE_N=500
export COSMA_GPU_MAX_TILE_K=500

 ulimit -s unlimited

 nvidia-cuda-mps-control -d
 export SLURM_CPU_BIND="cores"
 export OMP_NUM_THREADS=1
 #export SLURM_CPU_BIND="threads"
# AIMS=/global/homes/y/yyao_unc/software/FHIaims/build_gw_gpu_wcosma_2/aims.220309.scalapack.mpi.x
 SLURM_NTASKS=64
 NUM_CORES=$SLURM_NTASKS

srun -n $NUM_CORES ./cosma_miniapp -m 746 -n 746 -k 746 --type=zdouble

The script to build COSMA


 # module load craype/2.7.13 >/dev/null
 #module load gcc/10.3.0 >/dev/null
 # module load cray-mpich/8.1.13 >/dev/null
 module load cudatoolkit/11.2 >/dev/null
 module load craype-accel-nvidia80 >/dev/null

 # newer
 module load craype/2.7.15 >/dev/null
 # module load gcc/11.2.0 >/dev/null
 module load cray-mpich/8.1.15 >/dev/null

 export CUDA_PATH=$CUDA_HOME:/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/math_libs/lib64

 export CC=cc
 export CXX=CC
 export CXXFLAGS=-std=c++11
 cmake -DCOSMA_BLAS=CUDA -DCOSMA_SCALAPACK=CRAY_LIBSCI -DCMAKE_INSTALL_PREFIX=/global/homes/y/yyao_unc/software/COSMA-v2.5.1/build_3/install_yy ..
 make VERBOSE=0 -j 16
 make install
kabicm commented 2 years ago

Thanks, Yi, for the detailed error report.

Can you try using MPI_C_FLOAT_COMPLEX instead of MPI_CXX_FLOAT_COMPLEX (and similarly for the double type) at these lines: https://github.com/eth-cscs/COSMA/blob/783803e9a48944a16c9b95db0b027955b2594755/src/cosma/mpi_mapper.hpp#L30 https://github.com/eth-cscs/COSMA/blob/783803e9a48944a16c9b95db0b027955b2594755/src/cosma/mpi_mapper.hpp#L35
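
For illustration, the idea is roughly the following (a hand-written sketch with hypothetical names, not the actual contents of mpi_mapper.hpp):

```cpp
// Sketch only: a hypothetical type-mapping helper illustrating the proposed
// change, i.e. mapping std::complex<T> onto the MPI_C_* complex datatypes
// instead of the MPI_CXX_* ones. Names are illustrative, not COSMA's.
#include <complex>
#include <mpi.h>

template <typename Scalar>
struct mpi_datatype_of;

template <>
struct mpi_datatype_of<std::complex<float>> {
    // previously MPI_CXX_FLOAT_COMPLEX
    static MPI_Datatype get() { return MPI_C_FLOAT_COMPLEX; }
};

template <>
struct mpi_datatype_of<std::complex<double>> {
    // previously MPI_CXX_DOUBLE_COMPLEX
    static MPI_Datatype get() { return MPI_C_DOUBLE_COMPLEX; }
};
```

In practice std::complex<T> and the corresponding C complex type have the same layout (two consecutive values of type T), so the MPI_C_* datatypes describe the same bytes.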

yaoyi92 commented 2 years ago

Yes, I can confirm the modification solves the problem. Thank you very much for the quick reply.

kabicm commented 2 years ago

Great to hear that! Although I am still confused about why this is a problem. @rasolca, do you know why MPI_C works and MPI_CXX doesn't?

@yaoyi92 keep in mind that you can also use COSMA with the GPU-aware MPI or NCCL backends, as described in the README; those should be much more performant! This is the biggest change in this version.

yaoyi92 commented 2 years ago

Thanks! I will check them out.

kabicm commented 2 years ago

@yaoyi92 if it's not a problem for you, can you also try keeping the MPI_CXX prefixes but modifying the CMake call at https://github.com/eth-cscs/COSMA/blob/783803e9a48944a16c9b95db0b027955b2594755/CMakeLists.txt#L124 to find_package(MPI COMPONENTS C CXX REQUIRED)?

Maybe this was the problem?

rasolca commented 2 years ago

It is not a CMake problem. It is a problem with newer versions of Cray MPICH: MPI_CXX_FLOAT_COMPLEX, MPI_CXX_DOUBLE_COMPLEX, and MPI_CXX_BOOL are not set, i.e. they are null datatypes, even though the MPI standard requires them even if the C++ bindings are not provided. We opened a ticket about it some time ago, but there is still no solution from the HPE side.
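
For reference, this can be verified directly on a given installation with a small standalone check (a sketch, not part of COSMA). On an affected cray-mpich, the MPI_CXX_* handles compare equal to MPI_DATATYPE_NULL, which is exactly what the PMPI_Allgather(v) errors above report:

```cpp
// Minimal check of whether the MPI_CXX_* complex datatypes are usable.
// On an affected cray-mpich build the first two comparisons print 1 (null),
// while the MPI_C_* datatype is valid.
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    std::printf("MPI_CXX_FLOAT_COMPLEX  null? %d\n",
                (int)(MPI_CXX_FLOAT_COMPLEX == MPI_DATATYPE_NULL));
    std::printf("MPI_CXX_DOUBLE_COMPLEX null? %d\n",
                (int)(MPI_CXX_DOUBLE_COMPLEX == MPI_DATATYPE_NULL));
    std::printf("MPI_C_DOUBLE_COMPLEX   null? %d\n",
                (int)(MPI_C_DOUBLE_COMPLEX == MPI_DATATYPE_NULL));
    MPI_Finalize();
    return 0;
}
```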

kabicm commented 2 years ago

@rasolca do you then propose to keep the MPI_C types in the code as a temporary solution?

rasolca commented 2 years ago

In general, no, but it is needed on Cray EX systems.

kabicm commented 2 years ago

Alright! As a temporary solution, we modified this in commit 5e71fac until cray-mpich fixes it.