eth-cscs / COSMA

Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm
BSD 3-Clause "New" or "Revised" License

Fixed the wrong results bug in the GPU backend. #139

Open kabicm opened 9 months ago

kabicm commented 9 months ago

As @simonpintarelli reported, some of the unit tests arising from the RPA simulation were failing with the GPU backend:

OMP_NUM_THREADS=1 CRAY_CUDA_MPS=1 srun -u -N 1 -n 8 ./miniapp/pxgemm_miniapp -m 43417 -k 2170 -n 217 --test --transpose NN -r 1

Running PDGEMM on the following problem:
=============================
      GLOBAL MAT. SIZES
=============================
A = 43417 x 2170
B = 2170 x 217
C = 43417 x 217
=============================
        SUBMATRICES
=============================
(ia, ja) = (1, 1)
(ib, jb) = (1, 1)
(ic, jc) = (1, 1)
=============================
      SUBMATRIX SIZES
=============================
m = 43417
n = 217
k = 2170
=============================
      ADDITIONAL OPTIONS
=============================
alpha = 1
beta = 0
trans_a = N
trans_b = N
=============================
         PROC GRID
=============================
grid = 1 x 8
grid order = R
=============================
         PROC SRCS
=============================
P_SRC(A) = (0, 0)
P_SRC(B) = (0, 0)
P_SRC(C) = (0, 0)
=============================
          BLOCK SIZES
=============================
Blocks(A) = (128, 128)
Blocks(B) = (128, 128)
Blocks(C) = (128, 128)
=============================
          LEADING DIMS
=============================
lld_a = 43417
lld_b = 2170
lld_c = 43417
=============================

epsilon = 1e-06, v1 = 42.5759, which is != 528.075
epsilon = 1e-06, v1 = 43.1292, which is != 528.41
COSMA TIMES [ms] = 484
SCALAPACK TIMES [ms] = 571
Result is NOT CORRECT!
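The two mismatch lines above report pairs of values that differ by far more than epsilon. As a rough illustration of what such a check does, here is a minimal sketch that assumes an absolute-difference criterion; the function name `mismatches` is hypothetical and the miniapp's actual comparison may differ:

```python
def mismatches(a, b, epsilon=1e-6):
    """Return (index, a_i, b_i) for every pair whose absolute difference exceeds epsilon.

    Assumption: an absolute-difference test; this is not taken from the miniapp source.
    """
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if abs(x - y) > epsilon]


# The two value pairs reported in the failing run above.
bad = mismatches([42.5759, 43.1292], [528.075, 528.41])
print(len(bad))
```

Both pairs fail the check, which matches the "Result is NOT CORRECT!" verdict.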

The bug occurred only when the GPU backend was used. After careful analysis, @simonpintarelli and I realized that the problem boils down to the following local multiplications, each executed multiple times:

m = 5428, n = 217, k = 2170, alpha = 1, beta = 0, copy_c_back = T, tile sizes = 5000
m = 5427, n = 217, k = 2170, alpha = 1, beta = 0, copy_c_back = T, tile sizes = 5000
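The local sizes m = 5428 and m = 5427 are consistent with the 43417 rows being divided as evenly as possible among the 8 ranks. The sketch below only checks that arithmetic; the helper `even_split` is hypothetical, and COSMA's actual decomposition strategy is not described in this PR:

```python
def even_split(total, parts):
    """Split `total` items into `parts` chunks whose sizes differ by at most 1."""
    base, extra = divmod(total, parts)
    # The first `extra` chunks get one additional item.
    return [base + 1 if i < extra else base for i in range(parts)]


local_m = even_split(43417, 8)
print(local_m)  # [5428, 5427, 5427, 5427, 5427, 5427, 5427, 5427]
```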

The bug occurred in the GPU backend only when a matrix dimension was slightly larger than the GPU tile size, as described here. In this case, m = 5428 and m = 5427 only slightly exceed the tile size of 5000.
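To see why these sizes are an edge case, consider a naive fixed-size tiling of the m dimension: one full tile of 5000 rows is followed by a much smaller remainder tile. This is a minimal sketch under that assumption; `tile_extents` is a hypothetical helper and does not reproduce the Tiled-MM implementation:

```python
def tile_extents(dim, tile):
    """Return (offset, length) pairs covering [0, dim) in chunks of at most `tile`."""
    return [(start, min(tile, dim - start)) for start in range(0, dim, tile)]


for m in (5428, 5427):
    print(m, tile_extents(m, 5000))
# 5428 [(0, 5000), (5000, 428)]
# 5427 [(0, 5000), (5000, 427)]
```

With the same tile size, n = 217 and k = 2170 each fit in a single tile, so only the m dimension produces a remainder tile here.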

We fixed this bug in the GPU backend in the latest PR.

After updating the Tiled-MM submodule to the latest version, we verified the problem is resolved:

OMP_NUM_THREADS=1 CRAY_CUDA_MPS=1 srun -u -N 1 -n 8 ./miniapp/pxgemm_miniapp -m 43417 -k 2170 -n 217 --test --transpose NN -r 1

Running PDGEMM on the following problem:
=============================
      GLOBAL MAT. SIZES
=============================
A = 43417 x 2170
B = 2170 x 217
C = 43417 x 217
=============================
        SUBMATRICES
=============================
(ia, ja) = (1, 1)
(ib, jb) = (1, 1)
(ic, jc) = (1, 1)
=============================
      SUBMATRIX SIZES
=============================
m = 43417
n = 217
k = 2170
=============================
      ADDITIONAL OPTIONS
=============================
alpha = 1
beta = 0
trans_a = N
trans_b = N
=============================
         PROC GRID
=============================
grid = 1 x 8
grid order = R
=============================
         PROC SRCS
=============================
P_SRC(A) = (0, 0)
P_SRC(B) = (0, 0)
P_SRC(C) = (0, 0)
=============================
          BLOCK SIZES
=============================
Blocks(A) = (128, 128)
Blocks(B) = (128, 128)
Blocks(C) = (128, 128)
=============================
          LEADING DIMS
=============================
lld_a = 43417
lld_b = 2170
lld_c = 43417
=============================

COSMA TIMES [ms] = 304
SCALAPACK TIMES [ms] = 444
Result is CORRECT!

This was tested on RTX 3090 GPUs.

simonpintarelli commented 9 months ago

cscs-ci run P100

simonpintarelli commented 9 months ago

cscs-ci run P100