As @simonpintarelli reported, some of the unit tests arising from the RPA simulation were failing with the GPU backend:
OMP_NUM_THREADS=1 CRAY_CUDA_MPS=1 srun -u -N 1 -n 8 ./miniapp/pxgemm_miniapp -m 43417 -k 2170 -n 217 --test --transpose NN -r 1
Running PDGEMM on the following problem:
=============================
GLOBAL MAT. SIZES
=============================
A = 43417 x 2170
B = 2170 x 217
C = 43417 x 217
=============================
SUBMATRICES
=============================
(ia, ja) = (1, 1)
(ib, jb) = (1, 1)
(ic, jc) = (1, 1)
=============================
SUBMATRIX SIZES
=============================
m = 43417
n = 217
k = 2170
=============================
ADDITIONAL OPTIONS
=============================
alpha = 1
beta = 0
trans_a = N
trans_b = N
=============================
PROC GRID
=============================
grid = 1 x 8
grid order = R
=============================
PROC SRCS
=============================
P_SRC(A) = (0, 0)
P_SRC(B) = (0, 0)
P_SRC(C) = (0, 0)
=============================
BLOCK SIZES
=============================
Blocks(A) = (128, 128)
Blocks(B) = (128, 128)
Blocks(C) = (128, 128)
=============================
LEADING DIMS
=============================
lld_a = 43417
lld_b = 2170
lld_c = 43417
=============================
epsilon = 1e-06, v1 = 42.5759, which is != 528.075
epsilon = 1e-06, v1 = 43.1292, which is != 528.41
COSMA TIMES [ms] = 484
SCALAPACK TIMES [ms] = 571
Result is NOT CORRECT!
The bug occurred only when the GPU backend was used. After a careful analysis, @simonpintarelli and I realized that the problem boils down to the following local multiplications, executed multiple times:
m = 5428, n = 217, k = 2170 alpha = 1, beta = 0, copy_c_back = T, tile sizes = 5000
m = 5427, n = 217, k = 2170 alpha = 1, beta = 0, copy_c_back = T, tile sizes = 5000
The bug was triggered in the GPU backend only when the matrix dimensions were slightly larger than the GPU tile sizes, as described here.
We fixed this bug in the GPU backend in the latest PR.
After updating the Tiled-MM submodule to the latest version, we verified the problem is resolved:
OMP_NUM_THREADS=1 CRAY_CUDA_MPS=1 srun -u -N 1 -n 8 ./miniapp/pxgemm_miniapp -m 43417 -k 2170 -n 217 --test --transpose NN -r 1
Running PDGEMM on the following problem:
=============================
GLOBAL MAT. SIZES
=============================
A = 43417 x 2170
B = 2170 x 217
C = 43417 x 217
=============================
SUBMATRICES
=============================
(ia, ja) = (1, 1)
(ib, jb) = (1, 1)
(ic, jc) = (1, 1)
=============================
SUBMATRIX SIZES
=============================
m = 43417
n = 217
k = 2170
=============================
ADDITIONAL OPTIONS
=============================
alpha = 1
beta = 0
trans_a = N
trans_b = N
=============================
PROC GRID
=============================
grid = 1 x 8
grid order = R
=============================
PROC SRCS
=============================
P_SRC(A) = (0, 0)
P_SRC(B) = (0, 0)
P_SRC(C) = (0, 0)
=============================
BLOCK SIZES
=============================
Blocks(A) = (128, 128)
Blocks(B) = (128, 128)
Blocks(C) = (128, 128)
=============================
LEADING DIMS
=============================
lld_a = 43417
lld_b = 2170
lld_c = 43417
=============================
COSMA TIMES [ms] = 304
SCALAPACK TIMES [ms] = 444
Result is CORRECT!
This has been tested on RTX 3090 GPUs.