SuperScientificSoftwareLaboratory / TileSpGEMM

Source code of the PPoPP '22 paper: "TileSpGEMM: A Tiled Algorithm for Parallel Sparse General Matrix-Matrix Multiplication on GPUs" by Yuyao Niu, Zhengyang Lu, Haonan Ji, Shuhui Song, Zhou Jin, and Weifeng Liu.

Failed internal tests #3

Open elvircrn opened 1 year ago

elvircrn commented 1 year ago

On this version of the code: https://github.com/SuperScientificSoftwareLaboratory/TileSpGEMM/pull/2

and (at least) on the matrices cant and pdb1HYS from https://sparse.tamu.edu/Williams, built with the flag -D CHECK_RESULT=1, the code produced the following output, which reports that the check failed:

Input:

./test -d 0 -aat 0 cant/cant.mtx

Output:

--------------------------------!!!!!!!!------------------------------------
device_id = 0
---------------------------------------------------------------
Device [ 0 ] GeForce GTX 1650 Ti @ 1485.00 MHz
MAT: -------------- cant/cant.mtx --------------
input matrix A: ( 62451, 62451 ) nnz = 4007383
 loadfile time    = 0.67493 sec
the tilesize = 16
SpGEMM nnzCub = 269486473
CSR to Tile conversion uses 28.78 ms
tile space overhead = 37.74 MB
step1 ----Calculate the number and tile-column index of tiles of matrixC---
step1 ---------------------- Runtime is  0.37 ms-------------------------

step2 --------Calculate the number of nonzeros of each tile of matrixC-----
step2 ---------------------- Runtime is  4.06 ms-------------------------

step3 ---------Calculate the val&col of nonzeros of matrixC-------------
step3 ---------------------- Runtime is  48.40 ms------------------------

-----------------------Malloc uses 0.71 ms-------------------------------
Non-empty tiles of C = 194910
nnzC = 17440029
CUDA  TileSpGEMM runtime is 53.63 ms, gflops = 10.05
-------------------------------check----------------------------------------
tile to CSR conversion complete!

--------------- SpGEMM (using cuSPARSE) ---------------
 - cuda SpGEMM start! Benchmark runs 1 times.
 - cuda SpGEMM completed!

nnzC = 0, nnzCub = 269486473, Compression rate =  inf
CUDA  cuSPARSE SpGEMM runtime is 1.3550 ms, GFlops = 397.7660
cuSPARSE failed!
---------------------------------------------------------------
---------------------------------------------------------------

Input:

./test -d 0 -aat 0 pdb1HYS/pdb1HYS.mtx

Output:

--------------------------------!!!!!!!!------------------------------------
device_id = 0
---------------------------------------------------------------
Device [ 0 ] GeForce GTX 1650 Ti @ 1485.00 MHz
MAT: -------------- pdb1HYS/pdb1HYS.mtx --------------
input matrix A: ( 36417, 36417 ) nnz = 4344765
 loadfile time    = 0.69516 sec
the tilesize = 16
SpGEMM nnzCub = 555322659
CSR to Tile conversion uses 33.98 ms
tile space overhead = 40.01 MB
step1 ----Calculate the number and tile-column index of tiles of matrixC---
step1 ---------------------- Runtime is  0.34 ms-------------------------

step2 --------Calculate the number of nonzeros of each tile of matrixC-----
step2 ---------------------- Runtime is  6.93 ms-------------------------

step3 ---------Calculate the val&col of nonzeros of matrixC-------------
step3 ---------------------- Runtime is  93.50 ms------------------------

-----------------------Malloc uses 0.95 ms-------------------------------
Non-empty tiles of C = 221571
nnzC = 19594581
CUDA  TileSpGEMM runtime is 101.79 ms, gflops = 10.91
-------------------------------check----------------------------------------
tile to CSR conversion complete!

--------------- SpGEMM (using cuSPARSE) ---------------
 - cuda SpGEMM start! Benchmark runs 1 times.
 - cuda SpGEMM completed!

nnzC = 0, nnzCub = 555322659, Compression rate =  inf
CUDA  cuSPARSE SpGEMM runtime is 1.3250 ms, GFlops = 838.2229
cuSPARSE failed!
---------------------------------------------------------------
---------------------------------------------------------------

However, when run against https://sparse.tamu.edu/SNAP/CollegeMsg,

Input:

./test -d 0 -aat 0 CollegeMsg/CollegeMsg.mtx

Output:

--------------------------------!!!!!!!!------------------------------------
device_id = 0
---------------------------------------------------------------
Device [ 0 ] GeForce GTX 1650 Ti @ 1485.00 MHz
MAT: -------------- /home/elvircrn/tug/thesis/repo/matrices/CollegeMsg/CollegeMsg.mtx --------------
input matrix A: ( 1899, 1899 ) nnz = 20296
 loadfile time    = 0.00273 sec
the tilesize = 16
SpGEMM nnzCub = 744395
CSR to Tile conversion uses 1.14 ms
tile space overhead = 0.61 MB
step1 ----Calculate the number and tile-column index of tiles of matrixC---
step1 ---------------------- Runtime is  0.20 ms-------------------------

step2 --------Calculate the number of nonzeros of each tile of matrixC-----
step2 ---------------------- Runtime is  0.90 ms-------------------------

step3 ---------Calculate the val&col of nonzeros of matrixC-------------
step3 ---------------------- Runtime is  3.51 ms------------------------

-----------------------Malloc uses 0.46 ms-------------------------------
Non-empty tiles of C = 14154
nnzC = 407071
CUDA  TileSpGEMM runtime is 5.17 ms, gflops = 0.29
-------------------------------check----------------------------------------
tile to CSR conversion complete!

--------------- SpGEMM (using cuSPARSE) ---------------
 - cuda SpGEMM start! Benchmark runs 1 times.
 - cuda SpGEMM completed!

nnzC = 407071, nnzCub = 744395, Compression rate = 1.83
CUDA  cuSPARSE SpGEMM runtime is 1.7550 ms, GFlops = 0.8483

Validating results...
[PASSED] nnzC = 407071
[PASSED] row_pointer
[PASSED] column_index & value
---------------------------------------------------------------
---------------------------------------------------------------

the code passes its own tests.

I was therefore unable to reproduce the results from the paper with this setup. Please let me know if I have made an error at some point, or if more information is needed.

Thanks, Elvir

TileSpGEMM commented 1 year ago

Hi!

The cuSPARSE results are all zero (nnzC = 0), which indicates that cuSPARSE failed to complete the computation in the symbolic phase. Could you please share your environment setup and the memory usage when running matrices such as 'cant'?

Thanks.

Best,

Yuyao
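
A note on the numbers in the failing runs: the logged values fit GFlops = 2 * nnzCub / runtime and compression rate = nnzCub / nnzC. These formulas are inferred from the logs in this thread, not taken from the TileSpGEMM source, and the helper names below are hypothetical. Under that assumption, a minimal sketch shows why a run that stops early reports a very high GFlops figure and a compression rate of inf:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: throughput in GFlops, assuming 2 flops
// (one multiply, one add) per intermediate product counted in nnzCub.
double gflops(long long nnzCub, double runtime_ms) {
    assert(runtime_ms > 0.0);
    // runtime_ms * 1e6 == seconds * 1e9, so the result is already in GFlops.
    return 2.0 * static_cast<double>(nnzCub) / (runtime_ms * 1e6);
}

// Hypothetical helper: intermediate products per stored nonzero of C.
// With nnzC == 0 the IEEE 754 division yields +inf, matching "inf" in the log.
double compression_rate(long long nnzCub, long long nnzC) {
    return static_cast<double>(nnzCub) / static_cast<double>(nnzC);
}
```

With the cant values from the log (nnzCub = 269486473), gflops(nnzCub, 53.63) is about 10.05 and gflops(nnzCub, 1.3550) is about 397.77, so the cuSPARSE figure most likely reflects a call that returned after 1.355 ms without producing C, not a faster multiplication.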

elvircrn commented 1 year ago

I've switched to a different PC with a larger GPU. I should have done that before posting anyway, so I apologize for that.

Here is my current environment:

make
nvcc -O3 -w -arch=compute_86 -code=sm_86 -gencode=arch=compute_86,code=sm_86 -G -Xcompiler -fopenmp -Xcompiler -mfma main.cu -o test -I/usr/local/cuda-11.3/include -L/usr/local/cuda-11.3/lib64  -lcudart  -lcusparse  -D VALUE_TYPE=double -D CHECK_RESULT=1
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
--------------------------------!!!!!!!!------------------------------------
device_id = 0
---------------------------------------------------------------
Device [ 0 ] NVIDIA GeForce RTX 3090 @ 1740.00 MHz
MAT: -------------- ../../thesis/sparse-parses/data/Williams_files/cant.mtx --------------
input matrix A: ( 62451, 62451 ) nnz = 4007383
 loadfile time    = 0.47239 sec
the tilesize = 16
SpGEMM nnzCub = 269486473
CSR to Tile conversion uses 9.51 ms
tile space overhead = 37.74 MB
step1 ----Calculate the number and tile-column index of tiles of matrixC---
step1 ---------------------- Runtime is  0.13 ms-------------------------

step2 --------Calculate the number of nonzeros of each tile of matrixC-----
step2 ---------------------- Runtime is  0.54 ms-------------------------

step3 ---------Calculate the val&col of nonzeros of matrixC-------------
step3 ---------------------- Runtime is  5.45 ms------------------------

-----------------------Malloc uses 0.47 ms-------------------------------
Non-empty tiles of C = 194910
nnzC = 17440197
CUDA  TileSpGEMM runtime is 6.62 ms, gflops = 81.37
-------------------------------check----------------------------------------
tile to CSR conversion complete!

--------------- SpGEMM (using cuSPARSE) ---------------
 - cuda SpGEMM start! Benchmark runs 1 times.
 - cuda SpGEMM completed!

nnzC = 17440029, nnzCub = 269486473, Compression rate = 15.45
CUDA  cuSPARSE SpGEMM runtime is 28.8180 ms, GFlops = 18.7026

Validating results...
[NOT PASSED] nnzC = 17440029, nnzC_golden = 17440197
[NOT PASSED] row_pointer, #err = 62451
[NOT PASSED] column_index & value, #err = 17423244 (99.90% #nnz)
---------------------------------------------------------------
---------------------------------------------------------------

Here nnzC is no longer 0.
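
For context on the three [NOT PASSED] lines, here is a minimal sketch of a CSR comparison of this kind. It is not the checker from the repository: the Csr and CheckResult types, the function name, and the tolerance are all assumptions made for illustration. It shows how a small difference in nnzC (here 168 entries) can still produce a mismatch in nearly every row pointer and nearly every entry, because everything after the first differing row is shifted.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CSR container; not the SMatrix type used in the repository.
struct Csr {
    std::vector<int> row_ptr;   // size rows + 1
    std::vector<int> col_idx;   // size nnz
    std::vector<double> val;    // size nnz
};

struct CheckResult {
    bool nnz_ok;                 // do both matrices store the same number of entries?
    std::size_t row_ptr_errors;  // positions where row_ptr differs
    std::size_t entry_errors;    // positions where (column, value) differs
};

// Position-by-position comparison against a reference ("golden") result.
// Assumes both matrices have the same row count and sorted columns per row.
CheckResult compare_csr(const Csr& c, const Csr& golden, double tol = 1e-9) {
    assert(c.row_ptr.size() == golden.row_ptr.size());
    assert(c.col_idx.size() == c.val.size());
    assert(golden.col_idx.size() == golden.val.size());

    CheckResult r{c.col_idx.size() == golden.col_idx.size(), 0, 0};

    for (std::size_t i = 0; i < golden.row_ptr.size(); ++i)
        if (c.row_ptr[i] != golden.row_ptr[i]) ++r.row_ptr_errors;

    const std::size_t common = std::min(c.col_idx.size(), golden.col_idx.size());
    for (std::size_t k = 0; k < common; ++k) {
        // Relative tolerance for large values, absolute for small ones.
        const double scale = std::max(1.0, std::fabs(golden.val[k]));
        const bool same_col = c.col_idx[k] == golden.col_idx[k];
        const bool same_val = std::fabs(c.val[k] - golden.val[k]) <= tol * scale;
        if (!same_col || !same_val) ++r.entry_errors;
    }
    // Entries present in only one of the two matrices count as errors.
    r.entry_errors += std::max(c.col_idx.size(), golden.col_idx.size()) - common;
    return r;
}
```

Under this kind of positional check, row_pointer #err = 62451 for a matrix with 62451 rows is consistent with every offset after the leading 0 being different, which would mean the two results already disagree in the first row.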

Please keep in mind that I am currently on the following branch:

https://github.com/SuperScientificSoftwareLaboratory/TileSpGEMM/pull/2

Let me know how I can help you debug this.

Thanks, Elvir

TileSpGEMM commented 1 year ago

Hello!

Could you please set the environment:

make
nvcc -O3 -w -arch=compute_61 -code=sm_86 -gencode=arch=compute_86,code=sm_86 -G -Xcompiler -fopenmp -Xcompiler -mfma main.cu -o test -I/usr/local/cuda-11.3/include -L/usr/local/cuda-11.3/lib64 -lcudart -lcusparse -D VALUE_TYPE=double -D CHECK_RESULT=1

and try again and check the result?

Thanks,

Yuyao

elvircrn commented 1 year ago

This results in a segfault:

./test -d 0 -aat 0 ../../thesis/sparse-parses/data/Williams_files/cant.mtx 
--------------------------------!!!!!!!!------------------------------------
device_id = 0
---------------------------------------------------------------
Device [ 0 ] NVIDIA GeForce RTX 3090 @ 1740.00 MHz
MAT: -------------- ../../thesis/sparse-parses/data/Williams_files/cant.mtx --------------
input matrix A: ( 62451, 62451 ) nnz = 4007383
 loadfile time    = 0.47318 sec
the tilesize = 16
SpGEMM nnzCub = 269486473
CSR to Tile conversion uses 9.31 ms
tile space overhead = 37.74 MB
step1 ----Calculate the number and tile-column index of tiles of matrixC---
step1 ---------------------- Runtime is  3.32 ms-------------------------

step2 --------Calculate the number of nonzeros of each tile of matrixC-----
step2 ---------------------- Runtime is  24.71 ms-------------------------

step3 ---------Calculate the val&col of nonzeros of matrixC-------------
step3 ---------------------- Runtime is  25.07 ms------------------------

-----------------------Malloc uses 2.34 ms-------------------------------
Non-empty tiles of C = 117884
nnzC = 14465986
CUDA  TileSpGEMM runtime is 55.47 ms, gflops = 9.72
-------------------------------check----------------------------------------
Segmentation fault (core dumped)

Removing -O3 and running with valgrind, I get the following:

==738302== Memcheck, a memory error detector
==738302== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==738302== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==738302== Command: ./test -d 0 -aat 0 ../../thesis/sparse-parses/data/Williams_files/cant.mtx
==738302== 
--------------------------------!!!!!!!!------------------------------------
device_id = 0
==738302== Warning: noted but unhandled ioctl 0x30000001 with no size/direction hints.
==738302==    This could cause spurious value errors to appear.
==738302==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==738302== Warning: noted but unhandled ioctl 0x27 with no size/direction hints.
==738302==    This could cause spurious value errors to appear.
==738302==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==738302== Warning: noted but unhandled ioctl 0x25 with no size/direction hints.
==738302==    This could cause spurious value errors to appear.
==738302==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==738302== Warning: noted but unhandled ioctl 0x17 with no size/direction hints.
==738302==    This could cause spurious value errors to appear.
==738302==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==738302== Warning: set address range perms: large range [0x200000000, 0x300200000) (noaccess)
==738302== Warning: set address range perms: large range [0x15009000, 0x35008000) (noaccess)
==738302== Warning: noted but unhandled ioctl 0x19 with no size/direction hints.
==738302==    This could cause spurious value errors to appear.
==738302==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==738302== Warning: noted but unhandled ioctl 0x49 with no size/direction hints.
==738302==    This could cause spurious value errors to appear.
==738302==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==738302== Warning: noted but unhandled ioctl 0x21 with no size/direction hints.
==738302==    This could cause spurious value errors to appear.
==738302==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==738302== Warning: noted but unhandled ioctl 0x1b with no size/direction hints.
==738302==    This could cause spurious value errors to appear.
==738302==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==738302== Warning: noted but unhandled ioctl 0x44 with no size/direction hints.
==738302==    This could cause spurious value errors to appear.
==738302==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==738302== Warning: noted but unhandled ioctl 0x48 with no size/direction hints.
==738302==    This could cause spurious value errors to appear.
==738302==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
---------------------------------------------------------------
Device [ 0 ] NVIDIA GeForce RTX 3090 @ 1740.00 MHz
MAT: -------------- ../../thesis/sparse-parses/data/Williams_files/cant.mtx --------------
input matrix A: ( 62451, 62451 ) nnz = 4007383
 loadfile time    = 12.98983 sec
the tilesize = 16
SpGEMM nnzCub = 269486473
CSR to Tile conversion uses 10700.15 ms
tile space overhead = 37.74 MB
==738302== Warning: set address range perms: large range [0x60000000, 0x9ffff000) (noaccess)
step1 ----Calculate the number and tile-column index of tiles of matrixC---
step1 ---------------------- Runtime is  59.35 ms-------------------------

step2 --------Calculate the number of nonzeros of each tile of matrixC-----
step2 ---------------------- Runtime is  27.45 ms-------------------------

step3 ---------Calculate the val&col of nonzeros of matrixC-------------
step3 ---------------------- Runtime is  175.31 ms------------------------

-----------------------Malloc uses 3.47 ms-------------------------------
Non-empty tiles of C = 117884
nnzC = 14462034
CUDA  TileSpGEMM runtime is 268.08 ms, gflops = 2.01
-------------------------------check----------------------------------------
==738302== Thread 32:
==738302== Conditional jump or move depends on uninitialised value(s)
==738302==    at 0x10DF1A: tile2csr(SMatrix*) [clone ._omp_fn.0] (in /home/crncevicadm/TileSpGEMM/src/test)
==738302==    by 0x1316378D: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==738302==    by 0x133A9608: start_thread (pthread_create.c:477)
==738302==    by 0x132CA292: clone (clone.S:95)
==738302== 
==738302== Thread 1:
==738302== Conditional jump or move depends on uninitialised value(s)
==738302==    at 0x10DF1A: tile2csr(SMatrix*) [clone ._omp_fn.0] (in /home/crncevicadm/TileSpGEMM/src/test)
==738302==    by 0x1315B8E5: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==738302==    by 0x1164B6: tile2csr(SMatrix*) (in /home/crncevicadm/TileSpGEMM/src/test)
==738302==    by 0x10D284: main (in /home/crncevicadm/TileSpGEMM/src/test)
==738302== 
==738302== Thread 17:
==738302== Conditional jump or move depends on uninitialised value(s)
==738302==    at 0x11636C: tile2csr(SMatrix*) [clone ._omp_fn.1] (in /home/crncevicadm/TileSpGEMM/src/test)
==738302==    by 0x1316378D: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==738302==    by 0x133A9608: start_thread (pthread_create.c:477)
==738302==    by 0x132CA292: clone (clone.S:95)
==738302== 
==738302== Thread 1:
==738302== Conditional jump or move depends on uninitialised value(s)
==738302==    at 0x11636C: tile2csr(SMatrix*) [clone ._omp_fn.1] (in /home/crncevicadm/TileSpGEMM/src/test)
==738302==    by 0x1315B8E5: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==738302==    by 0x11657B: tile2csr(SMatrix*) (in /home/crncevicadm/TileSpGEMM/src/test)
==738302==    by 0x10D284: main (in /home/crncevicadm/TileSpGEMM/src/test)
==738302== 
tile to CSR conversion complete!

--------------- SpGEMM (using cuSPARSE) ---------------
CUSPARSE API failed at line 162 with error: initialization error (1)
==738302== 
==738302== HEAP SUMMARY:
==738302==     in use at exit: 261,900,291 bytes in 12,038 blocks
==738302==   total heap usage: 470,695 allocs, 458,657 frees, 589,192,636 bytes allocated
==738302== 
==738302== LEAK SUMMARY:
==738302==    definitely lost: 2,630,276 bytes in 5 blocks
==738302==    indirectly lost: 499,352 bytes in 2 blocks
==738302==      possibly lost: 51,035,844 bytes in 325 blocks
==738302==    still reachable: 207,734,819 bytes in 11,706 blocks
==738302==         suppressed: 0 bytes in 0 blocks
==738302== Rerun with --leak-check=full to see details of leaked memory
==738302== 
==738302== Use --track-origins=yes to see where uninitialised values come from
==738302== For lists of detected and suppressed errors, rerun with: -s
==738302== ERROR SUMMARY: 7808 errors from 4 contexts (suppressed: 0 from 0)

This seems to be the culprit:

CUSPARSE API failed at line 162 with error: initialization error (1)

at

    //--------------------------------------------------------------------------
    // CUSPARSE APIs
    cusparseHandle_t handle = NULL;
    cusparseSpMatDescr_t matA, matB, matC;

    cusparseStatus_t status = cusparseCreate(&handle);
    if (status != CUSPARSE_STATUS_SUCCESS) {
        printf("CUSPARSE API failed at line %d with error: %s (%d)\n", __LINE__, cusparseGetErrorString(status), status);
        exit(1);
    }

Thanks, Elvir