GPUPeople / spECK

Efficient SpGEMM on GPU using CUDA and CSR
MIT License
50 stars 16 forks source link

an illegal memory access was encountered spECK/source/GPU/Multiply.cu at line 251 #10

Closed zzzlxhhh closed 2 years ago

zzzlxhhh commented 2 years ago

I used a manufactured matrix data crafted by myself to test spECK for some purposes. Here is the testset for you to reproduce the error: test.zip

On 5.10.16.3-microsoft-standard-WSL2(ubuntu 20.04LTS)with CUDA 11.6 with rtx3060-laptop

I got the following problems handling a 2048*2048 matrix with a sparse rate of 0.2

image

Maybe it's because my matrix got too many intermediate products. but I didn't meet any problem on webbase 1M whose inter-products are also quite a lot. It would be helpful if you can give some thoughts on my issue.

dabeschte commented 2 years ago

Hm, difficult to say what the problem might be. I did not try CUDA 11.6 yet, but I tried your example on my PC (Windows 10, CUDA 11.3, RTX 3090) and both matrices run just fine. If the error is reproducable on you machine, can you try simply commenting out parts of the code until you find the issue? If you can't find it, I can also try installing CUDA 11.6

zzzlxhhh commented 2 years ago

Hi! I want to confirm that you choose to run the file[ sparse_2048matrix_02.mtx ] but not [sparse_2048matrix_01.mtx]. I can run [sparse_2048matrix_01.mtx] with sparsity of 0.1 without any problem. The tricky one is the [sparse_2048matrix_02.mtx] with sparsity of 0.2. And I also runned it on a cuda11.4 on rtx2070 coming around the same problem. so far I use compute-sanitizer to check the memory issue, It gives the following report on function readOperations() in common.cuh

➜  build git:(cuda11) ✗ compute-sanitizer ./runspECK /home/zlx/CUDAproject/GEMMdata/sparse_2048matrix_02.mtx |more
========= COMPUTE-SANITIZER
========= Invalid __global__ read of size 4 bytes
=========     at 0x820 in void readOperations<unsigned int, double, unsigned int, (unsigned int)128, (unsigned int)11904>(dCSRNoDealloc<T2>, dCSRNoDealloc<T2>, T3 *, i
nt, T1 *, T1 *, T1 *, T1 *)
=========     by thread (127,0,0) in block (90,0,0)
=========     Address 0x72bc60000 is out of bounds
=========     and is 1 bytes after the nearest allocation at 0x72b920000 of size 3407872 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x25c50c]
=========                in /usr/lib/wsl/drivers/nvlt.inf_amd64_3267c8192f4e6024/libcuda.so.1.1
=========     Host Frame:__cudart803 [0x45f4b]
=========                in /home/zlx/CUDAproject/spECK/build/./runspECK
=========     Host Frame:cudaLaunchKernel [0xa0da8]
=========                in /home/zlx/CUDAproject/spECK/build/./runspECK
=========     Host Frame:void spECK::MultiplyspECKImplementation<double, 4, 1024, 101376, 49152>(dCSR<double> const&, dCSR<double> const&, dCSR<double>&, spECK::spECKC
onfig&, Timings&) [0x3a56a]
=========                in /home/zlx/CUDAproject/spECK/build/./runspECK
=========     Host Frame:Executor<double>::run() [0x136a4]
=========                in /home/zlx/CUDAproject/spECK/build/./runspECK
=========     Host Frame:main [0xdd45]
=========                in /home/zlx/CUDAproject/spECK/build/./runspECK
=========     Host Frame:__libc_start_main [0x24083]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:_start [0xecee]
=========                in /home/zlx/CUDAproject/spECK/build/./runspECK
zzzlxhhh commented 2 years ago

So far I can only find the problem here error)%3B) As you can see, it still got some issue around the readOperations() function here:

readOperations<IndexType, DataType, IndexType, threadsPerBlock, entriesPerWarpCounting * warpsCounting><<<divup(uint32_t(matA.rows), rowsPerBlock), threadsPerBlock>>>(
            matA, matB, d_rowOperations, rowsPerBlock, d_maxComputationsPerRow, d_rowColMinMax, d_rowMaxOperations, d_sumProducts);

// copying both values at once gives a huge performance boost
uint32_t tmpArr[2];
HANDLE_ERROR(cudaMemcpy(&tmpArr, d_sumProducts, sizeof(uint32_t) * 2, cudaMemcpyDeviceToHost));

I also tried it on my windows11 using CUDA11.6 , and I got the same error. Do you consider it a CUDA version issue? I don't think it's reasonable. I'd rather believe it's a hardware issue as my rtx3060-laptop has less memory(6GB) than your 3090 ? It would be highly appreciated if you can try to reproduce the error. And let's dig out the exact problem.

dabeschte commented 2 years ago

Hey! Quick update: I am able to reproduce the error on my GTX 1080 TI on Windows 11 and with CUDA 11.3. However, I double checked and it actually does work on my RTX 3090 which is very interesting. I am on it and will let you know once I found the error.

The matrix is tiny and it should not be a memory issue. Unless cuSPARSE, which I use as a reference to check if spECK's results are correct, uses too much memory. 4M elements is nothing really.

dabeschte commented 2 years ago

Okay, so there were two problems:

Please try it out on your machine and let me know whether that fixed your problem

zzzlxhhh commented 2 years ago

Nice! The problem got fixed !
As you have mentioned, some error were triggered by the code, but it's interesting that it didn't cause any problem in 3090. Any further thoughts on this? Thanks for all the patience and efficiency you have invested on this problem.

dabeschte commented 2 years ago

Awesome.

I think that is just pure luck that the 4 bytes were allocated on my 3090, but not on the 1080 ti. Using the compute-sanitizer, I could verify that the invalid memory access also appeared on the 3090.

Thanks for cooperating and sharing the example matrix! Let me know if you encounter another issue :)

zzzlxhhh commented 2 years ago

So far so good :) The issue can be closed.

zzzlxhhh commented 2 years ago

Hi , I came up to an question relate to this. This time I didn't run into any error, but there is a problem worth to think about.

platform: old-version-speck-without-index-check ubuntu 20.04 CUDA 11.6 rtx3060-laptop

I used another manufactured matrix to test , which is 1024*1024 with sparsity of 0.2 . And the last row of the matrix is empty, which means that it could trigger the bug in older version of code.

Running the old version of code without index check, I came along with the same problem like you did in 3090, whose output is OK but with invalid memory access in compute-sanitizer check.

image

Sure, that's not your problem. I think we can blame it on Nvidia. It's interesting to think about the exact behavior when Nvidia handle invalid mem access. Pure luck? Any thoughts on this?

After all, thanks for all your time invested on this project. 👍😊

dabeschte commented 2 years ago

Interesting. But this confirms my assumption that the bug has to do with an empty last row.

It is interesting that it actually knows that no memory is allocated at this region, but does not throw an error. I think that the compute sanitizer checks actually allocated address regions while pure runtime checks only detect page faults on regions outside the heap. If a page is already initialized, the runtime would not detect it. So, if the next byte is on the next page, the memory manager would notice. If it is on the same page as the byte before, it won't notice. But I honestly don't really know how the compute sanitizer works, I just assume that is what it does.