Incorrect matrix multiplication result on two different matrices

aabbas90 commented 3 years ago

Hi,

I am trying to multiply two very small (but different) matrices A: 4 x 7, B: 7 x 4 the dimension and fill pattern of the resulting matrix is correct but the values are not. For example showing one such comparison:

Row: 2 | Values do NOT match: (Ref|Comp) : (-3.000000|2.000000) - pos: 0/3 - col 0

I modified the DataLoader.cpp as follows to read the two matrices as:

template <typename ValueType>
DataLoader<ValueType>::DataLoader(std::string path) : matrices()
{
    std::string csrPath_A = "../../A_sp_cpu.hicsr"; 
    std::string csrPath_B = "../../B_sp_cpu.hicsr";

    try
    {
        std::cout << "trying to load csr file \"" << csrPath_A << "\"\n";
        matrices.cpuA = loadCSR<ValueType>(csrPath_A.c_str());
        std::cout << "successfully loaded: \"" << csrPath_A << "\"\n";

        std::cout << "trying to load csr file \"" << csrPath_B << "\"\n";
        matrices.cpuB = loadCSR<ValueType>(csrPath_B.c_str());
        std::cout << "successfully loaded: \"" << csrPath_B << "\"\n";
    }
    catch (std::exception& ex)
    {
        std::cout << "could not load csr file:\n\t" << ex.what() << "\n";
    }

    convert(matrices.gpuA, matrices.cpuA, 0);
    convert(matrices.gpuB, matrices.cpuB, 0);
}

I am attaching the matrix files. matrices.zip

Is there something I am doing wrong? Thanks!

dabeschte commented 3 years ago

Hey!

I am able to reproduce the error and will take a look at the problem. Just wanted to give you a heads up!

dabeschte commented 3 years ago

Should be fixed now. There was a typo in one of my kernels that caused the incorrect behavior. Please let me know if the problem is gone now. See https://github.com/GPUPeople/spECK/commit/367560d8efc93823474d4aadb3f21ba6c9b7bcfb

aabbas90 commented 3 years ago

@dabeschte Thanks for the quick fix! My toy example works fine now.

There is a second issue though: In some instances of my real use-case I am encountering an "illegal memory access error". I am attaching one such example instance. I have checked calling cuSparse which gives correct result.

However, cuSparse does not work for very large matrices and the alternate is too slow (see: https://forums.developer.nvidia.com/t/cusparse-cusparsescsrgemm2-much-slower-than-spgemm/181620/2). Thanks for your help!

matrices_big.zip

dabeschte commented 3 years ago

I can't reproduce this error. What hardware are you using, which GPU? Which CUDA version? Windows/Linux? The output for these matrices is:

trying to load csr file "D:/D Downloads/matrices/A_big.hicsr"
successfully loaded: "D:/D Downloads/matrices/A_big.hicsr"
trying to load csr file "D:/D Downloads/matrices/B_big.hicsr"
successfully loaded: "D:/D Downloads/matrices/B_big.hicsr"
Matrix: 333724x333724: 2220633 nonzeros
 var-SpGEMM -> NNZ: 2216832
 var-SpGEMM SpGEMM: 22.4699 ms

aabbas90 commented 3 years ago

I was using cuda_11 branch but master branch runs fine (with appropriate CUDA versions). I am using Linux. For GPUs I tried Tesla V100 and Tesla P40 and had the same results.

aabbas90 commented 3 years ago

Hi @dabeschte . Any leads to finding the reason of the crash for cuda 11? Thanks!

dabeschte commented 3 years ago

Sorry for taking so long. I am running cuda 11 too, using the 3090 on Windows 10. So I think it is very unlikely that cuda 11 is the problem. I still did not have the time to run it on Linux, but I will try to do it today. Unfortunately, I don't have the same hardware available, but I would be surprised if that is the problem.

dabeschte commented 3 years ago

Do you know where the exception is thrown? (file/line number?)

aabbas90 commented 3 years ago

The exception was thrown at:

https://github.com/GPUPeople/spECK/blob/367560d8efc93823474d4aadb3f21ba6c9b7bcfb/source/GPU/Multiply.cu#L591

The values inside the input arguments seem fine. I think the error occurs somewhere before this call already and is caught at this location.

dabeschte commented 3 years ago

Unfortunately, I was not able to reproduce the error in Ubuntu 20.04. with cuda 11.1 and the RTX 3090.

That makes debugging a little more difficult... The location of the exception could mean that the error happens in cub, but I think that is very unlikely. The exception could likely mean that spECK is accessing already freed memory. The dCSR class automatically frees all arrays in the destructor. Could it be that the destructor is called before for a copy of the object? Try to comment this line https://github.com/GPUPeople/spECK/blob/367560d8efc93823474d4aadb3f21ba6c9b7bcfb/source/dCSR.cpp#L40 to be sure.

I am sorry for the trouble. I am just really running out of ideas why this bug could happen.

aabbas90 commented 3 years ago

Hi @dabeschte , Thanks for investigating it further. I have drilled it down further:

CUDA 11.1.0 works but CUDA 11.2.2 does not. Also disabling deallocation does not solve the problem for CUDA 11.2.
Further nailed down the error to: https://github.com/GPUPeople/spECK/blob/7c9f2a77008247111c18bea167b211e0ec1b7362/source/GPU/Multiply.cu#L503
Actually Tesla V100 with CC70 works fine! Combination of Tesla P40 and CC61 does not work.

dabeschte commented 3 years ago

cuda 11.2.2 works for me too.

One problem could be that you are using more than 49152 "dynamic" memory. Or are you using the same "dynamic" as "static" memory for pascal? ...since that architecture also does not support larger dynamic sizes. https://github.com/GPUPeople/spECK/blob/367560d8efc93823474d4aadb3f21ba6c9b7bcfb/include/Multiply.h#L11

aabbas90 commented 3 years ago

For Pascal I use the default values i.e. 49152. I increased the value for dynamic memory when I tested on Volta.

dabeschte commented 3 years ago

that sounds OK. I will try to compile it for Pascal, but that might take a few days. Do you need it to work on both P40 and V100 with cuda 11.2?

aabbas90 commented 3 years ago

I am trying to use spECK in my research code which will possibly be made open-source. So I would be glad if we can find and eliminate configurations which are not supported by spECK for example not supporting CC61 is fine by me (in that case one can revert to cuSparse). So we find largest possible set of configurations which are guaranteed to work by spECK and update the getting started section: (https://github.com/GPUPeople/spECK/blob/master/readme.md#getting-started). What do you think?

dabeschte commented 3 years ago

Yeah I agree that it would be better to just fix the bugs. They should be minor anyways, since the code was originally developed on a cc61 GPU, but then optimized for a Titan V (cc70).

I found the Pascal bug which actually has nothing to do with the GPU generation. It was a bug present in all versions, but only triggered on pascal (and older) due to smaller shared memory. I hope this is the last bug now. If you find something else, please tell me!

aabbas90 commented 3 years ago

Found another failure case on two 'huge' matrices with cc70 and Tesla V100. Downloadable from: https://drive.google.com/file/d/12rqw4sSSjvhryC69-DW8bZsbPvl4LBDG/view?usp=sharing

dabeschte commented 3 years ago

Nice, thanks for pointing me to all bugs, this is a great contribution to the project! This bug is also fixed now. Let me know if you find more bugs

dabeschte commented 3 years ago

@aabbas90 are the bugs fixed now? Can I close the issue?

aabbas90 commented 3 years ago

Yes. Thanks for your support.

GPUPeople / spECK

Incorrect matrix multiplication result on two different matrices #4