NVIDIA / CUDALibrarySamples

Question: Is the source code of cuSPARSELt publicly available? #174

Closed sambaPython24 closed 6 months ago

sambaPython24 commented 6 months ago

Hey, thank you for showing how to use cuSPARSELt matrix multiplication in the matmul_advanced example.

Since I am curious about the mathematical foundations of the computation, I was wondering whether the source code (the CUDA kernels) is publicly available as well.

Or is this host computation

    // host computation
    auto hC_result = new float[C_size];
    for (int b = 0; b < num_batches; b++) {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                float sum = 0.0f;
                for (int k1 = 0; k1 < k; k1++) {
                    auto posA = (A_std_layout) ? i * lda + k1 : i + k1 * lda;
                    auto posB = (B_std_layout) ? k1 * ldb + j : k1 + j * ldb;
                    posA     += b * batch_strideA;
                    posB     += b * batch_strideB;
                    sum      += static_cast<float>(hA[posA]) *  // [i][k]
                                static_cast<float>(hB[posB]);   // [k][j]
                }
                auto posC       = (is_rowmajor) ? i * ldc + j : i + j * ldc;
                posC           += b * batch_strideC;
                hC_result[posC] = ReLU(sum + 1.0f /*bias*/);  // [i][j]
            }
        }
    }

essentially the implemented algorithm, whose result is then compared with the device output by

    // host-device comparison
    int correct = 1;
    for (int b = 0; b < num_batches; b++) {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                auto pos          = (is_rowmajor) ? i * ldc + j : i + j * ldc;
                pos              += b * batch_strideC;
                auto device_value = static_cast<float>(hC[pos]);
                auto host_value   = hC_result[pos];
                if (device_value != host_value) {
                    // direct floating point comparison is not reliable
                    correct = 0;
                    break;
                }
            }
        }
    }
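
The comment in the comparison loop notes that a direct floating-point comparison is not reliable. One common alternative is a mixed absolute/relative tolerance check. The following is only a sketch: `nearly_equal`, `all_nearly_equal`, and the default tolerances are hypothetical names and values that do not come from the sample, and suitable tolerances depend on the data type and on `k`.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical helper (not part of the sample): true when the two values agree
// within a mixed absolute/relative tolerance. The default tolerances are
// placeholders and should be tuned to the data type and to k.
bool nearly_equal(float device_value, float host_value,
                  float abs_tol = 1e-5f, float rel_tol = 1e-3f) {
    float diff  = std::fabs(device_value - host_value);
    float scale = std::max(std::fabs(device_value), std::fabs(host_value));
    return diff <= abs_tol + rel_tol * scale;
}

// Hypothetical helper: compares two buffers of `count` floats element by element
// and stops at the first mismatch.
bool all_nearly_equal(const float* device, const float* host, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        if (!nearly_equal(device[i], host[i])) return false;
    }
    return true;
}
```

In the loop above, `device_value != host_value` could then be swapped for `!nearly_equal(device_value, host_value)`, assuming the helper is visible at that point.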

fbusato commented 6 months ago

The implemented algorithm is essentially the one provided in the example. The cuSPARSELt source code is strictly confidential and contains many low-level optimizations.
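
For readers who want to try the math in isolation, here is a stripped-down sketch of the same host computation, C = ReLU(A * B + bias). It assumes a single batch and densely packed row-major matrices (`lda = k`, `ldb = n`, `ldc = n`). The function name `matmul_bias_relu` is hypothetical. It shows what result is computed, not how the cuSPARSELt kernels compute it.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Simplified, hypothetical reference: C = ReLU(A * B + bias) for one batch,
// with A (m x k), B (k x n), and C (m x n) stored row-major and densely packed.
std::vector<float> matmul_bias_relu(const std::vector<float>& A,
                                    const std::vector<float>& B,
                                    int m, int n, int k, float bias) {
    std::vector<float> C(static_cast<std::size_t>(m) * n);
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k1 = 0; k1 < k; k1++) {
                sum += A[i * k + k1] * B[k1 * n + j];  // A[i][k1] * B[k1][j]
            }
            C[i * n + j] = std::max(sum + bias, 0.0f);  // bias, then ReLU
        }
    }
    return C;
}
```

For example, with A = [[1, -2], [3, 4]], B the 2x2 identity, and a bias of 1, the sums before the activation are [[2, -1], [4, 5]], and ReLU clamps the -1 to 0.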