Hey,
thank you for providing an example of how to use
cuSPARSELt matrix multiplication in the matmul_advanced example.
Since I am curious about the mathematical foundations
of the computation, I was wondering whether the
source code (the CUDA kernels) is publicly available as well.
Or is this host computation
    // host computation
    auto hC_result = new float[C_size];
    for (int b = 0; b < num_batches; b++) {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                float sum = 0.0f;
                for (int k1 = 0; k1 < k; k1++) {
                    auto posA = (A_std_layout) ? i * lda + k1 : i + k1 * lda;
                    auto posB = (B_std_layout) ? k1 * ldb + j : k1 + j * ldb;
                    posA += b * batch_strideA;
                    posB += b * batch_strideB;
                    sum += static_cast<float>(hA[posA]) *  // [i][k]
                           static_cast<float>(hB[posB]);   // [k][j]
                }
                auto posC = (is_rowmajor) ? i * ldc + j : i + j * ldc;
                posC += b * batch_strideC;
                hC_result[posC] = ReLU(sum + 1.0f /*bias*/);  // [i][j]
            }
        }
    }
essentially the implemented algorithm,
with the device result then checked against it by
    // host-device comparison
    int correct = 1;
    for (int b = 0; b < num_batches; b++) {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                auto pos = (is_rowmajor) ? i * ldc + j : i + j * ldc;
                pos += b * batch_strideC;
                auto device_value = static_cast<float>(hC[pos]);
                auto host_value = hC_result[pos];
                if (device_value != host_value) {
                    // direct floating point comparison is not reliable
                    correct = 0;
                    break;
                }
            }
        }
    }
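As the comment in the comparison loop notes, exact floating-point equality is fragile. A tolerance-based check could look roughly like the sketch below. The helpers almost_equal and count_mismatches are hypothetical names, not part of the cuSPARSELt sample, and the tolerance values are placeholders that would need tuning for the data types actually used.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical helper (not from the cuSPARSELt sample): true when a and b
// agree within an absolute OR a relative tolerance. Tolerances are assumed
// placeholder values, not recommendations from the library.
bool almost_equal(float a, float b,
                  float rel_tol = 1e-3f, float abs_tol = 1e-5f) {
    if (std::isnan(a) || std::isnan(b)) return false;   // NaN never matches
    if (std::isinf(a) || std::isinf(b)) return a == b;  // infinities must match exactly
    const float diff  = std::fabs(a - b);
    const float scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(abs_tol, rel_tol * scale);
}

// Hypothetical helper: counts the elements of two equally sized buffers
// that differ by more than the tolerance.
int count_mismatches(const float* device, const float* host, std::size_t n) {
    int mismatches = 0;
    for (std::size_t idx = 0; idx < n; ++idx) {
        if (!almost_equal(device[idx], host[idx])) ++mismatches;
    }
    return mismatches;
}
```

In the loop above, the test `device_value != host_value` would then become `!almost_equal(device_value, host_value)`.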
The implemented algorithm is essentially the one provided in the example. The cuSPARSELt source code is strictly confidential and contains many low-level optimizations.
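For reference, the host loop quoted in the question can be restated as a formula. This only rewrites the quoted reference code (with the bias fixed at 1, as in the sample, and ReLU assumed to be the usual max(x, 0)); it says nothing about how the device kernels are organized internally. The superscript (b) is the batch index.

```latex
C^{(b)}_{ij} \;=\; \mathrm{ReLU}\!\left( \sum_{k'=0}^{k-1} A^{(b)}_{ik'} \, B^{(b)}_{k'j} \;+\; 1 \right),
\qquad \mathrm{ReLU}(x) = \max(x, 0),
\qquad 0 \le i < m,\; 0 \le j < n,\; 0 \le b < \text{num\_batches}
```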