NVIDIA / CUDALibrarySamples

CUDA Library Samples

cuSPARSE dense matrix and sparse matrix multiplication resulting in a dense matrix #211

Open zhoeujei opened 2 months ago

zhoeujei commented 2 months ago

Why doesn't cuSPARSE support dense-matrix × sparse-matrix multiplication that produces a dense matrix? Many application scenarios require this operation. Please consider adding support.

### Tasks
- [ ] cuSPARSE
essex-edwards commented 2 months ago

The SpMM routine can do this. The API is phrased as dense = sparse * dense, but you can achieve dense = dense * sparse by transposing everything.
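
As a reference, here is a minimal sketch of that transposition trick, assuming an n×n CSR matrix A and row-major device buffers `d_X` and `d_D` for the dense input and output (the buffer and function names are placeholders, not from this thread). It relies on the identity D = X·A ⟺ Dᵀ = Aᵀ·Xᵀ, and on the fact that a row-major buffer reinterpreted as column-major already describes its own transpose, so only the sparse operand needs an explicit transpose operation:

```c
#include <cuda_runtime.h>
#include <cusparse.h>

// Sketch: compute D = X * A (dense * sparse) with cusparseSpMM via D^T = A^T * X^T.
// A is an n x n CSR matrix (d_csrOffsets, d_csrColumns, d_csrValues);
// X and D are n x n row-major float buffers on the device.
void dense_times_sparse(cusparseHandle_t handle, int64_t n, int64_t nnz,
                        int* d_csrOffsets, int* d_csrColumns, float* d_csrValues,
                        float* d_X, float* d_D)
{
    cusparseSpMatDescr_t matA;
    cusparseDnMatDescr_t matB, matC;
    float alpha = 1.0f, beta = 0.0f;

    // Sparse operand A in CSR.
    cusparseCreateCsr(&matA, n, n, nnz,
                      d_csrOffsets, d_csrColumns, d_csrValues,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);

    // A row-major n x n buffer read as column-major is its own transpose,
    // so X^T and D^T need no extra copies.
    cusparseCreateDnMat(&matB, n, n, n, d_X, CUDA_R_32F, CUSPARSE_ORDER_COL); // X^T
    cusparseCreateDnMat(&matC, n, n, n, d_D, CUDA_R_32F, CUSPARSE_ORDER_COL); // D^T

    // D^T = A^T * X^T : only the sparse operand carries an explicit transpose op.
    size_t bufferSize = 0;
    void*  dBuffer    = NULL;
    cusparseSpMM_bufferSize(handle,
                            CUSPARSE_OPERATION_TRANSPOSE,      // op(A) = A^T
                            CUSPARSE_OPERATION_NON_TRANSPOSE,  // matB already holds X^T
                            &alpha, matA, matB, &beta, matC,
                            CUDA_R_32F, CUSPARSE_SPMM_ALG_DEFAULT, &bufferSize);
    cudaMalloc(&dBuffer, bufferSize);
    cusparseSpMM(handle,
                 CUSPARSE_OPERATION_TRANSPOSE,
                 CUSPARSE_OPERATION_NON_TRANSPOSE,
                 &alpha, matA, matB, &beta, matC,
                 CUDA_R_32F, CUSPARSE_SPMM_ALG_DEFAULT, dBuffer);

    cudaFree(dBuffer);
    cusparseDestroyDnMat(matC);
    cusparseDestroyDnMat(matB);
    cusparseDestroySpMat(matA);
}
```

The dense-side "transposes" are free layout reinterpretations; the only real cost is the `CUSPARSE_OPERATION_TRANSPOSE` traversal of the CSR data, which is what the rest of this thread discusses.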

zhoeujei commented 2 months ago

It is indeed possible to achieve this through transposition, but it reduces efficiency. Please consider adding direct support.

essex-edwards commented 2 months ago

Thanks for the suggestion. Do you have any benchmarks, particular call sequences, or matrices that seem unexpectedly slow? Sharing those would help us optimize for your particular use case.

zhoeujei commented 2 months ago

Yes. I used 1024x1024 input matrices and a 1024x1024 output matrix. With these sizes, cusparseSpMM with CUSPARSE_OPERATION_NON_TRANSPOSE runs about twice as fast as cusparseSpMM with CUSPARSE_OPERATION_TRANSPOSE. Therefore, I would like to ask whether NVIDIA provides a library for dense matrix * sparse matrix = dense matrix operations.

essex-edwards commented 2 months ago

Okay. If I understand correctly, you are using SpMM to compute C = A^T B, where A is a CSR matrix and B and C are dense matrices, and you observe that this is slower than C = AB. This is expected. The performance loss is not due to a lack of a specialized API; it is due to the data layout of A^T. When A is stored in CSR, the entries of A^T are effectively stored column-by-column, which is not an algorithmically convenient order. We can, of course, try to make it faster, but a significant performance gap is probably unavoidable. If possible, try storing A^T instead of A and using opA=NON_TRANSPOSE (or storing A in CSC format). In that case, the data of A^T is arranged row-by-row, and you should see faster performance.
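
A minimal sketch of that suggestion, assuming 32-bit indices and float values (array names are placeholders): convert A from CSR to CSC once with cusparseCsr2cscEx2. The resulting CSC arrays are exactly a CSR description of A^T, which can then be passed to cusparseCreateCsr and used with CUSPARSE_OPERATION_NON_TRANSPOSE in cusparseSpMM.

```c
#include <cuda_runtime.h>
#include <cusparse.h>

// Sketch: one-time CSR -> CSC conversion of an n x n matrix A.
// The CSC arrays of A (column offsets, row indices, values) are exactly the
// CSR arrays of A^T, so they can be reused directly as a CSR matrix.
void build_At_csr(cusparseHandle_t handle, int n, int nnz,
                  const float* d_csrValues, const int* d_csrOffsets, const int* d_csrColumns,
                  float* d_cscValues, int* d_cscColOffsets, int* d_cscRowIndices)
{
    size_t bufferSize = 0;
    void*  dBuffer    = NULL;
    cusparseCsr2cscEx2_bufferSize(handle, n, n, nnz,
                                  d_csrValues, d_csrOffsets, d_csrColumns,
                                  d_cscValues, d_cscColOffsets, d_cscRowIndices,
                                  CUDA_R_32F, CUSPARSE_ACTION_NUMERIC,
                                  CUSPARSE_INDEX_BASE_ZERO,
                                  CUSPARSE_CSR2CSC_ALG1, &bufferSize);
    cudaMalloc(&dBuffer, bufferSize);
    cusparseCsr2cscEx2(handle, n, n, nnz,
                       d_csrValues, d_csrOffsets, d_csrColumns,
                       d_cscValues, d_cscColOffsets, d_cscRowIndices,
                       CUDA_R_32F, CUSPARSE_ACTION_NUMERIC,
                       CUSPARSE_INDEX_BASE_ZERO,
                       CUSPARSE_CSR2CSC_ALG1, dBuffer);
    cudaFree(dBuffer);

    // d_cscColOffsets / d_cscRowIndices / d_cscValues now describe A^T in CSR.
    // Pass them to cusparseCreateCsr and call cusparseSpMM with
    // CUSPARSE_OPERATION_NON_TRANSPOSE so the sparse data is traversed row-by-row.
}
```

The conversion is paid once; every subsequent SpMM call then runs on row-ordered data, which is the fast path discussed above.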