DilumAluthge opened 3 years ago
Are there other libraries that we should be including? In particular, are there any other Julia packages that provide matrix multiplication in Julia on the GPU?
Some people that might be interested: @maleadt @thomasfaingnaert @chriselrod @mcabbott @vchuravy
Is this package limited to just GEMM? If not, there's also cuTENSOR, which builds on top of CUTLASS to provide highly performant tensor contractions on GPUs.
> Is this package limited to just GEMM?

Not at all! It would be great to have a variety of problems that we benchmark.

> If not, there's also cuTENSOR, which builds on top of CUTLASS to provide highly performant tensor contractions on GPUs.

Added to the list!
We already have Artifacts for cuTENSOR, right? For use with CUDA.jl?
I believe that TensorOperations.jl wraps at least some cuTENSOR operations. Some `permutedims` operations might be worth considering for your non-GEMM list.
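To make the `permutedims` suggestion concrete, here is a minimal NumPy analogue (an assumption on my part, not code from this thread) of the kind of out-of-place axis permutation one would benchmark; it is memory-bandwidth bound rather than compute bound, which is exactly why it complements GEMM in a benchmark suite.

```python
import numpy as np

# A small rank-3 array to permute.
A = np.arange(24, dtype=np.float64).reshape(2, 3, 4)

# np.transpose only returns a view; forcing a contiguous copy materializes
# the permutation, which is the operation one would actually time.
perm = (2, 0, 1)
B = np.ascontiguousarray(np.transpose(A, perm))

# After the permutation, B[i, j, k] == A[j, k, i].
assert B.shape == (4, 2, 3)
assert B[1, 0, 2] == A[0, 2, 1]
```

The Julia equivalent would be `permutedims(A, (3, 1, 2))` (1-based axes), which likewise allocates and fills a new array.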
Another thing you might consider including is OMEinsum's fallback method, which is a lazy broadcasting routine. (It calls BLAS where possible; the fallback is intended for other, weirder contractions.) And possibly plain broadcasting, too.
Syntax, and a quick CPU timing (`@btime` needs `using BenchmarkTools`):

```julia
julia> using BenchmarkTools

julia> N = 500; A = rand(N,N); B = rand(N,N); C = similar(A*B);

julia> C13 = reshape(C, N,1,N); B23 = reshape(B, 1,N,N);

julia> @btime sum!($C13, $A .* $B23);
  512.367 ms (2 allocations: 953.67 MiB)

julia> using OMEinsum

julia> @btime OMEinsum.loop_einsum!(ein"ik,kj->ij", ($A, $B), $C, $(IndexSize('i'=>N, 'j'=>N, 'k'=>N)));
  162.147 ms (2 allocations: 80 bytes)

julia> using LazyArrays

julia> @btime sum!($C13, LazyArray(@~ $A .* $B23));
  1.065 s (0 allocations: 0 bytes)
```
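For readers less familiar with the reshape trick above, here is a NumPy sketch (my own analogue, not from this thread) of the same idea: matrix multiplication rewritten as a broadcasted elementwise product over a rank-3 intermediate, followed by a reduction over the shared index `k`.

```python
import numpy as np

N = 50
A = np.random.rand(N, N)
B = np.random.rand(N, N)

# A[i, k] and B[k, j] are reshaped so they broadcast to a rank-3 array
# indexed (i, k, j); summing over axis 1 (the k axis) recovers C = A @ B.
# This materializes the full N^3 intermediate, which is why the Julia
# timing above reports ~1 GiB allocated at N = 500.
C = (A[:, :, None] * B[None, :, :]).sum(axis=1)

assert np.allclose(C, A @ B)
```

A lazy variant (as with `LazyArray(@~ ...)` above) avoids that intermediate allocation, trading memory for the slower access pattern the timings show.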
What other libraries would we like to include in the benchmarks?