JuliaLinearAlgebra / BLASBenchmarksGPU.jl

Benchmark BLAS libraries on GPUs
https://julialinearalgebra.github.io/BLASBenchmarksGPU.jl/stable/

Other libraries to include? #14

Open DilumAluthge opened 3 years ago

DilumAluthge commented 3 years ago

What other libraries would we like to include in the benchmarks?

DilumAluthge commented 3 years ago

Are there other libraries that we should be including? In particular, are there any other Julia packages that provide matrix multiplication on the GPU?

Some people who might be interested: @maleadt @thomasfaingnaert @chriselrod @mcabbott @vchuravy

thomasfaingnaert commented 3 years ago

Is this package limited to just GEMM? If not, there's also cuTENSOR, which builds on top of CUTLASS to provide highly performant tensor contractions on GPUs.

DilumAluthge commented 3 years ago

> Is this package limited to just GEMM?

Not at all! It would be great to have a variety of problems that we benchmark.

DilumAluthge commented 3 years ago

> If not, there's also cuTENSOR, which builds on top of CUTLASS to provide highly performant tensor contractions on GPUs.

Added to the list!

We already have Artifacts for cuTENSOR, right? For use with CUDA.jl?

mcabbott commented 3 years ago

I believe that TensorOperations.jl wraps at least some cuTENSOR operations.

Some permutedims operations might be worth considering for your non-GEMM list.
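
For concreteness, a permutedims case could look roughly like the sketch below. This is only an illustration on the CPU using Base functions; the array sizes and the permutation `(3, 1, 2)` are arbitrary choices of mine, not something the package defines, and a GPU benchmark would presumably apply the same calls to GPU arrays.

```julia
# Illustrative input: a 3-D array (sizes chosen arbitrarily for this sketch).
X = rand(Float32, 64, 32, 16)

# Allocating version: returns a new array whose dimensions are reordered
# so that Y[a, b, c] == X[b, c, a].
Y = permutedims(X, (3, 1, 2))

# In-place version: writes into a preallocated destination that already
# has the permuted shape (16, 64, 32).
Z = similar(X, 16, 64, 32)
permutedims!(Z, X, (3, 1, 2))

# Both versions should agree element for element.
size(Y), Y == Z
```

The in-place form is probably the more interesting one to time, since it leaves allocation out of the measurement.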

mcabbott commented 3 years ago

Another thing you might consider including is OMEinsum's fallback method, which is a lazy broadcasting routine. (It calls BLAS where possible; the fallback is intended for other, weirder contractions.) And possibly just plain broadcasting, too.

Syntax, and quick CPU timing:

julia> using BenchmarkTools

julia> N = 500; A = rand(N,N); B = rand(N,N); C = similar(A*B);

julia> C13 = reshape(C, N,1,N); B23 = reshape(B, 1,N,N);

julia> @btime sum!($C13, $A .* $B23);
  512.367 ms (2 allocations: 953.67 MiB)

julia> using OMEinsum

julia> @btime OMEinsum.loop_einsum!(ein"ik,kj->ij", ($A, $B), $C, $(IndexSize('i'=>N, 'j'=>N, 'k'=>N)));
  162.147 ms (2 allocations: 80 bytes)

julia> using LazyArrays

julia> @btime sum!($C13, LazyArray(@~ $A .* $B23));
  1.065 s (0 allocations: 0 bytes)
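
As a sanity check on the plain-broadcasting formulation, here is a minimal sketch showing that the reshape, broadcast, and `sum!` combination reproduces an ordinary matrix product. It uses only Base and the LinearAlgebra standard library, and a deliberately small `N` (my choice, not part of the timings) so the N×N×N temporary stays tiny.

```julia
using LinearAlgebra  # stdlib; loaded for the reference product A * B

# Small size so the N×N×N temporary holds only N^3 Float64 values.
N = 20
A = rand(N, N); B = rand(N, N)
C = similar(A * B)

# View C as N×1×N and B as 1×N×N, so that broadcasting A .* B23
# yields a 3-D array T with T[i, k, j] = A[i, k] * B[k, j].
C13 = reshape(C, N, 1, N)
B23 = reshape(B, 1, N, N)

# sum! reduces over the dimension where C13 has size 1 (the k index).
# C13 shares memory with C, so the result lands in C.
sum!(C13, A .* B23)

C ≈ A * B  # should hold up to floating-point rounding
```

The eager version materializes the full N×N×N product, which is where the 953.67 MiB allocation at N = 500 comes from (500^3 values at 8 bytes each); the LazyArrays variant avoids that temporary.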