JuliaParallel / DistributedArrays.jl

Distributed Arrays in Julia

Matrix-Matrix multiply is quite slow #187

Open vchuravy opened 5 years ago

vchuravy commented 5 years ago

While looking with @yingboma into solving a PDE just by using DArray, we found that matrix-matrix multiply is quite slow for DArray.

From discussion with @andreasnoack

  1. Distributed currently has no fetch!, i.e. a fetch into a preallocated local array, so it is hard to avoid temporaries when working across processes. This causes many copies and extra GC work, which creates communication bottlenecks (see the sketch after this list).

  2. Our communication layer doesn't support RDMA, so there are copies happening in the network stack, and we use sockets instead of shared memory for communication on the same node.

  3. There are some communication bottlenecks due to how we use the event loop, and it is possible to get into a situation where forward progress is hard to make because a machine is busy with computation and does not communicate in a timely fashion.
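For point 1, a minimal sketch of what the missing fetch! would buy us; fetch! here is hypothetical (it does not exist in Distributed), so the closest we can get today still allocates a temporary:

using Distributed
addprocs(1)
p = workers()[1]

# Today: fetching a remote chunk always materializes a fresh array on the caller.
r = remotecall(rand, p, 1024, 1024)
A = fetch(r)                        # allocates a new 1024x1024 Array

# A hypothetical fetch!(dest, r) would fill a preallocated buffer instead, so the
# repeated fetches inside a distributed matmul would not churn the GC. With the
# current API the best we can do still goes through a temporary:
buf = Matrix{Float64}(undef, 1024, 1024)
copyto!(buf, remotecall_fetch(rand, p, 1024, 1024))   # temporary is still allocated

The benchmark script used for the numbers below: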

using Distributed
addprocs(4)

using LinearAlgebra

# Set worker BLAS to one thread only
@sync for p in workers()
    @async remotecall_wait(LinearAlgebra.BLAS.set_num_threads, p, 1)
end

using BenchmarkTools
using DistributedArrays

const suite = BenchmarkGroup()
suite["Array"] = BenchmarkGroup()
suite["distribute"] = BenchmarkGroup()

function benchmark(T=Array, N=10)
    @benchmarkable A * B setup=(A = $T(rand($N, $N)); B = $T(rand($N, $N)))
end

for N in (2^i for i = 5:13)
    suite["Array"][N] = benchmark(Array, N)
    suite["distribute"][N] = benchmark(distribute, N)
end

tune!(suite)
results = run(suite)

I would be interested in gathering numbers from different systems here. My first set of results is from my local laptop with 2 cores / 4 threads, using 4 Julia processes.
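To make gathering numbers easier, and assuming a reasonably recent BenchmarkTools, the trial results can be serialized to JSON and attached to a comment (the file name below is arbitrary):

using BenchmarkTools
BenchmarkTools.save("matmul_results.json", results)       # write the trials to JSON
loaded = BenchmarkTools.load("matmul_results.json")[1]    # read them back later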

There is a lot of overhead for smallish problems, but the results aren't that bad once we get to interesting problem sizes...

julia> for (name, trial) in sort(collect(results["Array"]), by=x->time(x[2]))
           t = time(trial) / 1e6
           println(rpad(name, 25, "."), lpad(string(round(t, digits=2), " ms"), 20, "."))
       end
32.....................................0.0 ms
64....................................0.02 ms
128...................................0.06 ms
256...................................0.41 ms
512...................................3.51 ms
1024.................................24.51 ms
2048................................249.21 ms
4096...............................2226.76 ms
8192..............................18990.07 ms

julia> for (name, trial) in sort(collect(results["distribute"]), by=x->time(x[2]))
           t = time(trial) / 1e6
           println(rpad(name, 25, "."), lpad(string(round(t, digits=2), " ms"), 20, "."))
       end
32....................................2.01 ms
64....................................2.06 ms
128...................................2.32 ms
256...................................2.97 ms
512...................................6.63 ms
1024..................................34.2 ms
2048................................261.15 ms
4096...............................2295.89 ms
8192..............................17112.45 ms

montyvesselinov commented 5 years ago

julia> versioninfo()
Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

julia> for (name, trial) in sort(collect(results["Array"]), by=x->time(x[2]))
                  t = time(trial) / 1e6
                  println(rpad(name, 25, "."), lpad(string(round(t, digits=2), " ms"), 20, "."))
              end
32.....................................0.0 ms
64....................................0.02 ms
128...................................0.04 ms
256...................................0.19 ms
512....................................1.4 ms
1024...................................9.3 ms
2048.................................71.92 ms
4096................................600.13 ms
8192...............................5114.26 ms

julia> for (name, trial) in sort(collect(results["distribute"]), by=x->time(x[2]))
                  t = time(trial) / 1e6
                  println(rpad(name, 25, "."), lpad(string(round(t, digits=2), " ms"), 20, "."))
              end
32....................................1.46 ms
64....................................1.48 ms
128...................................1.73 ms
256...................................2.36 ms
512...................................5.35 ms
1024.................................24.76 ms
2048................................134.15 ms
4096...............................1304.26 ms
8192...............................9542.33 ms

ViralBShah commented 4 years ago

Does Elemental provide a matmul? Does someone know if we use it in Elemental.jl? Should we remove the slow matmul we have in here, and let other packages provide it for now?

ViralBShah commented 4 years ago

This package by @haampie might come in handy:

https://github.com/haampie/COSMA.jl
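
A rough usage sketch, assuming the mul! overloads for DArrays of BLAS element types that haampie describes further down in this thread; sizes and process count are arbitrary:

using Distributed; addprocs(4)
using DistributedArrays, LinearAlgebra
using COSMA                      # assumed to add the specialized distributed mul! methods

A = drand(4096, 4096)
B = drand(4096, 4096)
C = dzeros(4096, 4096)
mul!(C, A, B)                    # should dispatch to COSMA's distributed GEMM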

andreasnoack commented 4 years ago

Yes. It has a BLAS-like interface https://github.com/elemental/Elemental/blob/master/include/El/blas_like/level3.h

andreasnoack commented 4 years ago

> Should we remove the slow matmul we have in here, and let other packages provide it for now?

It comes at a price though since it effectively makes the package dependent on MPI. It will also restrict the matmul to a few element types.

ViralBShah commented 4 years ago

Can we use Elemental's matmul today (as in, it is already wrapped), or do we still need to do the wrapping?

The MPI dependency is practically unavoidable. I'd say matmul already falls into the linear algebra camp, and at that point avoiding MPI is not straightforward. Also, we now have MPI and all the MPI libraries in BinaryBuilder, and it is all much easier to get working out of the box.

vchuravy commented 4 years ago

I don't think we should add an MPI dependency to DistributedArrays. It would make things much harder to untangle and is not necessary from a technical point of view. From a practical standpoint Elemental gives a better user experience right now, but I don't want us to be pegged against MPI long-term.

ViralBShah commented 4 years ago

Not suggesting adding MPI to DistributedArrays. Suggesting that if you want a fast matmul, you should just use one of the MPI packages. Maybe we should document this, since there are some benefits to the generic matmul we have.

andreasnoack commented 4 years ago

It's wrapped here. I think it also ends up being used in https://github.com/JuliaParallel/Elemental.jl#truncated-svd, although most of the work there is GEMV, not GEMM.

ViralBShah commented 4 years ago

Can it override the * for darrays? Will people be up in arms about type piracy?

andreasnoack commented 4 years ago

I think it would be okay. We do similar things already in https://github.com/JuliaParallel/Elemental.jl/src/julia/darray.jl. However, the code for copying data to and from the Elemental arrays is not robust and needs some work.

haampie commented 4 years ago

I'm overriding mul! for T<:BlasFloat in COSMA.jl. I wouldn't say it's type piracy when it only adds a specialized method.
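
For readers following along, a rough illustration (not COSMA.jl's actual code) of what "only adds a specialized method" means; the body is just a placeholder that gathers to one process and redistributes:

using Distributed, DistributedArrays, LinearAlgebra
using LinearAlgebra: BlasFloat

# One extra method for the DArray/BlasFloat combination; generic fallbacks for all
# other element types stay untouched. Defining this outside the package that owns
# either * or DArray is what the type-piracy question in this thread is about.
function Base.:*(A::DArray{T,2}, B::DArray{T,2}) where {T<:BlasFloat}
    # A real backend (COSMA, Elemental, ...) would work on the distributed blocks
    # directly; gathering everything to the caller is only a placeholder.
    return distribute(collect(A) * collect(B))
end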

andreasnoack commented 4 years ago

> I wouldn't say it's type piracy when it only adds a specialized method

I think most people would still consider that type piracy, but also that it's one of the cases where type piracy makes sense.

ViralBShah commented 4 years ago

I agree. Let's do what it takes to make the user experience pleasant. This may be the first time in a long while that all the parallel packages are coming together.

haampie commented 4 years ago

Yes. The last thing to sort out is how to swap out libmpi for the jll packages. Neither Elemental nor COSMA currently works with a system MPI implementation. I haven't had time yet to see if the MPI.jl trick can be repeated in those packages. It probably has to happen in Elemental_jll.jl and COSMA_jll.jl. But then the situation might be slightly different, because in MPI.jl Julia searches for a libmpi, whereas in those jll packages the shared libraries depend on libmpi, and I don't know (yet) how the search paths can be changed in that case.

OK, 30 seconds of research later: I think we would need to change the search paths in these generated files: https://github.com/JuliaBinaryWrappers/Elemental_jll.jl/blob/master/src/wrappers/x86_64-apple-darwin14-cxx11.jl
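
Related, a quick way to check which MPI library a running Julia session actually ended up loading (the "mpi" substring match on the file name is just an assumption about how the library is named):

using Libdl

# List every loaded shared library whose file name mentions "mpi", e.g. to see
# whether a jll wrapper picked up the bundled MPICH or a system libmpi.
filter(p -> occursin("mpi", lowercase(basename(p))), dllist())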

ViralBShah commented 4 years ago

The libmpi builds are not interchangeable, so if you move away from what we do in BB you have to do a source build. Search paths may not be sufficient.

haampie commented 4 years ago

Yeah, I'm aware of that. But on Cray clusters MPICH is interchangeable with the system library, and that would be my use case. So even once Julia provides a virtual MPI package where the user can pick their favorite implementation, and BB supports all this, there still has to be a way to use a system MPI build to get optimal performance.