vchuravy opened 5 years ago
julia> versioninfo()
Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin14.5.0)
CPU: Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
julia> for (name, trial) in sort(collect(results["Array"]), by=x->time(x[2]))
t = time(trial) / 1e6
println(rpad(name, 25, "."), lpad(string(round(t, digits=2), " ms"), 20, "."))
end
32.....................................0.0 ms
64....................................0.02 ms
128...................................0.04 ms
256...................................0.19 ms
512....................................1.4 ms
1024...................................9.3 ms
2048.................................71.92 ms
4096................................600.13 ms
8192...............................5114.26 ms
julia> for (name, trial) in sort(collect(results["distribute"]), by=x->time(x[2]))
t = time(trial) / 1e6
println(rpad(name, 25, "."), lpad(string(round(t, digits=2), " ms"), 20, "."))
end
32....................................1.46 ms
64....................................1.48 ms
128...................................1.73 ms
256...................................2.36 ms
512...................................5.35 ms
1024.................................24.76 ms
2048................................134.15 ms
4096...............................1304.26 ms
8192...............................9542.33 ms
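For anyone who wants to gather numbers on another system: the exact benchmark script isn't shown above, but a minimal sketch of how such a `results` suite could be built with BenchmarkTools and DistributedArrays (group names and sizes chosen to match the output above; this is a reconstruction, not the original script) might look like this:

```julia
using Distributed
addprocs(4)                     # the laptop run above used 4 Julia processes
@everywhere using DistributedArrays
using BenchmarkTools

# Time A*A for a plain Array and for the corresponding DArray created
# with distribute, at each problem size.
results = BenchmarkGroup()
results["Array"] = BenchmarkGroup()
results["distribute"] = BenchmarkGroup()
for n in (32, 64, 128, 256, 512, 1024, 2048, 4096, 8192)
    A = rand(n, n)
    DA = distribute(A)
    results["Array"][string(n)] = @benchmarkable $A * $A
    results["distribute"][string(n)] = @benchmarkable $DA * $DA
end
tune!(results)
results = run(results)
```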
Does Elemental provide a matmul? Does someone know if we use it in Elemental.jl? Should we remove the slow matmul we have in here, and let other packages provide it for now?
This package by @haampie might come in handy: COSMA.jl.
Yes. It has a BLAS-like interface https://github.com/elemental/Elemental/blob/master/include/El/blas_like/level3.h
> Should we remove the slow matmul we have in here, and let other packages provide it for now?

It comes at a price though since it effectively makes the package dependent on MPI. It will also restrict the matmul to a few element types.
Can we use Elemental's matmul today (as in, it is already wrapped), or do we still need to do the wrapping?
The MPI dependency is practically unavoidable. I'd say matmul already falls into the linear algebra camp, and at that point avoiding MPI is not straightforward. Also, we now have MPI and all the MPI libraries in BinaryBuilder, and it is all much easier to get working out of the box.
I don't think we should add an MPI dependency to DistributedArrays. It would make the package much harder to untangle, and it isn't necessary from a technical point of view. From a practical standpoint Elemental gives a better user experience right now, but I don't want us to be pegged to MPI long-term.
I'm not suggesting adding MPI to DistributedArrays, just that if you want fast matmul, you should use one of the MPI packages. Maybe we should document this, since there are some benefits to the generic matmul we have.
It's wrapped here. I think it also ends up being used in https://github.com/JuliaParallel/Elemental.jl#truncated-svd although most of the work there is GEMV not GEMM.
Can it override `*` for DArrays? Will people be up in arms about type piracy?
I think it would be okay. We do similar things already in https://github.com/JuliaParallel/Elemental.jl/src/julia/darray.jl. However, the code for copying data to and from the Elemental arrays is not robust and needs some work.
I'm overriding `mul!` for `T<:BlasFloat` in COSMA.jl. I wouldn't say it's type piracy when it only adds a specialized method
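For illustration, a minimal sketch of the shape of such a method (the `cosma_gemm!` backend here is a hypothetical stand-in; see COSMA.jl for the real implementation):

```julia
using LinearAlgebra
using LinearAlgebra: BlasFloat
using DistributedArrays

# Hypothetical stand-in for a fast distributed GEMM backend; a real
# package would call into its own library here.
cosma_gemm!(C, A, B) = error("backend not implemented in this sketch")

# The defining package owns neither mul! nor DArray, so this is technically
# type piracy, but it only adds a more specialized method for BLAS-compatible
# element types; generic element types still hit DistributedArrays' fallback.
function LinearAlgebra.mul!(C::DArray{T,2}, A::DArray{T,2}, B::DArray{T,2}) where {T<:BlasFloat}
    cosma_gemm!(C, A, B)
    return C
end
```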
> I wouldn't say it's type piracy when it only adds a specialized method

I think most people would consider that type piracy anyway, but also that it's one of the cases where type piracy makes sense.
I agree. Let's do what it takes to make the user experience pleasant. This may be the first time in a long while that all the parallel packages are coming together.
Yes. The last thing to sort out is how to swap out `libmpi` for the jll packages. Neither Elemental nor COSMA currently works with a system MPI implementation. I haven't had time yet to see if the MPI.jl trick can be repeated in those packages. It probably has to happen in Elemental_jll.jl and COSMA_jll.jl, but then the situation might be slightly different: in MPI.jl, Julia searches for a libmpi, whereas in those jll packages the shared libs depend on libmpi, and I don't know (yet) how the search paths can be changed in that case.
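To make the difference concrete, a rough sketch of the two situations (the RTLD_GLOBAL preload is only a conceivable workaround, not something the jll packages support today):

```julia
using Libdl

# MPI.jl's approach: Julia itself locates a system libmpi at runtime.
const libmpi = Libdl.find_library(["libmpi", "libmpich", "msmpi"])

# A jll shared library instead records a dependency on the
# BinaryBuilder-provided libmpi. One conceivable workaround is to dlopen the
# system libmpi with RTLD_GLOBAL before the jll library loads, so that its
# undefined MPI symbols resolve against the system copy. This assumes ABI
# compatibility, e.g. an MPICH-compatible library on a Cray system.
Libdl.dlopen(libmpi, RTLD_GLOBAL | RTLD_LAZY)
```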
OK, 30s of research later: I would think we need to change the search paths in these generated files: https://github.com/JuliaBinaryWrappers/Elemental_jll.jl/blob/master/src/wrappers/x86_64-apple-darwin14-cxx11.jl
The libmpi builds are not interchangeable, so if you move away from what we do in BB, you have to do a source build. Search paths may not be sufficient.
Yeah, I'm aware of that. But for Cray clusters MPICH is interchangeable with the system library, and that would be my use case. So even when Julia provides a virtual MPI package where the user can pick their favorite implementation, and BB supports all this, there still has to be a way to use a system MPI to get optimal performance.
While looking with @yingboma into getting a PDE solved just by using DArray, we found that matrix-matrix multiply is quite slow for DArray.
From discussion with @andreasnoack:

- Distributed currently has no `fetch!`, i.e. a fetch into a local array, so it is hard to avoid temporaries when working across processes. This causes many copies and extra GC work, which creates communication bottlenecks.
- Our communication layer doesn't support RDMA, so there are copies happening in the network stack, and we use sockets instead of shared memory for communication on the same node.
- There are some communication bottlenecks due to how we use the event loop, and it is feasible to get into a situation where forward progress is hard to make because a machine is busy with computation and not communicating in a timely fashion.
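To illustrate the first point, a small sketch (the in-place `fetch!` shown in the comment is hypothetical; Distributed provides no such API):

```julia
using Distributed
addprocs(1)

# Today: every fetch allocates a fresh array on the calling process,
# which later has to be garbage collected.
a = remotecall_fetch(() -> rand(1000), 2)

# A hypothetical in-place variant would reuse a preallocated buffer instead:
#
#     buf = Vector{Float64}(undef, 1000)
#     fetch!(buf, remotecall(() -> rand(1000), 2))   # not available today
```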
I would be interested in gathering numbers from different systems here. My first set of results is from my local laptop with 2 cores / 4 threads, using 4 Julia processes.
There is a lot of overhead for smallish problems, but the results aren't that bad once we get to interesting problem sizes...