JuliaParallel / PageRankBenchmark

Distributed overhead #5

andreasnoack opened this issue 8 years ago

andreasnoack commented 8 years ago

I've now run the MPI-based sample sort. All timings are on my laptop (so parallel computing works even there). Things seem to work fine for the MPI implementation. What I call "Parallel" is the DArray-based implementation.

[Two screenshots, 2016-07-22: laptop timing plots for the benchmark runs]

cc: @alanedelman

alanedelman commented 8 years ago

MPI doesn't seem much better than sequential.

poulson commented 8 years ago

How many MPI workers were there?

andreasnoack commented 8 years ago

Three

alanedelman commented 8 years ago

Would be great to have SuperCloud numbers as well.

amitmurthy commented 8 years ago

All-to-all communication overhead in the parallel sort hides any benefit derived from the parallelization. The major advantage of parallelization in this case is scalability: the ability to sort arrays that do not fit into one machine's RAM. It will be interesting to see timings for MPI and DArray at sizes of 10^10 elements and greater.

cc: @ViralBShah
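
For reference, here is a minimal serial sketch of the sample sort structure under discussion: sample the data, pick splitters, bucket every element, "exchange" the buckets (the all-to-all step whose cost dominates), and sort locally. The function and variable names are illustrative only, not taken from the benchmark code, and nbuckets stands in for the number of workers.

# Illustrative sample sort skeleton, run serially; the real implementations do
# the exchange step with an all-to-all (MPI Alltoallv / DArray communication).
function samplesort_sketch(v::Vector{Int}, nbuckets::Int; samplesize::Int=1024)
    # 1. Sample the input and pick nbuckets-1 splitters.
    sample = sort(v[rand(1:length(v), min(samplesize, length(v)))])
    idx = round.(Int, range(1, length(sample), length = nbuckets + 1))
    splitters = sample[idx][2:end-1]

    # 2. Bucket every element by splitter (this is the data that would be exchanged).
    buckets = [Int[] for _ in 1:nbuckets]
    for x in v
        push!(buckets[searchsortedfirst(splitters, x)], x)
    end

    # 3. "Exchange", sort each bucket locally, and concatenate.
    return reduce(vcat, sort!.(buckets))
end

issorted(samplesort_sketch(rand(Int, 10_000), 4))  # true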

ViralBShah commented 8 years ago

I guess still on laptop?

poulson commented 8 years ago

Scalable MPI sample sorts are not uncommon in the literature (and performance models based on Hockney's communication model explain why). All-to-all with Bruck's algorithm is a good idea if your main concern is the latency of your messaging system, as it trades O(p) messages for O(lg p) messages at the cost of a factor-of-lg p bandwidth penalty.
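
As a rough illustration (textbook leading-order costs, not numbers from this benchmark): under the Hockney model, where a message of m words costs \alpha + \beta m, a direct pairwise all-to-all and Bruck's algorithm for p processes, each exchanging n words in total, cost roughly

T_{\text{direct}} \approx (p-1)\,\alpha + \tfrac{p-1}{p}\,n\,\beta   (p-1 messages of size n/p each)
T_{\text{Bruck}}  \approx \lceil \log_2 p \rceil\,\alpha + \tfrac{n}{2}\,\lceil \log_2 p \rceil\,\beta   (\lceil \log_2 p \rceil rounds, about n/2 words per round)

which is the O(p) to O(lg p) message-count trade, at a lg p bandwidth penalty, mentioned above.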

andreasnoack commented 8 years ago

I've scaled up a bit. The following is from the C3DDB cluster. It's easier to get resources on that cluster but I'll prepare numbers for the TXE-1 aka the SuperCloud as well. I'm using 132 processes distributed across 11 nodes.

[Three screenshots, 2016-07-24: timing plots from the C3DDB runs with 132 processes on 11 nodes]

andreasnoack commented 8 years ago

I've added numbers for half the number of processes on the same number of nodes, to try to see the effect of p and not just n. I've also added extrapolated timings for a sequential (quick) sort.

[Three screenshots, 2016-07-24: timing plots including the halved process count and the extrapolated sequential sort]

There are many variables that could affect the time difference between the DArray and MPI timings. To mention a few (I might have misunderstood some of this, so please correct me where I'm wrong):

I don't really know the relative importance of each of these points. I can turn off the shared memory transport for MPI, and it definitely has an effect (about 25% for 66*2^24 elements), but I haven't yet found a way to, e.g., allow IPoIB but not RDMA (both are enabled with openib). @poulson, do you know how to turn off RDMA? Do you know if the Open MPI collectives use RDMA by default on InfiniBand? We can try to improve the DArray sample sort algorithm, but I don't think that is the bottleneck.

amitmurthy commented 8 years ago

Great job on the test runs!

Some observations:

Regarding "large buffers are necessarily copied into the serialized send buffer before libuv takes over":

This has been optimized for buffers > 64K, in which case the bits-type arrays are exposed directly to libuv, i.e. there is no intermediate copying for either read or write.

One possible minor optimization is to call sort with a precomputed sample, as in sort(d; sample=rand(1024)). This avoids the initial sample collection step and can definitely be used for this benchmark, where a local run of KronGraph500NoPerm can be used to generate a sample.
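
A minimal sketch of what that might look like, following the sort(d; sample=...) form described above. Support for the sample keyword depends on the DistributedArrays version, and the drand/rand calls here are stand-ins for the benchmark's distributed data and a locally generated Kronecker sample.

using Distributed                      # addprocs is in Base on Julia 0.4/0.5
addprocs(4)
@everywhere using DistributedArrays

d = drand(10^6)                        # stand-in for the distributed benchmark data
local_sample = rand(1024)              # stand-in for a sample from a local KronGraph500NoPerm run
sorted = sort(d; sample=local_sample)  # skips the initial sample collection step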

amitmurthy commented 8 years ago

Also, distributed GC should not affect the timings in this case, since those messages would have been transported after the timings for the sort have been recorded.

jakebolewski commented 8 years ago

Since this is showing up in my inbox, I can add a few things.

If I remember correctly, you can control the transport protocol Open MPI uses through the btl (byte transfer layer) MCA setting (e.g. --mca btl tcp,self); you can force all communication over TCP with it. The InfiniBand protocol is different from TCP (and should not be confused with RDMA, see below). The other thing to try would be MPICH, which by default does not support InfiniBand. As you noted, Open MPI will by default do message passing through shared memory segments between processes on the same node, which can be much faster than going through the socket layer.
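
For concreteness, the kind of invocation this refers to would look something like the following (BTL component names such as sm vary across Open MPI versions, and driver.jl is a placeholder for the benchmark script):

mpirun --mca btl tcp,self -np 4 julia driver.jl     # force TCP, no InfiniBand or shared memory
mpirun --mca btl ^sm -np 4 julia driver.jl          # keep other transports, disable shared memory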

First, I think there is some confusion about RDMA. MPI's two-sided message-passing model does not map well to RDMA operations, which is why MPI includes one-sided messaging in the standard. I haven't looked at what you are doing, but I doubt you are using that in the benchmark, so with conventional MPI commands (Alltoall, Send, Recv) you are most likely not using RDMA.

Since all data is sent as a serialized closure, there are two memory copies for arrays that wouldn't have to happen for intra-node processes.

Since this is a BSP algorithm, I think it should be easy to profile each superstep in both the MPI and the Julia implementation. That should be a good way to get at the overhead and model the performance differences.
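
One low-tech way to do that on the Julia side, sketched here with stand-in phases rather than the benchmark's actual functions: time each superstep on every worker and report the maximum over workers, since the slowest rank determines the superstep time.

using Distributed   # addprocs/remotecall_fetch are in Base on Julia 0.4/0.5
addprocs(3)

@everywhere function timed_supersteps(n)
    t = Dict{Symbol,Float64}()
    local v
    t[:generate]  = @elapsed v = rand(n)   # stand-in for local data generation
    t[:localsort] = @elapsed sort!(v)      # stand-in for the local sort phase
    # ...the bucketing and all-to-all supersteps would be timed the same way...
    return t
end

per_worker = [remotecall_fetch(timed_supersteps, w, 10^6) for w in workers()]
for phase in (:generate, :localsort)
    println(phase, ": ", maximum(t[phase] for t in per_worker), " s (max over workers)")
end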

andreasnoack commented 8 years ago

Thanks for the comments. I've already been playing a bit with the btl setting; that was how I disabled the shared memory transport. I also tried tcp, but that gave an error, so I assumed the communication is over IPoIB, though I wondered whether MPI also uses other kinds of protocols over InfiniBand. Partly motivated by Google hits like http://dl.acm.org/citation.cfm?id=2099325, I suggested RDMA. My implementation is just Gather, Bcast, Allgather, and Alltoallv, but I've also just pushed it to the repo.

The last paragraph requires some elaboration before I can act on it.

amitmurthy commented 8 years ago

Regarding "since all data is sent as a serialized closure, there are two memory copies for arrays that wouldn't have to happen for intra-node processes":

In Julia this is being optimized here - https://github.com/JuliaLang/julia/blob/da9c28e22f6b9f602110077f7b664febf5b807c9/base/stream.jl#L830

The closure is serialized directly to the TCP socket, and array data is either serialized to an internal buffer associated with the socket (if its size is < 64K) or, for larger arrays, the array pointer is passed directly to libuv.

Same goes for read.

multi.jl used to have an IOBuffer for serializing the entire request before writing to the socket. That was removed quite a while back; a buffer is now associated with each socket in stream.jl, which also optimizes large bits-type array writes and reads.
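
To illustrate the cost of the double buffering being discussed (a purely local illustration, not the remote-call path itself): staging a large array through an extra IOBuffer pays for one more copy of the payload than writing it straight to the destination buffer.

a = rand(UInt8, 10^8)
io = IOBuffer()              # stand-in for the socket's internal buffer
@time write(io, a)           # direct write: one copy of the payload

staging = IOBuffer()
@time begin                  # extra staging buffer: two copies of the payload
    write(staging, a)
    write(io, take!(staging))
end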

andreasnoack commented 8 years ago

I've posted the code I showed you here and the timings are here

@amitmurthy When running with MPI transport but using remotecalls, there were a lot of allocations. There might be something inefficient going on in the cman code. It would be great if you could take a look.

Regarding running the code on the SuperCloud, I've created this gist with a little program that reads the machine file and adds the processes (a sketch of such a program follows the qsub examples below). To get the machine file, you'll have to use the parallel environment option in qsub as well as the binary option, e.g.

qsub -b y -pe mpi_1 4 julia '-L machine.jl'

to get four nodes with a single process on each. You can also ask for

qsub -b y -pe mpi_fu 16 julia '-L machine.jl'

to get 16 processes where SGE will fill up a node at a time.
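
For reference, a rough sketch of the kind of program such a machine.jl could contain, assuming the standard SGE PE_HOSTFILE format of one "host nslots queue processors" line per node (the actual gist may differ):

# Hypothetical machine.jl sketch: read the SGE parallel-environment host file
# and add one worker per allocated slot. (On Julia 0.4/0.5 addprocs lives in
# Base, so the `using Distributed` line is only needed on current Julia.)
using Distributed

machines = Tuple{String,Int}[]
for line in readlines(ENV["PE_HOSTFILE"])
    fields = split(line)
    push!(machines, (String(fields[1]), parse(Int, fields[2])))  # (hostname, slot count)
end
addprocs(machines)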

Finally, I've received four replies from various people at the SuperCloud and in short

amitmurthy commented 8 years ago

I have some ideas on how to avoid the double buffering when using MPI transport. It requires some refactoring in multi.jl. Will test it out.