HigherOrderCO / Bend

A massively parallel, high-level programming language
https://higherorderco.com
Apache License 2.0

Request: MPI support #357

Open maxboone opened 4 months ago

maxboone commented 4 months ago

Is your feature request related to a problem? Please describe. On high performance computing clusters, MPI is often used as an interface between multiple machines. Adding support for MPI in Bend would allow code to be scheduled on multiple machines and utilize more parallelism.

Describe the solution you'd like Figure out where to implement MPI (must this be done in HVM2 first?) and add support to Bend.

Describe alternatives you've considered N/A

Additional context N/A

maxboone commented 4 months ago

I'd love to pick this up over the summer, but would first like to seek opinions and see if this is even a feasible thing to do.

iljah commented 4 months ago

MPI would be nice. Perhaps allowing MPI functions to be called from Bend would be the best first step?

developedby commented 4 months ago

At the moment, each instance of the HVM runtime is self-contained. Parallelizing across multiple devices seems like it would be very hard to do without losing too much performance, but I don't have experience with HPC, so I could be wrong. It definitely won't be happening with hvm32, since it wouldn't even be able to use all the resources due to its inherent size limitations, but this is an interesting direction to explore once we have hvm64.

maxboone commented 4 months ago

At the moment, each instance of the HVM runtime is self-contained.

Generally, with MPI you run each instance self-contained as well (i.e. there is no magic shared memory or naive distributed storage), but you add functions to pass messages between instances through a communicator (which can be hardware accelerated).

For example, grabbing some of the simpler code from something I worked on last year:

    // Each worker computes the averages for its slice of the clusters
    // (data, sums, size, workers and worker_rank come from the surrounding code)
    auto local = std::vector<float>(size, 0.0f);

    // Split the clusters over the workers
    uint32_t local_size = size / workers;
    uint32_t local_rem = size % workers;
    uint32_t start = local_size * worker_rank;
    uint32_t end = start + local_size + (worker_rank == 0 ? local_rem : 0);

    // Calculate our slice
    for (uint32_t iter = start; iter < end; iter++) {
        local[iter] = data[iter] / sums[iter];
    }

    // Merge the per-worker averages into the final result
    auto final = std::vector<float>(size, 0.0f);
    MPI_Allreduce(local.data(), final.data(), static_cast<int>(final.size()),
                  MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    return final;

Here you just split a long iteration into slices and distribute those (as they can be done independently) among workers. I presume something similar is already done with the fold logic in HVM, but here we'd also allow sending work to separate worker processes instead of just threads.

The largest overhead here would probably be on GPU-accelerated execution, as naively you'd end up doing a lot of copying between GPU and CPU memory to send data to (and receive it from) the communicator.

I'll keep notifications on, and if this is something to be added to hvm64 I can take a shot at it.

iljah commented 4 months ago

I also suggest keeping in mind Conway's-Game-of-Life-type workloads, where data has to be constantly transferred between (MPI) processes that don't share memory with each other. The local process has to send copies of (some of) its cells to other processes so that they can calculate the next iteration of their cells, and they have to send copies of their cells to the local process so that it can calculate the next iteration of its cells, repeated N times.
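
For instance, a minimal 1D halo exchange in plain MPI might look like the sketch below (illustrative only; the names and the one-ghost-cell-per-side layout are assumptions, not taken from any real code base):

    // Toy halo exchange: each rank owns cells[1..n] plus one ghost cell on
    // each side, and swaps boundary cells with its neighbours every iteration.
    #include <mpi.h>
    #include <vector>

    void exchange_halos(std::vector<float>& cells, int rank, int ranks) {
        int left  = (rank - 1 + ranks) % ranks;
        int right = (rank + 1) % ranks;
        int n = static_cast<int>(cells.size()) - 2; // number of interior cells

        // Send our rightmost interior cell to the right neighbour and
        // receive the left neighbour's rightmost cell into our left ghost cell.
        MPI_Sendrecv(&cells[n], 1, MPI_FLOAT, right, 0,
                     &cells[0], 1, MPI_FLOAT, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // And the same in the other direction.
        MPI_Sendrecv(&cells[1], 1, MPI_FLOAT, left, 1,
                     &cells[n + 1], 1, MPI_FLOAT, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }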

Supporting explicit calls to MPI functions is important for these workloads, because balancing computation and communication is a science unto itself (one library that helps with this is sandialabs.github.io/Zoltan/Zoltan_phil.html, for example), so trying to do that automatically most likely wouldn't work.

maxfierrog commented 4 months ago

I don't think this is trivial, because there is a fundamental difference between multicore shared-memory parallelism and distributed parallelism as allowed by MPI.

To begin with, I'm assuming you mean adding multiprocess capabilities to the HVM evaluator via an MPI implementation (and not just MPI bindings to bend-lang). I just took a look at this myself (so I apologize if I say anything non-factual), but it looks like parallel rewrites are performed without locks (using a fairly straightforward algorithm facilitated by atomic instructions).

I am willing to bet a good sum of money that this is key to getting reasonable performance out of this in cases where the combinator graph is "highly interconnected" (high edge/vertex ratio, high average degree, a lot of cliques), as the cost of locking would be particularly severe here.

That being said, it would be good to have some kind of (potentially implicit) API to partition the combinator graph into large enough, sufficiently independent subgraphs (see "Cheeger constant" and "conductance" for related formalisms; a toy sketch of the latter follows at the end of this comment) as a mechanism to control the overhead of locking. Then, it would be possible to distribute these partitions to different MPI nodes, allowing them to alter contested combinators through distributed locking and all other combinators as usual (through atomic intrinsics).

For example, we could define this "API" to be "functions with an underscore in front of them are assumed to be very computationally intensive." (Or at least, intensive enough for the overhead of distribution to be worth it). Then, something like this:

_sum_big(_matmul_big(A, B), _matmul_big(C, D))

...could be performed on two different nodes in parallel. Of course, I have not even visited the question of the overhead of moving data -- sometimes the computation will not be "complex", but it will still not be worth doing on another node because of this cost (e.g., is _matmul_big costly because the matrices are big, or because it is an expensive algorithm?).
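
To make the data-movement question concrete, the kind of back-of-the-envelope check I have in mind could look like the sketch below (all names and constants are made up for illustration; nothing like this exists in the HVM today):

    // Illustrative heuristic: only offload a subterm to another node when the
    // estimated compute time clearly dominates the cost of shipping its data
    // over the interconnect and back.
    bool worth_offloading(double bytes_to_move, double estimated_flops,
                          double node_flops_per_sec, double link_bytes_per_sec,
                          double link_latency_sec) {
        double transfer_time = 2.0 * (link_latency_sec + bytes_to_move / link_bytes_per_sec);
        double compute_time  = estimated_flops / node_flops_per_sec;
        return compute_time > 10.0 * transfer_time;  // demand a healthy margin
    }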

There are also other more specific yet just as important questions, like whether or not RDMA is present. All of this complicates things a lot, and probably requires theoretical groundwork before spending any time on an implementation.
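
For what it's worth, the partition-quality notion I mentioned above is cheap to state even if it is expensive to optimize. A toy sketch (purely illustrative, nothing HVM-specific) of computing the conductance of a partition over an adjacency list:

    // Conductance of a vertex set S in an undirected graph:
    //   phi(S) = cut(S, V \ S) / min(vol(S), vol(V \ S))
    // Lower conductance means S is more independent from the rest of the graph.
    #include <cstddef>
    #include <vector>
    #include <algorithm>

    double conductance(const std::vector<std::vector<int>>& adj,
                       const std::vector<bool>& in_s) {
        double cut = 0.0, vol_s = 0.0, vol_rest = 0.0;
        for (std::size_t u = 0; u < adj.size(); ++u) {
            for (int v : adj[u]) {
                (in_s[u] ? vol_s : vol_rest) += 1.0;  // degree contribution of u
                if (in_s[u] != in_s[static_cast<std::size_t>(v)])
                    cut += 0.5;  // each cut edge is visited from both ends
            }
        }
        return cut / std::min(vol_s, vol_rest);  // assumes both sides are non-empty
    }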

iljah commented 4 months ago

adding multiprocess capabilities

No, just allowing C functions to be called from Bend would be enough, with MPI functions being the most interesting ones in this case.

via an MPI implementation

Several good MPI implementations already exist; it's probably not useful / worth it to copy & paste one of those into Bend.

API to partition the combinator graph

This can be performed by existing libraries

_matmul_big

This can also be performed by existing libraries

question of the overhead of moving data

Libraries for optimizing this also exist

complicates things a lot

It sure does, hence my suggestion to leave everything to existing libraries/frameworks/programs and just allow C (MPI) functions to be called from bend...
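
As a rough illustration, and assuming Bend eventually grows a C FFI along the lines of #350 (it doesn't exist yet, and none of these names are real), the C side of such a binding could be as thin as:

    // Hypothetical helper that a future Bend FFI could call into:
    // sum a flat float buffer across all MPI ranks, in place.
    #include <mpi.h>

    extern "C" void bend_allreduce_sum(float* buf, int count) {
        MPI_Allreduce(MPI_IN_PLACE, buf, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    }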

maxboone commented 4 months ago

Good write-up @maxfierrog and good points!

My main request would be to first add support for MPI, so that Bend scripts can easily be adapted to run on large clusters, and then to see whether that can be leveraged in patterns that make using multiple GPUs or large clusters more approachable for developers.

I like the idea of creating interfaces / patterns to split up the graph for distributed computing. However, that would probably require some cost function over compute and communication to decide whether a split is worth it. Besides MPI, such an approach could also be interesting for running on multiple GPUs in the same system (see https://github.com/HigherOrderCO/Bend/issues/358).

Especially because I feel we switch to C++ quickly for HPC tasks, since we need CUDA or OpenMP and it's easier than trying to hack that into Python.

maxfierrog commented 4 months ago

@iljah

I think that the whole point of Bend / the HVM is to automate parallelism. As such (and as you mentioned), there would be nothing special about adding MPI bindings -- there is already an issue for an FFI (see #350).

I obviously think that FFI support will be an important component for Bend if it ever wants to make it to production, as the classic optimization for hot code will always be to write it in C and compile it with a SOTA compiler. I am not refuting the value of that. (Not to mention access to tried and tested libraries in the first place).

For this reason, I think this conversation should be about getting MPI to be automatically used by the HVM in the process of rewriting the interaction net, and not about adding MPI bindings. (I swear, if Bend gets complete MPI bindings before Rust, I will be an unhappy individual).

However, the "existing libraries" you quote (which would go into the HVM and analyze the combinator graph) are nowhere near performant enough to be used there. I'm not saying they're not well implemented; I'm saying that the algorithms they execute are way too costly to run every time there is a rewrite of an interaction combinator. I am not saying there is no fast enough algorithm to do this in the specific case of a combinator net -- but that is something we would need a peer-reviewed paper on to build anything meaningful on top of.

This doesn't mean it's over though, in particular because this is a runtime we are talking about (so it is possible to keep statistics about it and make very good estimations). See below for more.

@maxboone

Thanks Max. As I mentioned in my reply to @iljah, I agree -- it would be most utilitarian to first have an FFI. And right, something like this would be useful for other kinds of hardware offloading in general.

I think there is a lot of work to reference (in particular in the JVM's JIT compiler) when it comes to cost estimation. And yes, switching to C++ would probably be the most direct path for the HVM, since that is where this would happen (not the bend-lang compiler).

Regardless of that, to reiterate -- I think the main attraction of this is automatically invoking different forms of hardware offloading when necessary. An FFI should come first (actually, I/O should come even before that), but implementing generic hardware offloading would be a killer feature. Imagine autoscaling the map-reduce example on a distributed MapReduce backend, or having OpenMPI automatically distribute ML workloads across different nodes, with all of these decisions made automatically by the runtime (with a lot of hints from the developer, probably).
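
As a very rough sketch of what "keep statistics and make estimations" could mean in practice (hypothetical names, not an HVM API), the runtime could track a moving average of the observed cost per named function and only consider offloading calls whose estimate crosses a threshold:

    // Illustrative only: per-function cost tracking with an exponential
    // moving average, so a runtime could learn which calls are heavy enough
    // to be worth shipping to another node.
    #include <string>
    #include <unordered_map>

    class CostProfile {
        std::unordered_map<std::string, double> avg_cost_;  // e.g. rewrites per call
        double alpha_ = 0.1;                                 // smoothing factor
    public:
        void record(const std::string& fn, double cost) {
            auto [it, inserted] = avg_cost_.try_emplace(fn, cost);
            if (!inserted) it->second = alpha_ * cost + (1.0 - alpha_) * it->second;
        }
        bool heavy(const std::string& fn, double threshold) const {
            auto it = avg_cost_.find(fn);
            return it != avg_cost_.end() && it->second > threshold;
        }
    };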

iljah commented 4 months ago

MPI to be automatically used by the HVM

Ambitious goal for sure. There are probably plenty of MPI users whose simulations cannot physically fit into 1, 10, 100, etc. nodes, so even initializing such a simulation automatically would be challenging...

iljah commented 4 months ago

generic hardware offloading [...] OpenMPI automatically distribute ML workloads

Also keep in mind that supercomputer vendors probably heavily modify their MPI implementation(s) to fit their specific hardware, so providing your own MPI, OpenMP, gcc, etc. might not be useful.

JakeQueiser commented 4 months ago

@iljah I am an MPI developer, and I can assure you that MPI runtimes are tuned specifically to each cluster. Much of that tuning framework already exists in MPICH, which Bend would fork off of if it were to create its own MPI. Either way, the solution Bend will need to implement is wrapping the bindings; I don't think you'd see any advantage from bypassing the standard MPI APIs.

maxfierrog commented 4 months ago

As @JakeDimes mentions, this would be a matter of calling an existing MPI implementation from the HVM on an ad hoc basis, @iljah, so there would be no need to implement MPI for Bend specifically.

Again, this is a very ambitious project, with many prerequisites that should be prioritized first. I would argue that the HVM needs to gain a lot more traction (and research) for this form of offloading to ever make it to a stable release (well, after the project itself is stabilized to begin with). Some experiments and a proof of concept would be good contributions though, IMO. Having said that, it's a significant amount of specialized work.