Open casper-hansen opened 11 months ago
Example of parallelization. I measured a 2.88% speedup, though that may just be noise. For better parallelization and to avoid CPU synchronization:
```python
# Split the tokens by expert, fork one asynchronous task per expert,
# then wait on all of them and scatter the results back into y.
x_indices = [x[flat_expert_indices == i] for i in range(len(self.experts))]
futures = [torch.jit.fork(expert, x_i) for expert, x_i in zip(self.experts, x_indices)]
outputs = [torch.jit.wait(fut) for fut in futures]
# Assign the outputs to y in the correct sequence
for i, output in enumerate(outputs):
    y[flat_expert_indices == i] = output
```
One interesting part of MegaBlocks is how the Linear weights are initialized. They allocate an additional experts_per_rank dimension, which can be used to run multiple experts fully in parallel at the same time.
https://github.com/stanford-futuredata/megablocks/blob/main/megablocks/layers/mlp.py#L80-L85
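For illustration, here is a minimal sketch of that idea, not MegaBlocks' actual code: allocate all local experts' weights in a single tensor with an experts_per_rank leading dimension so every expert is computed in one batched matmul instead of a Python loop. The names (BatchedExpertMLP, ffn_hidden_size) and the grouped input layout are assumptions for the sketch:

```python
import torch
import torch.nn as nn


class BatchedExpertMLP(nn.Module):
    """Hypothetical sketch: hold the weights of all experts on this rank in one
    tensor (experts_per_rank as the leading dim) so the experts can be computed
    with a single batched matmul rather than a loop over expert modules."""

    def __init__(self, experts_per_rank: int, hidden_size: int, ffn_hidden_size: int):
        super().__init__()
        self.w1 = nn.Parameter(torch.empty(experts_per_rank, hidden_size, ffn_hidden_size))
        self.w2 = nn.Parameter(torch.empty(experts_per_rank, ffn_hidden_size, hidden_size))
        nn.init.normal_(self.w1, std=0.02)
        nn.init.normal_(self.w2, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [experts_per_rank, tokens_per_expert, hidden_size]
        h = torch.bmm(x, self.w1)                 # one batched matmul covers every expert
        h = torch.nn.functional.gelu(h)
        return torch.bmm(h, self.w2)              # [experts_per_rank, tokens_per_expert, hidden_size]
```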
🔖 Feature description
A core part of training a Mixture of Experts model is not losing the parallelism we already rely on in dense models. The current implementation is a naive and simple one that does not make use of the SOTA methods present in MegaBlocks. MegaBlocks dMoEs use a reformulation of MoEs in terms of block-sparse operations, which allows us to avoid token dropping without sacrificing hardware efficiency.
More details can be found in the MegaBlocks paper.
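To make the token-dropping point concrete, here is a toy sketch (not MegaBlocks code) contrasting capacity-based dispatch, where tokens routed past a fixed per-expert capacity are simply dropped, with a dropless grouping that keeps every token; MegaBlocks then handles the resulting variable-sized expert batches with block-sparse matmuls. All names here are hypothetical:

```python
import torch


def capacity_dispatch(expert_indices: torch.Tensor, num_experts: int, capacity: int):
    """Capacity-factor routing: tokens routed beyond the per-expert capacity are dropped."""
    kept = []
    counts = torch.zeros(num_experts, dtype=torch.long)
    for tok, e in enumerate(expert_indices.tolist()):
        if counts[e] < capacity:
            counts[e] += 1
            kept.append(tok)
    return kept  # every token not in `kept` is dropped


def dropless_groups(expert_indices: torch.Tensor, num_experts: int):
    """Dropless alternative: group every token by expert with no capacity cap."""
    return [torch.nonzero(expert_indices == e).flatten() for e in range(num_experts)]


expert_indices = torch.tensor([0, 0, 0, 1, 2, 2])        # 6 tokens routed to 3 experts
print(capacity_dispatch(expert_indices, 3, capacity=2))  # [0, 1, 3, 4, 5] -> token 2 is dropped
print(dropless_groups(expert_indices, 3))                # all tokens kept, grouped per expert
```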
✔️ Solution
We should look into using dMoE (dropless MoE), which is the efficient core of MegaBlocks.
I believe the correct solution is to do the following (a sketch of the router split follows below):
- Split the gate (router) into a separate class using LearnedRouter.
- Use sparse_permute_and_compute from MegaBlocks for the expert computation.
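As a starting point, here is a minimal sketch of what splitting the gate out into its own router class could look like. It is a simplified stand-in for MegaBlocks' LearnedRouter, not its actual implementation; top_k, the softmax placement, and the return values are assumptions:

```python
import torch
import torch.nn as nn


class SimpleLearnedRouter(nn.Module):
    """Simplified stand-in for a MegaBlocks-style LearnedRouter: a linear layer
    scores each token against every expert and returns the top-k assignments."""

    def __init__(self, hidden_size: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.layer = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: [num_tokens, hidden_size]
        scores = self.layer(x).softmax(dim=-1)                          # routing probabilities
        expert_weights, expert_indices = scores.topk(self.top_k, dim=-1)
        return scores, expert_weights, expert_indices
```

Keeping routing in its own module would let the expert computation be swapped out (naive loop, batched matmul, or MegaBlocks' block-sparse path) without touching the gate.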
❓ Alternatives
No response
📝 Additional Context
No response