Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

Is it true that sparse training of Mixtral-8x7B is supported? #143

Open young-chao opened 6 months ago

young-chao commented 6 months ago

I have tried different versions of PyTorch and Triton, but each combination hit a different bug that prevented training with the sparse implementation of Mixtral-8x7B. I would like to know whether you actually tested sparse Mixtral training before documenting it as supported. This matters a lot for getting started quickly; please don't drain the enthusiasm of researchers and engineers.

bao-xiaoyi commented 6 months ago

> (quoting the original report above, in Chinese)

Has this issue been resolved?

ChrisLiu6 commented 5 months ago

Sorry for the late reply. I was kind of swamped in the last few weeks. 😭

To be clear, we have indeed implemented and tested the training pipeline with the sparse implementation of Mixtral; for example, we used it to train our SPHINX-MoE model.

We have noticed that torch 2.0.1 built for CUDA 11.7 conflicts with megablocks over the required triton version, whereas torch 2.0.1 built for CUDA 11.8 should work smoothly. If you still run into other problems, please let us know.
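If it helps, here is a minimal sanity-check sketch (not part of the repo) for confirming that the installed torch, triton, and megablocks builds match the combination described above; the package names and expected versions are assumptions based on a standard pip environment.

```python
# Minimal environment sanity check (a sketch; assumes torch, triton, and
# megablocks are installed via pip in the same environment).
from importlib.metadata import version, PackageNotFoundError

import torch

print("torch          :", torch.__version__)      # expected: 2.0.1
print("built for CUDA :", torch.version.cuda)     # expected: 11.8
print("CUDA available :", torch.cuda.is_available())

for pkg in ("triton", "megablocks"):
    try:
        print(f"{pkg:<15}:", version(pkg))
    except PackageNotFoundError:
        print(f"{pkg:<15}: not installed")

# Importing megablocks fails fast if its triton requirement conflicts
# with the triton build pulled in by this torch wheel.
try:
    import megablocks  # noqa: F401
    print("megablocks import: OK")
except Exception as exc:
    print("megablocks import failed:", exc)
```

If the import fails with a triton-related error, that usually points to the version conflict mentioned above rather than a bug in the training code itself.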