Open oymzysmwe224 opened 8 months ago
Hello, have you identified the reason for the poor SFT performance of the MoE? I encountered the same problem.
@xiaojiangzhang I have not found the reason yet. I am currently continuing the pretraining of the merged MoE model. My suspicion is that, right after merging, the model effectively follows a prompt-wise gating strategy, while training optimizes a token-wise gating strategy, so I am adding more data to help the model adapt to the token-wise strategy.
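To make the distinction concrete, below is a minimal PyTorch sketch of the two gating strategies I have in mind; it is only an illustration, not mergekit's or the merged model's actual gating code.

```python
import torch
import torch.nn.functional as F


def token_wise_top2_gate(hidden_states: torch.Tensor, gate_weight: torch.Tensor):
    """Token-wise routing: every token gets its own top-2 experts,
    so the expert mix can change from token to token within one prompt.

    hidden_states: (num_tokens, hidden_dim)
    gate_weight:   (num_experts, hidden_dim) router projection
    """
    logits = hidden_states @ gate_weight.t()            # (num_tokens, num_experts)
    weights, experts = F.softmax(logits, dim=-1).topk(2, dim=-1)
    return experts, weights / weights.sum(dim=-1, keepdim=True)


def prompt_wise_gate(hidden_states: torch.Tensor, gate_weight: torch.Tensor):
    """Prompt-wise routing: score the prompt once (mean-pooled here) and
    reuse the same top-2 experts for every token of that prompt."""
    pooled = hidden_states.mean(dim=0, keepdim=True)    # (1, hidden_dim)
    logits = pooled @ gate_weight.t()
    weights, experts = F.softmax(logits, dim=-1).topk(2, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)
    num_tokens = hidden_states.size(0)
    return experts.expand(num_tokens, -1), weights.expand(num_tokens, -1)
```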
I attempted to merge 4 Yi-34B models using the MoE branch of mergekit (with each token activating 2 experts). The four models are as follows; all of them are based on Yi34B-base, trained with different SFT data, and rank high on the Open LLM Leaderboard.
When I tested the merged model (called MoE-v1) directly, it performed better than Yi34B_sftv5_base_epoch2 on most benchmarks. However, when I tried to continue finetuning this MoE model to make it even stronger, the results were completely counterintuitive. I tried finetuning with ShareGPT data as well as with internal in-house data. The experimental setup was a learning rate of 1e-5 and 2 epochs of training, and I tried toggling the auxiliary loss for load balancing.
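Roughly, the setup looked like the sketch below; the model path, the aux-loss coefficient, and the data plumbing are placeholders rather than my exact values, and it assumes the merged checkpoint is exported in the Mixtral architecture (which is what mergekit's MoE merge produces), so the load-balancing loss is toggled through the router fields of the config.

```python
from transformers import AutoConfig, AutoModelForCausalLM, TrainingArguments

# Placeholder path to the merged model; in a Mixtral-style config the
# router / auxiliary-loss knobs are plain config fields.
config = AutoConfig.from_pretrained("path/to/MoE-v1")
config.output_router_logits = True   # emit router logits so the load-balancing aux loss is added
config.router_aux_loss_coef = 0.02   # placeholder value; 0.0 (or output_router_logits=False) disables it

model = AutoModelForCausalLM.from_pretrained(
    "path/to/MoE-v1", config=config, torch_dtype="auto"
)

# Hyperparameters from the experiments described above; everything else is left at defaults.
training_args = TrainingArguments(
    output_dir="moe_v1_sft",
    learning_rate=1e-5,
    num_train_epochs=2,
    bf16=True,
)
# The model is then trained with a standard SFT loop (e.g. transformers.Trainer
# or an SFT framework) on the ShareGPT / in-house data mentioned above.
```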
However, none of the models I obtained were as good as the original MoE-v1, and the more training data I used, the worse the performance became. There was a decline on almost all benchmarks, with only some improvement on HumanEval. In subjective testing, I found:
Below are some examples. I noticed a document mentioning that it is best to merge models with similar capabilities. I'm not sure if my problem is due to the significant differences between these models, which might have caused the multiple experts to interfere with each other during further finetuning. https://docs.google.com/document/d/1_vOftBnrk9NRk5h10UqrfJ5CDih9KBKL61yvrZtVWPE/edit?pli=1
I don't know whether you have had similar experiences or problems; I would appreciate any help and guidance. The YAML file I used for merging is as follows. I selected several training samples from the training set of each model.