LaVieEnRose365 opened 4 months ago
Thanks for your attention!
I think the main difference between our work and PEFT methods is that we scale the parameters. We have observed the power of scaling in models like GPT, Claude, and so on. We ran an experiment where LoRA tunes as many parameters as we add through scaling, yet it still does not generalize well in the specific domain. We hypothesize that PEFT methods are limited in their capacity to absorb new knowledge, which is what matters in (continual) pretraining. PEFT is useful for SFT, though: as one group recently noted (URIAL), at the SFT stage the model mainly learns style or format. So I think PEFT methods are more suitable for tasks like learning style or format, not for learning new knowledge, which requires dense parameters acquired in pretraining.
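To make the parameter-count contrast concrete, here is a minimal numpy sketch of a LoRA-style low-rank update (names and sizes are illustrative, not from our code): the trainable update `B @ A` has only `2*d*r` parameters versus `d*d` for the dense weight, and zero-initializing `B` means the adapted layer starts identical to the frozen one.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    # Frozen dense weight W plus trainable low-rank update B @ A.
    return x @ (W + alpha * (B @ A)).T

d, r = 8, 2  # hidden size and LoRA rank (toy values)
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # zero-init: update starts at 0

x = rng.normal(size=(1, d))
# With B = 0, the adapted layer matches the frozen layer exactly.
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)

# Trainable parameters: 2*d*r = 32 here, versus d*d = 64 dense.
print(2 * d * r, d * d)
```

The rank-`r` bottleneck is exactly the capacity constraint discussed above: however many such adapters you add, the update to each weight stays low-rank.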
Recently, another interesting work, yi-9b, also observes this property. It likewise uses depth expansion and then trains on math and code corpora, and reports that without scaling the parameters, continual training only marginally improves performance.
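For readers unfamiliar with depth expansion, here is a toy numpy sketch of the general idea (block structure and initialization are simplified assumptions, not the exact recipe of either paper): copies of existing residual blocks are interleaved into the stack with their output projection zeroed, so each new block is an identity at initialization and the expanded model reproduces the original before continual training begins.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size

def make_block():
    # Toy residual block: x + (x @ W1) @ W2
    return {"W1": rng.normal(size=(d, d)) * 0.1,
            "W2": rng.normal(size=(d, d)) * 0.1}

def forward(x, blocks):
    for b in blocks:
        x = x + (x @ b["W1"]) @ b["W2"]
    return x

blocks = [make_block() for _ in range(4)]

# Depth expansion: after each block, insert a copy whose output
# projection W2 is zeroed, so the new block contributes nothing
# at init and the expanded stack is function-preserving.
expanded = []
for b in blocks:
    expanded.append(b)
    expanded.append({"W1": b["W1"].copy(), "W2": np.zeros((d, d))})

x = rng.normal(size=(1, d))
assert np.allclose(forward(x, blocks), forward(x, expanded))
print(len(blocks), "->", len(expanded), "blocks")
```

The zero-initialized projection is the key design choice: the added depth gives the model genuinely new dense parameters to store knowledge in, while continual training starts from exactly the original model's behavior.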
So basically, I think the main difference is that we increase the parameters of the initial model for continual training, while PEFT is more suitable for the subsequent SFT.
I hope this will be helpful!
Hi there! It's really interesting work, but I have the following questions: