I have noticed that the weight decay values differ between stages in both mPLUG-OWL and mPLUG-OWL2. For example, mPLUG-OWL2 uses 0.05 for pre-training and 0 for fine-tuning. Why this difference, and what is the strategy for choosing the values?
For pre-training, we do not want to overfit the datasets too much, so we apply weight decay as regularization. During the SFT stage, we follow the common practice in LLM training of running only 1 epoch, which does not lead to an overfitting problem, so weight decay is not needed.
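To make the difference concrete, here is a minimal sketch of how the two stages could be configured with PyTorch's `AdamW`; the model and learning rates are placeholders, not the actual mPLUG-OWL2 hyperparameters:

```python
import torch
from torch import nn

# Placeholder model standing in for the actual architecture.
model = nn.Linear(16, 4)

# Pre-training: non-zero weight decay regularizes against overfitting
# on the large pre-training corpus.
pretrain_opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)

# SFT: only 1 epoch is run, so overfitting is not a concern and
# weight decay is disabled.
sft_opt = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.0)
```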