I have noticed that the weight decay values differ between stages in both mPLUG-OWL and mPLUG-OWL2. For example, mPLUG-OWL2 uses 0.05 for pre-training and 0 for fine-tuning. Why this difference, and what is the strategy for choosing the values?
For pre-training, we do not want to overfit the datasets too much, so we apply weight decay as regularization. During the SFT stage, we follow the common practice in LLM training of running only 1 epoch, which does not lead to an overfitting problem, so weight decay is not needed.
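To make the difference concrete, here is a minimal sketch of how the two stages could be configured with PyTorch's `AdamW`; the model and learning rates are placeholders, not the actual mPLUG-OWL2 hyperparameters:

```python
import torch
from torch import nn

# Placeholder model standing in for the actual architecture.
model = nn.Linear(16, 4)

# Pre-training: non-zero weight decay regularizes against overfitting
# on the large pre-training corpus.
pretrain_opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)

# SFT: only 1 epoch is run, so overfitting is not a concern and
# weight decay is disabled.
sft_opt = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.0)
```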