Remarkable work! There are a few training details that may not be covered in the paper. Were the LLM parameters fully updated in stage 2 (Generative Pre-training)? I'm also curious how the batch size can be set to 512 on 2×8 GPUs with 40GB of memory each. Was the training data generally short in length?
Yes, all LLM weights are trained in stage 2 with ZeRO-3. Due to limited devices, we use gradient_accumulation_steps to reach an effective total batch size of 512.
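For reference, the effective batch size works out as per-device batch size × number of GPUs × gradient accumulation steps. A minimal sketch of that arithmetic is below; the per-device micro-batch size of 4 is an assumption for illustration, not a value confirmed by the authors.

```python
# Sketch: deriving gradient_accumulation_steps for an effective batch of 512
# on 2 nodes x 8 GPUs. The per-device micro-batch size (4) is an assumed
# value that fits 40GB memory under ZeRO-3; adjust it for your setup.
num_gpus = 2 * 8                 # 2 nodes, 8 GPUs each
per_device_batch_size = 4        # assumed micro-batch per GPU
target_total_batch_size = 512    # effective batch size quoted above

grad_accum_steps = target_total_batch_size // (num_gpus * per_device_batch_size)
assert num_gpus * per_device_batch_size * grad_accum_steps == target_total_batch_size

print(grad_accum_steps)  # -> 8 accumulation steps per optimizer update
```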