I would like to ask about the number of tokens used for training in the annealing phase. I checked carefully, but the paper does not mention it. How many tokens of high-quality raw text did you use, and how many instruction samples did you use in the annealing phase?
Thank you for the great paper; it has helped me a lot.
Thanks for your interest. In total we train on 1.1T tokens, of which the decay phase takes 0.1T tokens, and SFT takes around 6B tokens (we did not count the number of instruction samples).
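For anyone who wants the proportions, here is a rough sketch of the arithmetic in Python. It reads the 0.1T decay tokens as part of the 1.1T total, which is an assumption, and it makes no claim about whether the roughly 6B SFT tokens are counted inside the decay phase. The function and key names are made up for illustration.

```python
# Token counts quoted in the reply above (SFT figure is approximate).
TOTAL_TOKENS = 1_100_000_000_000  # 1.1T tokens for the whole run
DECAY_TOKENS = 100_000_000_000    # 0.1T tokens in the decay (annealing) phase
SFT_TOKENS = 6_000_000_000        # around 6B SFT tokens


def phase_breakdown(total, decay, sft):
    """Return simple ratios between the quoted token counts."""
    # Assumption: the decay tokens are included in the total,
    # so the remainder is what was trained before the decay phase.
    before_decay = total - decay
    return {
        "tokens_before_decay": before_decay,
        "decay_share_of_total": decay / total,
        "sft_to_decay_ratio": sft / decay,
    }


breakdown = phase_breakdown(TOTAL_TOKENS, DECAY_TOKENS, SFT_TOKENS)
for name, value in breakdown.items():
    print(f"{name}: {value}")
```

Under that reading, about 1.0T tokens come before the decay phase, the decay phase is roughly 9% of the run, and the SFT token count is about 6% of the size of the decay phase.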