Closed: binz98 closed this issue 5 months ago.
Thank you for your excellent work!
I wonder whether the same amount of data (the full codes_pretrain_corpus) was used for the incremental pre-training of the different model sizes (1B to 15B)?
Looking forward to your reply.

Yes, all variants of CodeS use the same volume of pre-training data. However, we observed that CodeS-15B reaches the lowest training loss while CodeS-1B has the highest; this difference correlates directly with model capacity, i.e., the number of parameters.