Closed: binz98 closed this issue 5 months ago.
Thank you for your excellent work!
I wonder whether the same amount of data (the full codes_pretrain_corpus) was used for the incremental pre-training of the different model sizes (1B to 15B)?
Looking forward to your reply.

Yes, all variants of CodeS use the same volume of pre-training data. However, we observed that CodeS-15B reaches the lowest training loss while CodeS-1B has the highest; this difference correlates directly with model capacity, i.e., the number of parameters.