FreedomIntelligence / AceGPT


Size of data for continued pretraining #7

Open AmgadHasan opened 6 months ago

AmgadHasan commented 6 months ago

Hello!

Thank you so much for developing and releasing this model to the public. As a native Arabic speaker, I highly appreciate your efforts in enriching our beautiful language.

I have the following question related to the training process:

As per my understanding, the first step is continued pretraining of Llama 2 on Arabic data in a self-supervised manner. My question is: how large is the dataset used in this step?
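To make sure I'm describing the same step, here is a minimal sketch of what I have in mind for that continued-pretraining stage, using Hugging Face transformers. The model name, corpus file, and hyperparameters are placeholders I made up for illustration, not AceGPT's actual recipe:

```python
# Minimal sketch of self-supervised (causal LM) continued pretraining of Llama 2 on Arabic text.
# The model name, corpus path, and hyperparameters are placeholders, not AceGPT's actual setup.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Plain Arabic text corpus; each line becomes one example with a "text" field.
raw = load_dataset("text", data_files={"train": "arabic_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# mlm=False -> standard next-token (causal) language-modeling objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="llama2-arabic-cpt",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
    logging_steps=100,
    save_steps=1000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```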

Thanks in advance.

jianqing666 commented 6 months ago

Hello! Thank you for your inquiry and your kind words regarding AceGPT. Specifically, the continued pretraining of AceGPT-7B used a dataset of 30 billion tokens, while AceGPT-13B used 10 billion tokens.

Best regards,

alielfilali01 commented 5 months ago

> Hello! Thank you for your inquiry and your kind words regarding AceGPT. Specifically, the continued pretraining of AceGPT-7B used a dataset of 30 billion tokens, while AceGPT-13B used 10 billion tokens.
>
> Best regards,

Do you guys plan to release the dataset as well?