AmgadHasan opened this issue 6 months ago
Hello! Thank you for your inquiry and your kind words regarding AceGPT. Specifically, AceGPT-7B's continued pretraining used a dataset of 30 billion tokens, while AceGPT-13B utilized 10 billion tokens.
Best regards,
Do you plan to release the dataset as well?
Hello!
Thank you so much for developing and releasing this model to the public. As a native Arabic speaker, I highly appreciate your efforts in enriching our beautiful language.
I have the following question related to the training process:
As per my understanding, the first step is continued pretraining of Llama2 on Arabic data in a self-supervised manner. My question is: how large was the dataset used in this step?
Thanks in advance.
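For context, here is a minimal sketch of what such a continued-pretraining (causal language modeling) step could look like with Hugging Face transformers. The corpus file name, hyperparameters, and output directory are illustrative assumptions, not AceGPT's actual setup.

```python
# Sketch of continued pretraining Llama2 on Arabic text via next-token prediction.
# Assumes access to the Llama-2 weights and a local plain-text corpus
# `arabic_corpus.txt` (hypothetical); hyperparameters are illustrative only.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load the raw text corpus and tokenize it for causal language modeling.
raw = load_dataset("text", data_files={"train": "arabic_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# mlm=False -> standard self-supervised next-token objective (no masking).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="llama2-arabic-continued-pretrain",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```

This is only meant to illustrate the self-supervised objective being asked about; the actual AceGPT training pipeline, data mix, and scale are what the question is seeking from the authors.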