bofenghuang / vigogne

French instruction-following and chat models
Apache License 2.0

[IDEA] RedPajama-Data-1T #12

Open · svupper opened 1 year ago

svupper commented 1 year ago

Creating a French LLaMA version by translating the RedPajama dataset.
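
A minimal sketch of this idea, streaming the `togethercomputer/RedPajama-Data-1T` mirror from the Hugging Face Hub and translating documents with an off-the-shelf MT model. The translation model (`Helsinki-NLP/opus-mt-en-fr`) and the choice of subset are illustrative assumptions, not part of the proposal:

```python
from datasets import load_dataset
from transformers import pipeline

# Off-the-shelf English -> French MT model (illustrative choice).
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

# Stream a single subset rather than downloading the full ~1T-token corpus.
dataset = load_dataset(
    "togethercomputer/RedPajama-Data-1T",
    "wikipedia",
    split="train",
    streaming=True,
    trust_remote_code=True,  # the dataset uses a loading script
)

for i, example in enumerate(dataset):
    if i >= 3:  # smoke test on a handful of documents
        break
    # The MT model has a short context window, so translate a snippet;
    # a real pipeline would split each document into sentences first.
    snippet = example["text"][:500]
    french = translator(snippet)[0]["translation_text"]
    print(french[:200], "\n---")
```

At full scale, translating ~1T tokens this way would itself be a substantial compute job, so sentence-level batching on GPU would be essential.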

bofenghuang commented 1 year ago

Hi @svupper,

Meta's LLaMA model was trained on a massive amount of data: 1.0T/1.4T tokens on 2048 A100s (80GB) over a period of 5 months. Continuing the pre-training of LLaMA on a French corpus is definitely a promising approach to improving its performance in French. However, this option is still quite expensive and may require significant computational resources. I'm currently pre-training it on a small French dataset to see how much it helps. Stay tuned!
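
For reference, a minimal sketch of what continued pre-training could look like with Hugging Face Transformers. The checkpoint path, corpus file, and hyperparameters are placeholders, not the setup used here, and a real run would pack documents into fixed-length blocks instead of truncating:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

# Placeholder: local path to converted LLaMA weights.
model_path = "path/to/llama-7b"
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA has no pad token
model = AutoModelForCausalLM.from_pretrained(model_path)

# Placeholder French corpus: one document per line in a text file.
raw = load_dataset("text", data_files={"train": "french_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama-fr-pretrain",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=train,
    # Causal LM: mlm=False makes the collator build next-token labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```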