Pretraining Phase: Pretrained on a vast corpus of over 5 billion tokens extracted from Common Crawl in Traditional Mandarin.
Was this model trained from randomly initialized weights and biases in the Llama 2 architecture, using only the ~5 billion pure Traditional Chinese tokens (with no tokens from other languages)? Or was it continued pretraining from the native Llama 2 weights plus the 5 billion tokens? (A sketch of the two cases follows below.) In either case, what hardware configuration was used for the pretraining phase?
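To make the two scenarios concrete, here is a minimal sketch of what I mean, assuming the Hugging Face transformers API and the 7B Llama 2 checkpoint (`meta-llama/Llama-2-7b-hf`) purely as an illustrative example, not as the actual setup of this model:

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Scenario A: same Llama 2 architecture, but weights randomly initialized,
# then pretrained only on the ~5B Traditional Chinese tokens.
config = LlamaConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
model_from_scratch = LlamaForCausalLM(config)  # random initialization

# Scenario B: continued pretraining — start from the released Llama 2 weights
# and further train on the additional ~5B Traditional Chinese tokens.
model_continued = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
```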
Thanks.