Pretraining Phase: Pretrained on a vast corpus of over 5 billion tokens extracted from Common Crawl in Traditional Mandarin.
Was this model trained from randomly initialized weights and biases in the Llama 2 architecture, using only the ~5 billion pure Traditional Chinese tokens (with no tokens from other languages)? Or was it continued pretraining from the native Llama 2 weights plus the 5 billion tokens? (A sketch of the two cases follows below.) In either case, what hardware configuration was used for the pretraining phase?
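To make the two scenarios concrete, here is a minimal sketch of what I mean, assuming the Hugging Face transformers API and the 7B Llama 2 checkpoint (`meta-llama/Llama-2-7b-hf`) purely as an illustrative example, not as the actual setup of this model:

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Scenario A: same Llama 2 architecture, but weights randomly initialized,
# then pretrained only on the ~5B Traditional Chinese tokens.
config = LlamaConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
model_from_scratch = LlamaForCausalLM(config)  # random initialization

# Scenario B: continued pretraining — start from the released Llama 2 weights
# and further train on the additional ~5B Traditional Chinese tokens.
model_continued = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
```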
Thanks.