VishnuPJ / MalayaLLM

A Continually LoRA PreTrained and FineTuned 7B Llama-2 Indic model for Malayalam Language.

Pretraining model #3

Open gbs-ai opened 2 months ago

gbs-ai commented 2 months ago

Hi, it's a wonderful repository. I have a doubt, as I'm new to this: how did you pretrain the Llama-2 base model? Malayalam is not covered in the base model, right? It is trained mostly on English tokens. If you trained starting from the base weights, how does the model learn the Malayalam vocabulary or context?

VishnuPJ commented 2 months ago

Llama 2 has seen very little Malayalam during training, so its default tokenizer is not effective at tokenizing Malayalam words. To address this, we added additional trained Malayalam tokens to the default tokenizer and continued pretraining the model so it learns these augmented tokens. During the pretraining steps we also fed the model a substantial amount of Malayalam text data to improve its proficiency in the language.
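For anyone curious what "adding tokens and retraining" looks like in practice, here is a minimal sketch using the Hugging Face `transformers` API. It is not the authors' exact script; the base checkpoint name and the example Malayalam tokens are assumptions for illustration, and the new embedding rows would then be learned during the continual pretraining phase.

```python
# Minimal sketch: extend a Llama-2 tokenizer with Malayalam tokens and
# resize the embedding matrix before continual pretraining.
from transformers import AutoTokenizer, AutoModelForCausalLM

base_model = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Hypothetical Malayalam subword tokens, e.g. learned separately with
# SentencePiece on a Malayalam corpus (illustrative examples only).
new_malayalam_tokens = ["മലയാളം", "ഭാഷ", "പഠനം"]

num_added = tokenizer.add_tokens(new_malayalam_tokens)
print(f"Added {num_added} new tokens")

# Give the new token ids trainable embedding vectors; these are then
# updated while pretraining on Malayalam text.
model.resize_token_embeddings(len(tokenizer))
```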

sumairrasi commented 1 month ago

Thank you, that's great information. I wonder how I can use the GGUF model for inference. Should I load it with the llama.cpp module?

VishnuPJ commented 1 month ago

@sumairrasi Yes, you need to load it using llama.cpp.
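As a rough sketch, the GGUF file can be loaded from Python with the llama-cpp-python bindings. The model filename and prompt below are assumptions; substitute the actual released GGUF file.

```python
# Minimal sketch: run a GGUF model with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="MalayaLLM-7B.gguf",  # path to the downloaded GGUF file (assumed name)
    n_ctx=2048,                      # context window
    n_gpu_layers=-1,                 # offload all layers to GPU if available
)

output = llm(
    "ചോദ്യം: മലയാളത്തിൽ ഒരു ചെറിയ കഥ പറയൂ.\nഉത്തരം:",  # sample Malayalam prompt
    max_tokens=256,
    temperature=0.7,
)
print(output["choices"][0]["text"])
```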