gbs-ai opened 2 months ago
Llama 2 was trained on very little Malayalam data, so its default tokenizer is ineffective at tokenizing Malayalam words. To address this, we added newly trained Malayalam tokens to the default tokenizer and continued training the model to recognize these augmented tokens. During the pretraining steps we also fed the model a substantial amount of Malayalam text to improve its proficiency in the language.
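A quick sketch of why the default tokenizer struggles with Malayalam: an English-centric BPE with byte fallback has few or no merges covering the Malayalam script, so it decomposes each character into its raw UTF-8 bytes, inflating short words into long token sequences. The word below is just an illustrative example (plain Python, no model needed):

```python
# A short Malayalam word: 6 Unicode characters.
word = "മലയാളം"  # "Malayalam" in Malayalam script

# Each Malayalam character needs 3 bytes in UTF-8, so a byte-fallback
# tokenizer with no Malayalam merges may emit up to 18 byte tokens.
utf8_bytes = word.encode("utf-8")
print(len(word), "characters ->", len(utf8_bytes), "UTF-8 bytes")
# → 6 characters -> 18 UTF-8 bytes
```

Adding dedicated Malayalam tokens to the vocabulary avoids this blow-up, which is why the tokenizer was augmented before continued pretraining.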
Thank you, that's great information. I wonder how I can use the GGUF model for inference. Do I load it with the llama.cpp module?
@sumairrasi Yes, you need to load it with llama.cpp.
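For reference, a minimal sketch of loading a GGUF file through the llama-cpp-python bindings (the model filename and prompt here are placeholders, not the actual release artifacts; adjust `n_ctx` and generation parameters as needed):

```python
from llama_cpp import Llama

# Load the quantized GGUF model from disk (path is a placeholder).
llm = Llama(model_path="./malayalam-llama2.Q4_K_M.gguf", n_ctx=2048)

# Run a completion on a prompt and print the generated text.
out = llm("ചോദ്യം: ", max_tokens=64)
print(out["choices"][0]["text"])
```

The same GGUF file also works with the llama.cpp command-line tools directly, without Python.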
Hi, it's a wonderful repository, but I have a doubt; I'm new to this. How did you pretrain the Llama 2 base model? Malayalam is not in the base model, right? It's trained mostly on English tokens. If you trained from the base weights, how does the model learn Malayalam vocabulary or context?