VishnuPJ / MalayaLLM

A 7B Llama-2 Indic model for the Malayalam language, continually pretrained and fine-tuned with LoRA.

MalayaLLM [മലയാളം/Malayalam]


This is an attempt to build a Large Language Model (LLM) for generative AI in the Malayalam language. Several LLMs already support multiple languages, including Malayalam, but their performance on tasks such as content generation and question answering in Malayalam can be improved through dedicated training on a Malayalam dataset. To that end, I have continually pre-trained the Llama 2 model on a comprehensive Malayalam dataset.

The model is in its early stages, and further training and fine-tuning on a more comprehensive dataset are needed to improve its performance. I will publish updated revisions of the model as that work progresses.

Model description

The MalayaLLM models build on the original LLaMA-2 and have been customized to incorporate a Malayalam vocabulary of approximately 18,000 tokens.
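
The effect of a larger Malayalam vocabulary can be illustrated with a toy tokenizer. The sketch below is not the project's tokenizer: `tokenize`, `base_vocab`, and `extended_vocab` are hypothetical names, and greedy longest-match with a UTF-8 byte fallback is a simplifying assumption standing in for a real subword tokenizer. It only shows why a word that has no dedicated piece costs many tokens, while the same word costs one token once it is in the vocabulary.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenizer with UTF-8 byte fallback (toy model)."""
    max_len = max((len(piece) for piece in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # Nothing matched: emit one token per UTF-8 byte of this character.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


word = "സൂര്യൻ"  # "Sun", taken from the example prompt in this README

# Hypothetical vocabularies: the base one has no Malayalam pieces at all.
base_vocab = {"the", "sun", " "}
extended_vocab = base_vocab | {word, "കിഴക്ക്"}

base_tokens = tokenize(word, base_vocab)
extended_tokens = tokenize(word, extended_vocab)
print(f"base vocabulary:     {len(base_tokens)} tokens")
print(f"extended vocabulary: {len(extended_tokens)} token(s)")
```

Fewer tokens per word means more Malayalam text fits in the same context window and each generated token carries more content.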

Model Update

The latest MalayaLLM model, trained on Gemma-2, can be found here: MalayaLLM-Gemma2-9B

Datasets Used

| Model | Type | Data | Base Model | # Params | Download Links |
|---|---|---|---|---|---|
| MalayaLLM 7B Base #v0.1 | Base model | 12 GB | LLaMA 7B | 7B | HF Hub |
| MalayaLLM 7B Instruct #v0.1 | Instruction following model | 52k instructions | MalayaLLM 7B Base | 7B | HF Hub |
| MalayaLLM 7B Instruct #v0.2 | Instruction following model | 52k instructions | MalayaLLM 7B Base | 7B | HF Hub |

Quantized Version of Available Models

| Model | Format | Bits | Download Links |
|---|---|---|---|
| MalayaLLM 7B Instruct #v0.1 | GGUF | Q8_0 | HF Hub |
| MalayaLLM 7B Instruct #v0.2 | GGUF | Q8_0 | HF Hub |
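
For readers unfamiliar with Q8_0, the sketch below shows the general idea of block-wise 8-bit quantization: each block of 32 weights is stored as 32 signed 8-bit integers plus one shared scale. It is a simplified illustration, not the code that produced these files. The function names and sample weights are made up, and the size estimate uses a nominal 7 billion parameters, ignoring the extra embeddings from the extended vocabulary and any file metadata.

```python
BLOCK_SIZE = 32  # weights per block in this sketch


def quantize_block(weights):
    """Quantize one block to int8-range values plus a single shared scale."""
    amax = max(abs(w) for w in weights)
    scale = amax / 127 if amax > 0 else 1.0
    quants = [round(w / scale) for w in weights]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate weights from the scale and the integer values."""
    return [scale * q for q in quants]


# Made-up sample weights in the range [-0.32, 0.31].
weights = [((i * 37) % 64 - 32) / 100 for i in range(BLOCK_SIZE)]
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Storage per block: 32 one-byte values plus one 2-byte (fp16) scale.
bits_per_weight = (BLOCK_SIZE * 1 + 2) * 8 / BLOCK_SIZE
approx_size_gb = 7e9 * bits_per_weight / 8 / 1e9

print(f"max round-trip error: {max_error:.6f}")
print(f"bits per weight: {bits_per_weight}")
print(f"approximate size for 7e9 weights: {approx_size_gb:.2f} GB")
```

The round-trip error per weight is bounded by half the block's scale, which is why 8-bit quantization usually costs little quality compared with the float16 weights.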

A simple code example

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    pipeline,
)

model_name = "VishnuPJ/MalayaLLM_7B_Instruct_v0.2"
print("Loading model...")
# Load the instruction-tuned model in half precision
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

pipe = pipeline(task="text-generation", model=base_model, tokenizer=tokenizer, max_length=200)
sys_prompt = "ഒരു ടാസ്ക് വിവരിക്കുന്ന ഒരു നിർദ്ദേശം ചുവടെയുണ്ട്. അഭ്യർത്ഥന ശരിയായി പൂർത്തിയാക്കുന്ന ഒരു പ്രതികരണം എഴുതുക."

while True:
    inst = input("Enter instruction (or 'exit' to end): ")
    if inst.lower() == 'exit':
        break
    # Generate response using the user-provided instruction
    result = pipe(f"{sys_prompt} ### Instruction: {inst} ### Response:")
    # Print the generated text
    print(result[0]['generated_text'].split("### Response:")[1])
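
The prompt layout used in the loop above can be factored into two small helpers. This is a sketch, not part of the repository: `build_prompt` and `extract_response` are hypothetical names. Unlike `split("### Response:")[1]`, the extraction helper does not raise an `IndexError` when the marker is missing from the generated text, and it returns everything after the first marker.

```python
SYS_PROMPT = (
    "ഒരു ടാസ്ക് വിവരിക്കുന്ന ഒരു നിർദ്ദേശം ചുവടെയുണ്ട്. "
    "അഭ്യർത്ഥന ശരിയായി പൂർത്തിയാക്കുന്ന ഒരു പ്രതികരണം എഴുതുക."
)
RESPONSE_MARKER = "### Response:"


def build_prompt(instruction, sys_prompt=SYS_PROMPT):
    """Format an instruction in the layout the example script sends to the model."""
    return f"{sys_prompt} ### Instruction: {instruction.strip()} {RESPONSE_MARKER}"


def extract_response(generated_text):
    """Return the text after the first response marker, or the whole text if absent."""
    _, marker, tail = generated_text.partition(RESPONSE_MARKER)
    return tail.strip() if marker else generated_text.strip()


prompt = build_prompt("Where does the Sun rise?")
# Simulated model output: the text-generation pipeline echoes the prompt first.
generated = prompt + " The Sun rises in the east."
print(extract_response(generated))
```

With the pipeline from the example, the loop body would become `print(extract_response(pipe(build_prompt(inst))[0]["generated_text"]))`.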

Example Output

Enter instruction (or 'exit' to end): സൂര്യൻ ഉദിക്കുന്ന ദിശ ഏതെന്നു പറയുക .
ഒരു ടാസ്ക് വിവരിക്കുന്ന ഒരു നിർദ്ദേശം ചുവടെയുണ്ട്. അഭ്യർത്ഥന ശരിയായി പൂർത്തിയാക്കുന്ന ഒരു പ്രതികരണം എഴുതുക. ### Instruction: സൂര്യൻ ഉദിക്കുന്ന ദിശ ഏതെന്നു പറയുക . ### Response: സൂര്യൻ ഉദിക്കുന്ന ദിശ കിഴക്കായിരിക്കും.
Enter instruction (or 'exit' to end): Where does the Sun rise?
ഒരു ടാസ്ക് വിവരിക്കുന്ന ഒരു നിർദ്ദേശം ചുവടെയുണ്ട്. അഭ്യർത്ഥന ശരിയായി പൂർത്തിയാക്കുന്ന ഒരു പ്രതികരണം എഴുതുക. ### Instruction: Where does the Sun rise? ### Response: The Sun rises in the east.
Enter instruction (or 'exit' to end): exit

Demo Video

Below is a brief video showing the model's bilingual ability to converse in both Malayalam and English. In this demonstration, I use Google's transliteration tool to transliterate from Manglish to Malayalam, then copy the transliterated text into the prompt console for further interaction.

https://github.com/VishnuPJ/MalayaLLM/assets/54801493/c294b26d-66ef-4e07-94f0-fc81f2d1e026

Getting Started

Steps to run pretraining and finetuning

1) Download the dataset.

* Go to the "Data" folder.
* Download all the files from "[CulturaX](https://huggingface.co/datasets/uonlp/CulturaX/tree/main/ml)" into a folder named "data/CulturaX".
* Run "parquet2txt.py". It will create a file named "data_clm.txt".
* Download "[ai4bharat](https://storage.googleapis.com/ai4bharat-public-indic-nlp-corpora/indiccorp/ml.tar.xz)" and extract it.
* Copy "data_clm.txt" and "ml.txt" into a folder named "data/ml".
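
The Parquet-to-text step can be sketched as follows. This is not the repository's "parquet2txt.py", whose details may differ. It assumes that each Parquet file has a `text` column, that one document per output line is the desired layout, and that `pandas` with a Parquet engine such as `pyarrow` is installed. The function names are hypothetical.

```python
from pathlib import Path


def documents_to_text(documents):
    """Write each non-empty document on its own line, collapsing inner whitespace."""
    lines = []
    for doc in documents:
        cleaned = " ".join(str(doc).split())
        if cleaned:
            lines.append(cleaned)
    return "\n".join(lines) + ("\n" if lines else "")


def parquet_dir_to_txt(parquet_dir, out_path, text_column="text"):
    """Append the text column of every .parquet file in parquet_dir to out_path."""
    import pandas as pd  # imported lazily so the helper above has no dependencies

    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(parquet_dir).glob("*.parquet")):
            frame = pd.read_parquet(path, columns=[text_column])
            out.write(documents_to_text(frame[text_column].dropna()))


# Small in-memory demonstration of the text layout.
sample = documents_to_text(["first line\nsecond line", "   ", "another document"])
print(sample, end="")

# Intended use with the folder layout described above:
# parquet_dir_to_txt("data/CulturaX", "data_clm.txt")
```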

2) Tokenization

3) Pretrain

4) Finetune

5) Inference

* Optionally, merge the fine-tuned LoRA weights in the "output_finetune" folder with the MalayaLLM pretrained weights in the "merged_lora_llama_pretrained_hf" folder using "merge_lora_with_llama.py".
* Otherwise, both sets of weights are loaded and merged at inference time.
* Run "infer.py" for inference. Change the instruction to generate a different response.
* You can use "[Transliterate](https://www.google.com/intl/ml/inputtools/try/)" to transliterate from Manglish to Malayalam.
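
Both options above give the same outputs because merging LoRA weights is plain matrix arithmetic: the low-rank update is added to the frozen weight as W + (alpha / r) * B A. The sketch below demonstrates this on a tiny made-up layer in pure Python. It is an illustration of the math, not the contents of "merge_lora_with_llama.py", and all names and values are hypothetical. For real checkpoints, the `peft` library's `merge_and_unload()` method performs the equivalent operation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(matrix, vector):
    """Multiply a matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A)."""
    scaling = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Tiny made-up layer: 2 inputs, 2 outputs, LoRA rank 1.
weight = [[1.0, 0.0], [0.0, 1.0]]  # frozen weight, shape (out, in)
lora_a = [[1.0, 2.0]]              # shape (rank, in)
lora_b = [[0.5], [-1.0]]           # shape (out, rank)
alpha, rank = 2, 1
x = [3.0, 1.0]

# Option 1: merge once, then use a single matrix.
merged = merge_lora(weight, lora_a, lora_b, alpha, rank)
y_merged = matvec(merged, x)

# Option 2: keep the adapter separate and add its contribution at inference time.
adapter_out = matvec(lora_b, matvec(lora_a, x))
y_separate = [b + (alpha / rank) * a for b, a in zip(matvec(weight, x), adapter_out)]

print(y_merged, y_separate)
```

Merging ahead of time removes the extra adapter computation at inference and yields a single set of weights, which is convenient for later conversion to GGUF.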

6) Generate .GGUF model

* Refer to [hf-gguf](https://www.substratus.ai/blog/converting-hf-model-gguf-model/).

7) Push to hub.

* Run "Utils\push2hub.py".

Reference

* [Continual Pre-training of Language Models](https://arxiv.org/abs/2302.03241)
* [Llama 2](https://arxiv.org/abs/2307.09288)
* [Chinese-LLaMA](https://github.com/ymcui/Chinese-LLaMA-Alpaca/tree/main)
* [tamil-llama](https://github.com/abhinand5/tamil-llama/blob/main)