Can I get the dataset used to train tamil-llama?

abhinand5 / tamil-llama

A New Tamil Large Language Model (LLM) Based on Llama 2

GNU General Public License v3.0

255 stars, 34 forks

Can I get the dataset used to train tamil-llama? #6

Closed: bharathdpv closed this issue 8 months ago

bharathdpv commented 9 months ago

Hello abhinand5, I am Bharath Kumar, and I am currently researching local LLMs. I plan to fine-tune the model for the civil construction domain. The training dataset would be helpful, so that I can train/fine-tune the model from scratch in the Tamil language. Thanks!

Regards, Bharath Kumar

abhinand5 commented 8 months ago

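The pretraining dataset was taken from CulturaX:

import os

from datasets import load_dataset

# Hugging Face access token; reading it from the HF_TOKEN environment variable is an assumption
hf_token = os.environ.get("HF_TOKEN")

# Stream the Tamil ("ta") subset of CulturaX, pinned to a specific dataset revision
ds = load_dataset(
    "uonlp/CulturaX",
    "ta",
    split="train",
    streaming=True,
    token=hf_token,  # replaces the deprecated use_auth_token argument
    revision="321a983f3fd2a929cc1f8ef6207834bab0bb9e25",
)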
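
Because the dataset is loaded with streaming=True, `ds` is an iterable rather than an in-memory table, so records are fetched lazily as you loop over it. A minimal sketch of how one might peek at the first few documents follows. The helper name `preview_texts` is hypothetical, and the "text" field name is an assumption about the CulturaX record schema; the stand-in `sample` list is there only so the sketch runs offline.

```python
from itertools import islice
from typing import Iterable, Iterator


def preview_texts(
    records: Iterable[dict], n: int = 3, text_key: str = "text"
) -> Iterator[str]:
    """Yield the text field of at most the first n records."""
    # islice stops after n items, so a streaming dataset is not read further.
    for record in islice(records, n):
        yield record[text_key]


# Stand-in records (hypothetical data) so the sketch runs without network access.
# With the real dataset, pass the streaming `ds` object instead of `sample`.
sample = [
    {"text": "வணக்கம்"},
    {"text": "தமிழ்"},
    {"text": "நன்றி"},
    {"text": "உலகம்"},
]

first_two = list(preview_texts(sample, n=2))
```

With the real stream, `for text in preview_texts(ds, n=5): print(text[:200])` would print the start of five documents without downloading the whole Tamil subset.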

The fine-tuning datasets are already linked in the README:

https://github.com/abhinand5/tamil-llama?tab=readme-ov-file#datasets

bharathdpv commented 8 months ago

Thank you!