Lightning-AI / lit-llama

Implementation of the LLaMA language model based on nanoGPT. Supports flash attention, Int8 and GPTQ 4-bit quantization, LoRA and LLaMA-Adapter fine-tuning, and pre-training. Apache 2.0-licensed.

Best way to fine tune on Wiki Data #365

Open JulianBvW opened 1 year ago

JulianBvW commented 1 year ago

I want to fine-tune LLaMA on data I got from a fandom wiki (for example this page) and was wondering how to design the JSON file with its "prompt", "input", and "output" fields?

I can't just use the prompt "Write the next sentence" and put two adjacent sentences in input and output, right?

rasbt commented 1 year ago

One way would be to use the Dolly 2.0 JSON file as a template and structure your dataset in the same fashion, using the same keys. And then run the prepare_dolly script.

[Screenshot: example records from the Dolly 2.0 JSON file]
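For concreteness, here is a minimal sketch of what such records could look like when built from scraped wiki pages. The key names ("instruction", "context", "response", "category") follow the databricks-dolly-15k release, which ships as JSONL (one record per line); the file name and the record contents are placeholders, so adjust them to whatever the prepare_dolly script expects.

```python
# Sketch: write Dolly-style records built from scraped wiki text.
# Keys follow databricks-dolly-15k; file name and contents are placeholders.
import json

records = [
    {
        "instruction": "Summarize the wiki article below in one paragraph.",
        "context": "<scraped article text>",
        "response": "<the summary you want the model to produce>",
        "category": "summarization",
    },
]

with open("my_wiki_dolly_style.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```
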
rasbt commented 1 year ago

Or, maybe even easier, you could structure it similarly to the Alpaca dataset, which uses slightly different key names, and then use the prepare_alpaca script.

[Screenshot: example records from the Alpaca dataset]
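The same idea with Alpaca-style keys ("instruction", "input", "output"), stored as a single JSON list the way alpaca_data.json is laid out; again a sketch with placeholder file name and contents:

```python
# Sketch: the same data with Alpaca-style keys, stored as one JSON list.
import json

records = [
    {
        "instruction": "Answer the question using the passage below.",
        "input": "<scraped wiki passage>\n\nQuestion: <your question>",
        "output": "<the expected answer>",
    },
]

with open("my_wiki_alpaca_style.json", "w") as f:
    json.dump(records, f, indent=2)
```

You would then run scripts/prepare_alpaca.py against that file (if I remember correctly, the script takes arguments for the data file and destination path).
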
JulianBvW commented 1 year ago

Yes, but the question is how I can automatically fill instruction, input, and output using the web-scraped texts from the wiki pages?

asadabbas09 commented 1 year ago

I'm also looking for the best way to create a dataset. I suppose we have to manually create some examples (instructions/outputs) first and can then use Self-Instruct to expand them for training.

I'm not sure how much data we need to create or how long each instruction and response should be.

Is there a more systematic way of creating the manual data?

rasbt commented 1 year ago

> Yes, but the question is how I can automatically fill instruction, input, and output using the web-scraped texts from the wiki pages?

Oh, I think I now understand what you mean. Essentially, you don't have an instruction-finetuning dataset, correct? In other words, it's an "unlabeled" dataset. One way to handle this would be to create an instruction dataset by imitation learning: use another LLM (like GPT-4, e.g., via the API) to generate instruction-response pairs from your data. This is essentially how the Alpaca dataset itself was created (for more details: https://github.com/tatsu-lab/stanford_alpaca#data-generation-process).
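To make the imitation-learning idea concrete, here is a rough sketch using the openai package's ChatCompletion API (0.x style). The prompt wording, chunking, and output parsing are my own assumptions rather than anything from lit-llama, and the model's JSON output isn't guaranteed to parse cleanly, so treat it only as a starting point.

```python
# Rough sketch: ask a stronger model to write (instruction, input, output)
# triples from raw wiki text. Assumes OPENAI_API_KEY is set in the environment.
# Prompt wording, chunking, and JSON parsing are assumptions, not lit-llama code.
import json
import textwrap

import openai  # pip install openai (0.x API)

PROMPT = textwrap.dedent("""\
    Below is a passage from a fandom wiki. Write 3 instruction-following
    examples about it as a JSON list, where each item has the keys
    "instruction", "input", and "output". Use the passage (or an excerpt
    of it) as "input" when it helps.

    Passage:
    {passage}
    """)

def generate_examples(passage: str) -> list[dict]:
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT.format(passage=passage)}],
        temperature=0.7,
    )
    # May raise if the model returns anything other than clean JSON.
    return json.loads(response["choices"][0]["message"]["content"])

wiki_passages = ["<scraped page text 1>", "<scraped page text 2>"]  # your chunks

all_examples = []
for passage in wiki_passages:
    all_examples.extend(generate_examples(passage))

with open("wiki_instruction_data.json", "w") as f:
    json.dump(all_examples, f, indent=2)
```
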

Or, if you are not interested in instruction-finetuning, I guess you could use the pretraining script to further train the model via next-word prediction on your custom text.
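If you go the pretraining route, preparing the data could look roughly like the sketch below, loosely modeled on scripts/prepare_shakespeare.py. I'm assuming that lit_llama.Tokenizer can be constructed from the checkpoint's tokenizer.model, that encode() returns a tensor of token ids, and that the training script can consume flat .bin files of uint16 ids as in the Shakespeare example; please double-check those assumptions against the repo.

```python
# Sketch: tokenize raw wiki text into train/val token-id files, loosely
# following scripts/prepare_shakespeare.py. Paths, split ratio, and the
# Tokenizer usage are assumptions to verify against the repo.
from pathlib import Path

import numpy as np
from lit_llama import Tokenizer

text = Path("wiki_dump.txt").read_text(encoding="utf-8")  # all scraped pages, concatenated
split = int(0.9 * len(text))
train_text, val_text = text[:split], text[split:]

tokenizer = Tokenizer(Path("checkpoints/lit-llama/tokenizer.model"))
train_ids = tokenizer.encode(train_text).numpy().astype(np.uint16)  # 32k LLaMA vocab fits in uint16
val_ids = tokenizer.encode(val_text).numpy().astype(np.uint16)

out_dir = Path("data/my_wiki")
out_dir.mkdir(parents=True, exist_ok=True)
train_ids.tofile(out_dir / "train.bin")
val_ids.tofile(out_dir / "val.bin")
```
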

JuicyStandoffishMan commented 1 year ago

I'm very new to this, so I apologize if it's incorrect, but I believe you can just follow the unstructured data guide or adapt the prepare_shakespeare.py code.