How to train LoRA on raw text files with different structures

axolotl-ai-cloud / axolotl

Go ahead and axolotl questions

https://axolotl-ai-cloud.github.io/axolotl/

Apache License 2.0


How to train LoRA on raw text files with different structures #340

Status: Closed (SuperMasterBlasterLaser closed this 1 year ago)

SuperMasterBlasterLaser commented 1 year ago

Hello everyone.

According to this part of the docs, training LoRAs requires the dataset to be a JSON file.

The problem is that my dataset is a raw text file with its own structure. It looks like this:

[ STORY ]
Wonderful day

[ NARRATOR ]
This is wonderful day. Young boy CharA is walking down the road.

[ CharA ]
What a wonderful day. I hope I won't be a late.

[ END_STORY ]

oobabooga's web UI can train a LoRA on a raw text file, where I just assign a separator symbol. How do I train a LoRA on raw text using this library?

NanoCode012 commented 1 year ago

Hey there. If you just want to train as completion, you can put the whole text of each file into one record: {"text": "[STORY...."}

Though, please check the context length. How long is each of these? If a sample exceeds your model's context, it may be discarded (https://github.com/OpenAccess-AI-Collective/axolotl/blob/c93655c0a3bd8d9c2b66f85bacc174ceb81de79f/src/axolotl/utils/data.py#L381).
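A rough way to spot samples that might be too long before training is sketched below. This is only an illustration: `find_overlong` and `rough_token_count` are hypothetical helpers, not part of axolotl, and counting whitespace-separated words underestimates real token counts. For a real check you would pass in a function that counts tokens with your model's own tokenizer.

```python
from typing import Callable, List, Tuple


def rough_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def find_overlong(
    texts: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> List[Tuple[int, int]]:
    """Return (index, count) for every text whose count exceeds max_tokens."""
    overlong = []
    for i, text in enumerate(texts):
        n = count_tokens(text)
        if n > max_tokens:
            overlong.append((i, n))
    return overlong


# Example: with a limit of 4 "tokens", only the second sample is flagged.
samples = ["a b c", "a b c d e f"]
flagged = find_overlong(samples, max_tokens=4)
```

Any sample reported here is a candidate for splitting into smaller pieces.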

SuperMasterBlasterLaser commented 1 year ago

@NanoCode012 Thank you for the advice.

What is the maximum context length for Llama 2 70B? Isn't it 4K?

NanoCode012 commented 1 year ago

Yes, 4k.

NanoCode012 commented 1 year ago

Please let me know if that answered the question, and we can close this Issue :)

SuperMasterBlasterLaser commented 1 year ago

Thank you for the answer.

SuperMasterBlasterLaser commented 1 year ago

@NanoCode012 I forgot to ask one thing. I have a giant raw text file of about 30 MB. It is far longer than the 4K context.

How do I prepare the dataset as JSON? Do I need to separate each story like this?

{
    "text1": "[ STORY_1 ].....",
    "text2": "[ STORY_1 ].....",
    "text3": "[ STORY_1 ]....."
}

Also, each story has a description. Do I need to repeat the description in every part when I split one story into several parts?

NanoCode012 commented 1 year ago

Since completion is pre-training, where the model predicts the next token, I would say just split it wherever it makes sense. We just need to experiment.

For example,

[ STORY ]
Wonderful day

[ NARRATOR ]
This is wonderful day. Young boy CharA is walking down the road.

[ CharA ]
What a wonderful day. I hope I won't be a late.

[ END_STORY ]
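One possible way to split a large raw file "wherever it makes sense" is at story boundaries. The sketch below assumes every story ends with the `[ END_STORY ]` marker, as in the sample above; `split_stories` is a hypothetical helper, not an axolotl function. A story that is still longer than the model's context would need further splitting.

```python
from typing import List

END_MARKER = "[ END_STORY ]"  # assumed to terminate every story in the raw file


def split_stories(raw: str) -> List[str]:
    """Split raw text into whole stories, keeping the end marker on each one."""
    stories = []
    for chunk in raw.split(END_MARKER):
        chunk = chunk.strip()
        if chunk:  # skip the empty remainder after the last marker
            stories.append(chunk + "\n\n" + END_MARKER)
    return stories


raw_text = (
    "[ STORY ]\nWonderful day\n\n[ END_STORY ]\n\n"
    "[ STORY ]\nAnother day\n\n[ END_STORY ]\n"
)
stories = split_stories(raw_text)
```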

SuperMasterBlasterLaser commented 1 year ago

@NanoCode012

So you mean the JSON should have this kind of structure:

{
    "text1": "[ STORY ] ....",
    "text2": "[ Char A ] ...."
}

NanoCode012 commented 1 year ago

{"text": "[ STORY ] ...."}
{"text": "[ Char A ] ...."}

NanoCode012 commented 1 year ago

One JSON object per line. Each object contains only a "text" key. This is just one example of how to split it. It depends on your case.
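A minimal sketch of writing that format with Python's standard `json` module follows. `dump_jsonl` is a hypothetical helper name, and the two chunks are placeholders. `json.dumps` escapes any newlines inside the text, so each record stays on a single line.

```python
import io
import json


def dump_jsonl(texts, fp):
    """Write one {"text": ...} JSON object per line to an open text file."""
    for text in texts:
        fp.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")


# Demonstrated with an in-memory buffer; a real run would pass a file opened
# with open(path, "w", encoding="utf-8") instead.
chunks = ["[ STORY ] ....", "[ Char A ] ...."]
buf = io.StringIO()
dump_jsonl(chunks, buf)
lines = buf.getvalue().splitlines()
```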

SuperMasterBlasterLaser commented 1 year ago

@NanoCode012 Using duplicate keys in JSON is invalid. Python's dict will just keep the last value and drop the others.

NanoCode012 commented 1 year ago

@SuperMasterBlasterLaser, this is JSON Lines (jsonl), where each line is a separate JSON object. It just makes loading JSON easier :)
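The difference can be seen with Python's standard `json` module. This is only an illustration with placeholder strings: a single object with a repeated key collapses to the last value, while JSON Lines is parsed one line at a time, so every record survives.

```python
import json

# One JSON object with a repeated key: the parser keeps only the last value.
merged = json.loads('{"text": "[ STORY ] ....", "text": "[ Char A ] ...."}')

# JSON Lines: each line is parsed on its own, so both records are kept.
jsonl = '{"text": "[ STORY ] ...."}\n{"text": "[ Char A ] ...."}\n'
records = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
```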

SuperMasterBlasterLaser commented 1 year ago

@NanoCode012 Understood. Thanks for the help.