Closed SuperMasterBlasterLaser closed 1 year ago
Hey there: if you just want to train as completion, you can just chuck it all, one file per record, into {"text": "[STORY...."}
Though, please check the context length. How long is each of these? If a sample exceeds your model's context, it may be discarded (https://github.com/OpenAccess-AI-Collective/axolotl/blob/c93655c0a3bd8d9c2b66f85bacc174ceb81de79f/src/axolotl/utils/data.py#L381).
@NanoCode012 Thank you for the advice.
What is the maximum context length for Llama 2 70B? Isn't it 4K?
Yes, 4k.
Please let me know if that answered the question, and we can close this Issue :)
Thank you for the answer.
@NanoCode012 I forgot to ask one thing. I have a giant raw text file, about 30 MB in size. It is far longer than the 4K context.
How do I prepare the dataset as JSON? Do I need to separate each story like this?
{
"text1": "[ STORY_1 ].....",
"text2": "[ STORY_1 ].....",
"text3": "[ STORY_1 ].....",
}
Also, each story has a description. Do I need to repeat these descriptions when I split one story into several parts?
Since completion is pre-training, where the model predicts the next token, I would say just split it wherever it makes sense. We just need to experiment.
For example:
[ STORY ]
Wonderful day
[ NARRATOR ]
This is wonderful day. Young boy CharA is walking down the road.
[ CharA ]
What a wonderful day. I hope I won't be a late.
[ END_STORY ]
@NanoCode012
So you mean the JSON should have this kind of structure:
{
"text1": "[ STORY ] ....",
"text2": "[ Char A ] ...."
}
{"text": "[ STORY ] ...."}
{"text": "[ Char A ] ...."}
One JSON object per line. Each object contains only the "text" key. This is just one example of how to split it. It depends on your case.
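If it helps, here is a minimal sketch of turning the raw file into that layout. It assumes each story runs from a "[ STORY ]" line through "[ END_STORY ]", as in the example in this thread; extract_stories and write_jsonl are names made up for this sketch, and the file paths are up to you.

```python
import json
import re

# Assumption: a story starts at "[ STORY ]" and ends at "[ END_STORY ]".
STORY_RE = re.compile(r"\[ STORY \].*?\[ END_STORY \]", re.DOTALL)


def extract_stories(raw_text: str) -> list[str]:
    """Return every "[ STORY ] ... [ END_STORY ]" span found in the raw text."""
    return [match.group(0).strip() for match in STORY_RE.finditer(raw_text)]


def write_jsonl(texts: list[str], path: str) -> None:
    """Write one JSON object per line, each with a single "text" key."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            # json.dumps escapes newlines, so each record stays on one line.
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
```

Usage would be something like reading the raw file into a string, calling extract_stories on it, and passing the result to write_jsonl. Stories longer than the context would still need to be split into smaller pieces before writing.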
@NanoCode012 using duplicate keys in JSON is invalid; Python's dict will just keep the last value and drop the others.
@SuperMasterBlasterLaser, this is JSON Lines (jsonl), where each line is one JSON object. It just makes loading the JSON easier :)
@NanoCode012 Understood. Thanks for the help.
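To make the difference concrete, here is a small sketch using only Python's standard json module. The first part shows what happens with a repeated key inside a single object, and the second parses the same two records as separate lines. read_jsonl_lines is a helper name invented for this example.

```python
import json

# One object with a repeated key: Python's json module keeps only the last value.
duplicate = json.loads('{"text": "[ STORY ] ....", "text": "[ CharA ] ...."}')


def read_jsonl_lines(lines):
    """Parse each non-empty line as its own JSON object."""
    return [json.loads(line) for line in lines if line.strip()]


# The same two records as JSON Lines: nothing is lost, because each line
# is a separate object with its own "text" key.
jsonl = '{"text": "[ STORY ] ...."}\n{"text": "[ CharA ] ...."}\n'
records = read_jsonl_lines(jsonl.splitlines())
```

Here duplicate ends up with a single entry, while records holds both samples.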
Hello everyone.
According to this part of the docs, training LoRAs requires the dataset to be a JSON file.
The problem is that my dataset is a raw text file with its own structure. It looks like this:
oobabooga's web UI supports training a LoRA on a raw text file, where I just specify a separator symbol. How can I train a LoRA on raw text using this library?