Closed lambertjf closed 1 year ago
For the pure text instruction data, you can refer to
{"text": "The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\nHuman: Construct a sentence using the given verb in the past tense\nshudder\nAI: She shuddered at the thought of being alone in the dark.", "task_type": "gpt4instruct_sft"}
where the text after "Human: " is the instruction, and the text after "AI: " is the response.
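As a minimal sketch of the format described above, the record can be assembled from an instruction and a response as follows. The helper name `build_record` is hypothetical and not part of the repository; the system prompt and the `task_type` value are copied from the example record.

```python
import json

# System prompt copied verbatim from the example record above.
SYSTEM_PROMPT = (
    "The following is a conversation between a curious human and AI assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_record(instruction, response, task_type="gpt4instruct_sft"):
    """Hypothetical helper: pack one instruction/response pair into a record."""
    # The instruction follows "Human: " and the response follows "AI: ".
    text = f"{SYSTEM_PROMPT}\nHuman: {instruction}\nAI: {response}"
    return {"text": text, "task_type": task_type}


record = build_record(
    "Construct a sentence using the given verb in the past tense\nshudder",
    "She shuddered at the thought of being alone in the dark.",
)

# A .jsonl file holds one JSON object per line.
line = json.dumps(record, ensure_ascii=False)
```

Each string produced this way would be written as a single line of the `.jsonl` file.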
For image-text instruction data, the text span remains the same, with an additional placeholder "
@MAGAer13 Could you share the script for pre-processing the different categories of data? I am very confused about how to handle the different task types. Thanks~
I understand how the data should look in the file, I'm just confused as to why there needs to be two files, train.jsonl and dev.jsonl. Which file should have what you described in it? And what should the other file contain?
@lambertjf Hello lam. I have the same question about these two files. Have you figured out what is in them? Thanks! :D
The structure of dev.jsonl is identical to that of train.jsonl. The task_type is only a tag; you just need to name it following the template xxxx_sft.
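Since both files share one record structure, one way to produce them is to split a single list of records in two. This is only a sketch: the thread does not say how the original files were split, so the 10% dev fraction, the shuffling, and the helper names `split_records` and `write_jsonl` are all assumptions.

```python
import json
import random


def split_records(records, dev_fraction=0.1, seed=0):
    """Hypothetical helper: shuffle and split records into (train, dev).

    The 10% default is an assumption, not taken from the repository.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one record in the dev split.
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


def write_jsonl(path, records):
    """Write one JSON object per line, the same layout for train and dev."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

For example, `train, dev = split_records(records)` followed by `write_jsonl("train.jsonl", train)` and `write_jsonl("dev.jsonl", dev)` gives two files with identical structure, each record carrying a `text` field and a `task_type` tag such as `gpt4instruct_sft`.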
Hello, I'm trying to finetune the model on my own dataset, but I can't figure out exactly what I need to have. A direct example of what the data needs to look like would help, because I can't work it out from the README. I have formatted my examples in the jsonl form that the README describes; however, I'm still running into issues.
Is there any way you guys could post the files referred to in the config (sft_v0.1_train.jsonl and sft_v0.1_dev.jsonl) or even just a shortened version of them with only a few examples?