Structure of dataset for finetuning

X-PLUG / mPLUG-Owl

mPLUG-Owl: The Powerful Multi-modal Large Language Model Family

https://www.modelscope.cn/studios/damo/mPLUG-Owl

MIT License


Structure of dataset for finetuning #82

lambertjf closed this issue 1 year ago

lambertjf commented 1 year ago

Hello, I'm trying to finetune the model using my own dataset, but I can't figure out exactly what the data needs to look like. A direct example would help, because I can't work it out from what the README says. I have formatted my examples in the jsonl form that the README describes, but I'm still running into issues.

Is there any way you guys could post the files referred to in the config (sft_v0.1_train.jsonl and sft_v0.1_dev.jsonl) or even just a shortened version of them with only a few examples?

MAGAer13 commented 1 year ago

For pure-text instruction data, you can refer to this example:

{"text": "The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\nHuman: Construct a sentence using the given verb in the past tense\nshudder\nAI: She shuddered at the thought of being alone in the dark.", "task_type": "gpt4instruct_sft"}

where the text after "Human: " is the instruction, and the text after "AI: " is the response.

For image-text instruction data, the text format stays the same, with an additional placeholder "" for the image and an image key that contains the paths of the images for the placeholder.
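A minimal Python sketch of building one pure-text record in the format shown above. The helper names `make_text_record` and `to_jsonl_line` are hypothetical and not part of the mPLUG-Owl repository; the prompt string, the "Human: " / "AI: " markers, and the two keys are copied from the example in this thread. The image case is left out because the placeholder token is not reproduced here.

```python
import json

# System prompt copied from the example record in this thread.
SYSTEM_PROMPT = (
    "The following is a conversation between a curious human and AI assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def make_text_record(instruction, response, task_type="gpt4instruct_sft"):
    """Build one pure-text record (hypothetical helper, not a repo API)."""
    text = f"{SYSTEM_PROMPT}\nHuman: {instruction}\nAI: {response}"
    return {"text": text, "task_type": task_type}


def to_jsonl_line(record):
    """Serialize a record as a single JSON line; newlines in the text are escaped."""
    return json.dumps(record, ensure_ascii=False)


record = make_text_record(
    "Construct a sentence using the given verb in the past tense\nshudder",
    "She shuddered at the thought of being alone in the dark.",
)
print(to_jsonl_line(record))
```

Each call to `to_jsonl_line` yields one line of the .jsonl file, so a dataset is just these lines written one after another.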

KooSung commented 1 year ago

@MAGAer13 Could you share the script for pre-processing the different categories of data? I am very confused about how to handle the different task types. Thanks~

lambertjf commented 1 year ago

I understand how the data should look in the file; I'm just confused as to why there need to be two files, train.jsonl and dev.jsonl. Which file should have what you described in it? And what should the other file contain?

wang9danzuishuai commented 1 year ago

@lambertjf Hello lam. I have the same question about these two files. Have you figured out what is in them? Thanks! :D

MAGAer13 commented 1 year ago

The structure of dev.jsonl is identical to that of train.jsonl. The task_type is only a tag; you just need to name it following the template xxxx_sft.
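Since both files share one structure, a single list of records can be split into the two files. The sketch below assumes that dev.jsonl is meant as a small held-out split of the same data, which this thread does not state explicitly; `split_records`, `write_jsonl`, and the 10% dev fraction are hypothetical choices, not something taken from the repository.

```python
import json
import random


def split_records(records, dev_fraction=0.1, seed=0):
    """Shuffle records and split them into (train, dev).

    Assumption: dev is a held-out subset with the same record structure.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one dev record whenever there is more than one record.
    n_dev = max(1, int(len(shuffled) * dev_fraction)) if len(shuffled) > 1 else 0
    return shuffled[n_dev:], shuffled[:n_dev]


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

For example, `train, dev = split_records(records)` followed by `write_jsonl("train.jsonl", train)` and `write_jsonl("dev.jsonl", dev)` produces the two files, whose paths can then be set in the config.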