schopra8 closed this issue 3 weeks ago
Thanks for your valuable suggestion~
As a computer-vision researcher, I don't have a deep understanding of system prompts and instructions, so I care more about how different settings affect the VQA results than about how the LLM treats different prompts.
In my experiments, I have tried different settings for caption tasks (which have no question), such as:

`Instruction <INST><Image></Image></INST>Answer</s>`

or

`<INST><Image></Image> Instruction</INST>Answer</s>`

However, the two prompts performed similarly on our testing benchmarks.
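For clarity, the two layouts being compared can be sketched as string templates. This is an illustrative helper, not the repo's actual code; the token names (`<INST>`, `<Image>`, `</s>`) follow the templates quoted above, and `build_prompt` is a hypothetical name.

```python
# Sketch of the two caption-prompt layouts compared in this thread.
# The only difference is whether the instruction sits before the <INST>
# block or inside it, next to the image tokens.

IMG = "<Image></Image>"

def build_prompt(instruction: str, answer: str, inst_outside: bool) -> str:
    """Place the instruction before <INST> (variant 1) or inside it (variant 2)."""
    if inst_outside:
        # Variant 1: Instruction <INST><Image></Image></INST>Answer</s>
        return f"{instruction} <INST>{IMG}</INST>{answer}</s>"
    # Variant 2: <INST><Image></Image> Instruction</INST>Answer</s>
    return f"<INST>{IMG} {instruction}</INST>{answer}</s>"

v1 = build_prompt("Describe the image.", "A dog.", inst_outside=True)
v2 = build_prompt("Describe the image.", "A dog.", inst_outside=False)
```

Since the instruction tokens are identical in both variants and only their position relative to the image changes, similar benchmark results are plausible.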
Makes sense, thank you!
For the VQA examples it's very clear why we need a distinction between instruction and question. For datasets that only have one prompt (e.g., COCO Captioning), it's less clear to me why we should use that prompt as an "instruction" rather than a "question".
LLMs (e.g. Vicuna) distinguish between system prompts and other prompts/instructions explicitly. By defining the task as an "instruction" within the dataset, we're inserting it as the LLM's system prompt.
Is there an advantage in doing so, over defining the task as a "question" within the dataset and thereby using it as an instruction rather than a system prompt? Positioning the task as a question seems more natural to me -- especially since we already provide a system prompt during inference-time evaluation on MVBench.
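To make the two placements concrete, here is a hedged sketch using a Vicuna-v1-style chat layout (`SYSTEM ... USER: ... ASSISTANT:`). The exact delimiters and the helper names `as_system` / `as_question` are assumptions for illustration, not the repo's implementation.

```python
# Two ways to inject a dataset's task prompt, per the question above.

def as_system(task: str) -> str:
    # "Instruction" route: the task text becomes the system prompt,
    # and the user turn carries only the image tokens.
    return f"{task} USER: <Image></Image> ASSISTANT:"

def as_question(system: str, task: str) -> str:
    # "Question" route: a fixed system prompt is kept, and the task
    # text is appended to the user turn like a VQA question.
    return f"{system} USER: <Image></Image> {task} ASSISTANT:"

p1 = as_system("Describe the video in detail.")
p2 = as_question("You are a helpful assistant.", "Describe the video in detail.")
```

Under this sketch, the "question" route leaves the system-prompt slot free, which matters if evaluation (e.g., on MVBench) supplies its own system prompt at inference time.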