schopra8 closed this issue 3 weeks ago
Thanks for your valuable suggestion~
As a computer-vision researcher, I don't have a deep understanding of system prompts and instructions, so I care more about how different settings affect the VQA results than about how the LLM treats different prompts.
In my experiments, I have tried different settings for caption tasks (which have no question), such as:

`Instruction <INST><Image></Image></INST>Answer</s>`

or

`<INST><Image></Image> Instruction</INST>Answer</s>`

However, the two prompts performed similarly on our testing benchmarks.
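For clarity, the two layouts being compared can be sketched as string templates. This is an illustrative helper, not the repo's actual code; the token names (`<INST>`, `<Image>`, `</s>`) follow the templates quoted above, and `build_prompt` is a hypothetical name.

```python
# Sketch of the two caption-prompt layouts compared in this thread.
# The only difference is whether the instruction sits before the <INST>
# block or inside it, next to the image tokens.

IMG = "<Image></Image>"

def build_prompt(instruction: str, answer: str, inst_outside: bool) -> str:
    """Place the instruction before <INST> (variant 1) or inside it (variant 2)."""
    if inst_outside:
        # Variant 1: Instruction <INST><Image></Image></INST>Answer</s>
        return f"{instruction} <INST>{IMG}</INST>{answer}</s>"
    # Variant 2: <INST><Image></Image> Instruction</INST>Answer</s>
    return f"<INST>{IMG} {instruction}</INST>{answer}</s>"

v1 = build_prompt("Describe the image.", "A dog.", inst_outside=True)
v2 = build_prompt("Describe the image.", "A dog.", inst_outside=False)
```

Since the instruction tokens are identical in both variants and only their position relative to the image changes, similar benchmark results are plausible.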
Makes sense, thank you!
For the VQA examples it's very clear why we need a distinction between instruction and question. For datasets that only have one prompt (e.g., COCO Captioning), it's less clear to me why we should use that prompt as an "instruction" rather than a "question".
LLMs (e.g. Vicuna) distinguish between system prompts and other prompts/instructions explicitly. By defining the task as an "instruction" within the dataset, we're inserting it as the LLM's system prompt.
Is there an advantage in doing so, over defining the task as a "question" within the dataset and thereby using it as an instruction rather than a system prompt? Positioning the task as a question seems more natural to me -- especially since we already provide a system prompt during inference-time evaluation on MVBench.
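To make the two placements concrete, here is a hedged sketch using a Vicuna-v1-style chat layout (`SYSTEM ... USER: ... ASSISTANT:`). The exact delimiters and the helper names `as_system` / `as_question` are assumptions for illustration, not the repo's implementation.

```python
# Two ways to inject a dataset's task prompt, per the question above.

def as_system(task: str) -> str:
    # "Instruction" route: the task text becomes the system prompt,
    # and the user turn carries only the image tokens.
    return f"{task} USER: <Image></Image> ASSISTANT:"

def as_question(system: str, task: str) -> str:
    # "Question" route: a fixed system prompt is kept, and the task
    # text is appended to the user turn like a VQA question.
    return f"{system} USER: <Image></Image> {task} ASSISTANT:"

p1 = as_system("Describe the video in detail.")
p2 = as_question("You are a helpful assistant.", "Describe the video in detail.")
```

Under this sketch, the "question" route leaves the system-prompt slot free, which matters if evaluation (e.g., on MVBench) supplies its own system prompt at inference time.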