Question about M4-Instruct datasets

Thank your for your kindly release!

But when i looking at the annotations of M4-Instruct, the FIRST sample just quite confused me. Here is the snapshot:

The human and GPT value seem to be wrong. Obviously it should be "human value" first and giving an instruction with multiple images. But in this sample, instruction is given by GPT, and answer is given by human with images.

Looking forward to your reply.

LLaVA-VL / LLaVA-NeXT

Question about M4-Instruct datasets #89