Oryx-mllm / Oryx

MLLM for On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
https://oryx-mllm.github.io

Fine-tuning data template #4

Closed Gary-code closed 2 months ago

Gary-code commented 2 months ago

Could you provide an example of a JSON template for fine-tuning multiple images?

liuzuyan commented 2 months ago

We follow the JSON format in LLaVA when conducting our experiments. The format is as follows:

{
  "image": ["path1", "path2", "path3"],
  "conversations": [
    {"from": "human", "value": "<image><image><image>\n text"},
    {"from": "gpt", "value": "response"}
  ]
}

If you use multiple images of different resolutions as input, the number of <image> placeholders should equal the number of images. If you use video data or single-image data, just use one <image> placeholder and the code will put all the frames into the sequence.
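As a minimal sketch of the rule above, the helper below builds one LLaVA-style record with one <image> token per image and checks that the placeholder count matches. The function name `make_entry` is illustrative, not part of the Oryx codebase.

```python
import json

def make_entry(image_paths, question, answer):
    # Illustrative helper (not from the Oryx repo): one <image>
    # placeholder per image, followed by the text prompt.
    placeholders = "<image>" * len(image_paths)
    return {
        "image": list(image_paths),
        "conversations": [
            {"from": "human", "value": f"{placeholders}\n{question}"},
            {"from": "gpt", "value": answer},
        ],
    }

entry = make_entry(["a.jpg", "b.jpg", "c.jpg"],
                   "Compare the three images.",
                   "response")

# Sanity check: placeholder count must equal the number of images.
assert entry["conversations"][0]["value"].count("<image>") == len(entry["image"])
print(json.dumps(entry, indent=2))
```

For video or single-image data, the same helper would be called with one path and emit a single <image> placeholder.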

Note that we pack the images into patches for our experiments, which is more efficient than loading image files one by one. We will release our packing method and data JSON in a few days.