Oryx-mllm / Oryx

MLLM for On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
https://oryx-mllm.github.io

Fine-tuning data template #4

Closed Gary-code closed 2 months ago

Gary-code commented 2 months ago

Could you provide an example of a JSON template for fine-tuning multiple images?

liuzuyan commented 2 months ago

We follow the JSON format in LLaVA when conducting our experiments. The format is as follows:

{
  "image": ["path1", "path2", "path3"],
  "conversations": [
    {"from": "human", "value": "<image><image><image>\n text"},
    {"from": "gpt", "value": "response"}
  ]
}

If you use multiple images of different resolutions as input, the number of <image> placeholders should equal the number of images. If you use video data or single-image data, just use one <image> placeholder and the code will put all the frames into the sequence.
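As a minimal sketch of the rule above, the helper below builds one LLaVA-style record with one <image> token per image and checks that the placeholder count matches. The function name `make_entry` is illustrative, not part of the Oryx codebase.

```python
import json

def make_entry(image_paths, question, answer):
    # Illustrative helper (not from the Oryx repo): one <image>
    # placeholder per image, followed by the text prompt.
    placeholders = "<image>" * len(image_paths)
    return {
        "image": list(image_paths),
        "conversations": [
            {"from": "human", "value": f"{placeholders}\n{question}"},
            {"from": "gpt", "value": answer},
        ],
    }

entry = make_entry(["a.jpg", "b.jpg", "c.jpg"],
                   "Compare the three images.",
                   "response")

# Sanity check: placeholder count must equal the number of images.
assert entry["conversations"][0]["value"].count("<image>") == len(entry["image"])
print(json.dumps(entry, indent=2))
```

For video or single-image data, the same helper would be called with one path and emit a single <image> placeholder.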

Note that we pack the images into patches for our experiments, which is more efficient than loading image files one by one. We will release our packing method and data JSON in a few days.