OpenBMB / MiniCPM-V

MiniCPM-Llama3-V 2.5: A GPT-4V Level Multimodal LLM on Your Phone
Apache License 2.0

Can we use in-context multimodal data for finetuning? #237

Open waltonfuture opened 1 month ago

waltonfuture commented 1 month ago

Thanks for your great work! However, it seems that we can only use data that contains one image for SFT. Can we use in-context multimodal data (i.e., containing multiple images) for finetuning?

qyc-98 commented 1 month ago

Yes, the code supports multi-image finetuning.

waltonfuture commented 1 month ago

> yes, the code supports multi-image finetuning

Thank you. How should I organize my data for multi-image SFT? And how can I run inference with multiple images?
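While we wait for an official answer, here is a minimal sketch of what a multi-image SFT record could look like. This format is an assumption, not the repo's confirmed schema: the field names (`id`, `image`, `conversations`), the list-of-paths convention, and the per-image `<image>` placeholder are all hypothetical, modeled on common single-image SFT layouts.

```python
import json

# Hypothetical multi-image SFT record (assumed schema, not confirmed by
# the MiniCPM-V repo): a sample holds a list of image paths, and the user
# turn contains one <image> placeholder per image, in order of appearance.
sample = {
    "id": "0",
    "image": ["images/page_1.png", "images/page_2.png"],  # hypothetical paths
    "conversations": [
        {
            "role": "user",
            "content": "<image>\n<image>\nCompare the two charts.",
        },
        {
            "role": "assistant",
            "content": "The second chart shows a sharper rise after 2020.",
        },
    ],
}

# Sanity check: placeholder count should match the number of images,
# so the dataloader can pair each placeholder with an image tensor.
n_placeholders = sample["conversations"][0]["content"].count("<image>")
assert n_placeholders == len(sample["image"])

# A training file would typically be a JSON list of such records.
print(json.dumps([sample], indent=2))
```

Whatever the actual schema turns out to be, the placeholder-count check above is worth keeping in any preprocessing script, since a mismatch between `<image>` tokens and supplied images is a common source of silent finetuning bugs.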

haochuan-li commented 3 weeks ago

Same problem here. Any updates on multi-image SFT?

waltonfuture commented 3 weeks ago

@qyc-98 Hello! Can you provide some simple examples of in-context inference or SFT? Thanks a lot!

1SingleFeng commented 3 weeks ago

@qyc-98 I have encountered the same problem. Have you resolved it?

pbarker commented 1 week ago

+1 also curious about this