dvlab-research / LLaMA-VID

Official Implementation for LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Apache License 2.0

Multi-image inference #71

Open · g-h-chen opened this issue 3 months ago

g-h-chen commented 3 months ago

Thanks for your great work! LLaMA-VID supports single-image and video input, but does it support multi-image input? What would be the quickest way to adapt it for this?

Thanks in advance!

yanwei-li commented 3 months ago

The current version does not support multi-image input, but you can add support by instruction-tuning the model with multi-image instruction data such as MIMIC-IT.
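Since LLaMA-VID represents each image with just 2 tokens (a context token and a content token), one plausible way to feed multiple images at inference time is to encode each image independently and concatenate the resulting token embeddings ahead of the text tokens. The sketch below illustrates only the token-assembly step with stand-in functions and toy projections; `encode_image`, `build_multimodal_input`, and the dimensions used are illustrative assumptions, not the repo's actual API.

```python
import numpy as np

HIDDEN = 4096          # assumed LLM hidden size (LLaMA-7B style)
TOKENS_PER_IMAGE = 2   # LLaMA-VID: one context token + one content token

def encode_image(image: np.ndarray) -> np.ndarray:
    """Stand-in for the visual encoder + token generation.

    The real pipeline uses a ViT encoder with text-guided pooling;
    here we just project per-channel means for illustration.
    """
    feat = image.mean(axis=(1, 2))                 # (3,) channel means
    tokens = np.zeros((TOKENS_PER_IMAGE, HIDDEN))
    tokens[:, :3] = feat                           # toy projection
    return tokens

def build_multimodal_input(images, text_embeds):
    """Concatenate every image's 2 tokens, then the text tokens."""
    image_tokens = [encode_image(img) for img in images]
    return np.concatenate(image_tokens + [text_embeds], axis=0)

images = [np.random.rand(3, 224, 224) for _ in range(3)]  # 3 input images
text_embeds = np.random.rand(10, HIDDEN)                  # 10 text tokens
inputs = build_multimodal_input(images, text_embeds)
print(inputs.shape)  # (3 * 2 + 10, HIDDEN) = (16, 4096)
```

Note that assembling the tokens this way only changes the input layout; as mentioned above, the model would still need instruction tuning on multi-image data (e.g. MIMIC-IT) to actually reason across images.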