TempleX98 / MoVA

[NeurIPS 2024] MoVA: Adapting Mixture of Vision Experts to Multimodal Context

How do you handle multi-round conversations in the training and inference stages? #1

Closed · laserwave closed this 4 months ago

laserwave commented 6 months ago

Nice work. I have a question regarding how you handle multi-round conversations in both the training and inference stages. Do you have to extract the image features again for each round? A follow-up question may require the abilities of different experts.

Things become even more complicated for multi-image comprehension, since different images may need different experts. For example:

According to the text in image1 <image> and the region [0.3, 0.2, 0.5, 0.4] of image2 <image>, what can we infer?

TempleX98 commented 5 months ago

Training: Given an image $I$ and its $N$-round instructions, we concatenate the instructions $Q_1, \dots, Q_N$ into a single instruction $Q$, and assign the routing result of this single instruction to the whole training sample. A sketch is shown below.
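
A minimal sketch of this training-time routing, assuming a hypothetical `router` callable that maps an instruction string to a list of expert ids; the function name `route_training_sample` is illustrative, not the repository's actual API:

```python
from typing import Callable, List

def route_training_sample(instructions: List[str],
                          router: Callable[[str], List[int]]) -> List[int]:
    """Concatenate all N rounds of instructions into one query Q and
    route once; the resulting expert assignment is shared by the
    whole training sample."""
    # Q = Q_1 + Q_2 + ... + Q_N
    q = " ".join(instructions)
    # One routing decision for the concatenated instruction.
    return router(q)
```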

Inference: In the multi-round conversation scenario, the expert features already computed can be cached. If the assigned vision expert has been activated in an earlier round, we simply feed the cached feature and the current instruction to the MoV-Adapter. If a vision expert has never been activated but is assigned to the current instruction, we extract its features during this forward pass.
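
A minimal sketch of the inference-time cache, assuming hypothetical `vision_experts` (expert id to vision encoder) and `mov_adapter` callables; the real MoV-Adapter interface may differ:

```python
from typing import Any, Callable, Dict

class ExpertFeatureCache:
    """Caches per-expert image features across conversation rounds."""

    def __init__(self, vision_experts: Dict[int, Callable]):
        # Hypothetical mapping: expert_id -> vision encoder callable.
        self.vision_experts = vision_experts
        self.cache: Dict[int, Any] = {}  # expert_id -> cached feature

    def get_features(self, expert_id: int, image) -> Any:
        """Return cached features if this expert ran in an earlier round;
        otherwise run the expert once now and cache the result."""
        if expert_id not in self.cache:
            # Expert never activated before: extract its features
            # during this forward pass.
            self.cache[expert_id] = self.vision_experts[expert_id](image)
        return self.cache[expert_id]

# Per round: route the current instruction, fetch (possibly cached)
# expert features, then feed them with the instruction to the adapter:
#   feats = [cache.get_features(e, image) for e in router(instruction)]
#   output = mov_adapter(feats, instruction)
```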