amazon-science / mm-cot

Official implementation for "Multimodal Chain-of-Thought Reasoning in Language Models" (stay tuned and more will be updated)
https://arxiv.org/abs/2302.00923
Apache License 2.0
3.77k stars, 309 forks

How to use the mm-cot framework as a utility library with a local LLM? #73

Open dszpr opened 8 months ago

dszpr commented 8 months ago

Hi! Thank you very much for the excellent work!

I am working on a vision-QA task using BLIP2, which consists of three modules: a ViT that extracts vision features, a Q-Former that narrows the gap between the vision and language modalities, and a T5-XXL that receives the question and the Q-Former output to generate answers.

I wonder if it's possible to use mm-cot as a utility library within the BLIP2 model to enhance vision-QA inference?

cooelf commented 4 months ago

Hi, thanks for your interest! An efficient way could be to train your framework in two steps, as MM-CoT does: (i) rationale generation; (ii) answer inference. This applies no matter what the backbone modules are.
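
The two steps above can be sketched as follows. This is only a minimal illustration, not code from the mm-cot repository or from BLIP2: `two_stage_infer`, `TwoStageResult`, the `Generator` callable signature, and the `Rationale:` prompt format are all hypothetical names chosen for the example. Each model is assumed to be any callable that takes a text prompt plus vision features and returns generated text, for example a wrapper around a fine-tuned BLIP2 pipeline.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical interface: (text_prompt, vision_features) -> generated text.
# Any backbone (e.g. a wrapped BLIP2 model) could be adapted to this shape.
Generator = Callable[[str, Any], str]


@dataclass
class TwoStageResult:
    rationale: str
    answer: str


def two_stage_infer(
    question: str,
    vision_features: Any,
    rationale_model: Generator,
    answer_model: Generator,
) -> TwoStageResult:
    """Run rationale generation, then answer inference on question + rationale."""
    # Step (i): generate a rationale from the question and the vision features.
    rationale = rationale_model(question, vision_features)

    # Step (ii): append the rationale to the question and infer the answer,
    # conditioning on the same vision features again.
    # The "Rationale:" separator is an assumption made for this sketch.
    answer_prompt = f"{question}\nRationale: {rationale}"
    answer = answer_model(answer_prompt, vision_features)

    return TwoStageResult(rationale=rationale, answer=answer)
```

To try it without any real model, pass two stub callables, for instance `lambda prompt, feats: "some text"`, and inspect the `rationale` and `answer` fields of the result. In practice the two callables would be two separately trained models (or two fine-tuned copies of the same backbone), one per step.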