jzhang38 opened this issue 1 year ago
In Section 1:
"The most immediate way to perform Multimodal-CoT is to transform the input of different modalities into one modality and prompt LLMs to perform CoT. For example, it is possible to extract the caption of an image by a captioning model and then concatenate the caption with the original language input to be fed into LLMs (Lu et al., 2022a). However, there is severe information loss in the captioning process; thus, using the captions (as opposed to vision features) may suffer from a lack of mutual synergy in the representation space of different modalities."
The authors didn't directly address the original poster, but I think the original poster's comments make sense and seem valid.
Dear authors,
Thanks for your exciting and solid work.
May I ask why Multimodal Chain-of-Thought is still significantly better than UnifiedQA when there is no visual input (e.g., the text-context and no-context categories of ScienceQA)? I understand that one potential reason is your decoupled framework. But even without the decoupled framework (Table 5), Multimodal-CoT outperforms UnifiedQA (Table 4) by a large margin when both are evaluated on the no-context category.
Besides, the "w/o Vision Features" setup in Table 5 shows a drastic decrease in the no-context category. Does this mean that questions without visual input also benefit from a model trained jointly with visual information?
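To make the question concrete, here is a rough sketch of the decoupled two-stage inference being referred to, with a flag that mimics dropping the vision pathway as in the "w/o Vision Features" row. The function names and the zero-feature substitution are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of a two-stage (rationale generation -> answer inference) pipeline.
# All model functions are placeholders; only the control flow is meaningful.
import numpy as np

def encode_vision(image) -> np.ndarray:
    """Placeholder for a frozen vision backbone producing patch-level features."""
    return np.random.rand(100, 256) if image is not None else np.zeros((100, 256))

def rationale_model(text: str, vision_feats: np.ndarray) -> str:
    """Placeholder stage-1 model: language input (+ fused vision) -> rationale."""
    return "an intermediate chain-of-thought rationale"

def answer_model(text: str, rationale: str, vision_feats: np.ndarray) -> str:
    """Placeholder stage-2 model: language input + rationale (+ vision) -> answer."""
    return "(B)"

def multimodal_cot(text: str, image=None, use_vision: bool = True) -> str:
    # No-context questions simply arrive with image=None; setting
    # use_vision=False additionally removes the vision pathway, which is
    # roughly what the "w/o Vision Features" ablation does.
    vision_feats = encode_vision(image) if use_vision else np.zeros((100, 256))
    rationale = rationale_model(text, vision_feats)      # stage 1
    return answer_model(text, rationale, vision_feats)   # stage 2
```

The question, in these terms, is why the no-context accuracy degrades when `use_vision=False`, even though such questions never provide an image in the first place.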