jzhang38 opened this issue 1 year ago
In Section 1:
"The most immediate way to perform Multimodal-CoT is to transform the input of different modalities into one modality and prompt LLMs to perform CoT. For example, it is possible to extract the caption of an image by a captioning model and then concatenate the caption with the original language input to be fed into LLMs (Lu et al., 2022a). However, there is severe information loss in the captioning process; thus, using the captions (as opposed to vision features) may suffer from a lack of mutual synergy in the representation space of different modalities."
The authors didn't directly address the original poster, but I think the original poster's comments make sense and seem valid.
Dear authors,
Thanks for your exciting and solid work.
May I ask why Multimodal Chain-of-Thought is still significantly better than UnifiedQA when there is no visual input (e.g., the text-context and no-context categories of ScienceQA)? I understand that one potential reason is your decoupled framework. But even without the decoupled framework (Table 5), Multimodal-CoT outperforms UnifiedQA (Table 4) by a large margin when both are evaluated on the no-context category.
Besides, the "w/o Vision Features" setup in Table 5 shows a drastic decrease in the no-context category. Does this mean that questions without visual input also benefit from a model trained jointly with visual information?
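To make the question concrete, here is a rough sketch of the decoupled two-stage inference being referred to, with a flag that mimics dropping the vision pathway as in the "w/o Vision Features" row. The function names and the zero-feature substitution are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of a two-stage (rationale generation -> answer inference) pipeline.
# All model functions are placeholders; only the control flow is meaningful.
import numpy as np

def encode_vision(image) -> np.ndarray:
    """Placeholder for a frozen vision backbone producing patch-level features."""
    return np.random.rand(100, 256) if image is not None else np.zeros((100, 256))

def rationale_model(text: str, vision_feats: np.ndarray) -> str:
    """Placeholder stage-1 model: language input (+ fused vision) -> rationale."""
    return "an intermediate chain-of-thought rationale"

def answer_model(text: str, rationale: str, vision_feats: np.ndarray) -> str:
    """Placeholder stage-2 model: language input + rationale (+ vision) -> answer."""
    return "(B)"

def multimodal_cot(text: str, image=None, use_vision: bool = True) -> str:
    # No-context questions simply arrive with image=None; setting
    # use_vision=False additionally removes the vision pathway, which is
    # roughly what the "w/o Vision Features" ablation does.
    vision_feats = encode_vision(image) if use_vision else np.zeros((100, 256))
    rationale = rationale_model(text, vision_feats)      # stage 1
    return answer_model(text, rationale, vision_feats)   # stage 2
```

The question, in these terms, is why the no-context accuracy degrades when `use_vision=False`, even though such questions never provide an image in the first place.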