Clarification on Architecture Components in mPLUG-DOCOWL2

X-PLUG / mPLUG-DocOwl

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

Apache License 2.0


Clarification on Architecture Components in mPLUG-DOCOWL2 #112

Open FeliceSchena opened 1 week ago

FeliceSchena commented 1 week ago

Hi, thanks for your effort in making the new mPLUG-DOCOWL2 open source. I'm currently trying to understand how it works. Referring to the image below, Section 3 states that there is a low-resolution vision encoder and an H-reducer. However, I can't find either of those components in the image, or they aren't depicted clearly. Could someone clarify the overall architecture? Apologies if this is a basic question!

HAWLYQ commented 5 days ago

Hi @FeliceSchena, the low-resolution vision encoder and the HReducer are both included in the High-resolution Visual Encoding module in this figure; they are not drawn as separate blocks.
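
To make that nesting concrete, here is a minimal shape-bookkeeping sketch. It is not taken from the repository: the class names are hypothetical stand-ins, and the numbers (448-pixel crops, 14-pixel patches, a 1x4 horizontal merge) are illustrative assumptions that should be checked against the paper and code. It only shows how one outer module could wrap both an encoder applied per crop and a reducer that shortens each row of features.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class LowResVisionEncoder:
    """Hypothetical stand-in for a ViT run on each fixed-size crop."""
    crop_size: int = 448   # assumed crop resolution in pixels
    patch_size: int = 14   # assumed ViT patch size in pixels

    def grid(self) -> Tuple[int, int]:
        # (rows, cols) of patch features produced for one crop
        side = self.crop_size // self.patch_size
        return side, side


@dataclass
class HReducer:
    """Hypothetical stand-in for merging horizontally adjacent features."""
    merge_width: int = 4   # assumed: every 4 neighbours in a row become 1

    def reduce(self, grid: Tuple[int, int]) -> Tuple[int, int]:
        rows, cols = grid
        return rows, cols // self.merge_width


@dataclass
class HighResVisualEncoding:
    """Outer module that contains both components, as the reply describes."""
    encoder: LowResVisionEncoder
    reducer: HReducer

    def tokens_per_crop(self) -> int:
        rows, cols = self.reducer.reduce(self.encoder.grid())
        return rows * cols

    def tokens_for_image(self, num_crops: int) -> int:
        # Counts crop tokens only; any extra global view is ignored here.
        return num_crops * self.tokens_per_crop()


module = HighResVisualEncoding(LowResVisionEncoder(), HReducer())
print(module.tokens_per_crop())
print(module.tokens_for_image(num_crops=9))
```

Under these assumed sizes, a high-resolution page is handled by running the same low-resolution encoder on every crop, and the reducer then cuts the number of visual tokens per crop before anything reaches later stages.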