Hi, thanks for your effort in making the new mPLUG-DOCOWL2 open source. I'm currently trying to understand how it works. Referring to the image below, Section 3 states that there is a low-resolution vision encoder and an H-reducer. However, I can't find either of those components in the image, or they aren't depicted clearly. Could someone clarify the overall architecture? Apologies if this is a basic question!