Multimodal Neurons in Pretrained Text-Only Transformers
https://arxiv.org/pdf/2308.01544.pdf
"In 1688, William Molyneux posed a philosophical riddle to John Locke that has remained relevant to vision science for centuries: would a blind person, immediately upon gain- ing sight, visually recognize objects previously known only through another modality, such as touch [24, 30]? A pos- itive answer to the Molyneux Problem would suggest the existence a priori of ‘amodal’ representations of objects, common across modalities. In 2011, vision neuroscien- tists first answered this question in human subjects—no, im- mediate visual recognition is not possible—but crossmodal recognition capabilities are learned rapidly, within days af- ter sight-restoring surgery [15]. More recently, language- only artificial neural networks have shown impressive per- formance on crossmodal tasks when augmented with addi- tional modalities such as vision, using techniques that leave pretrained transformer weights frozen [40, 7, 25, 28, 18]."
Image prompts cast into the transformer embedding space do not encode interpretable semantics. Translation between modalities occurs inside the transformer.
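To make the setup concrete, below is a minimal sketch (toy dimensions, randomly initialized stand-in modules, hypothetical names such as `projector`; not the paper's actual code) of the kind of frozen-LM pipeline the quoted passage refers to: image features are linearly projected into the transformer's input embedding space and prepended as soft prompts, while the language model's weights stay frozen, so any image-to-language translation has to happen inside the transformer itself. The forward hook shows where one could read out MLP activations when looking for such "multimodal neurons".

```python
# Minimal sketch of a frozen-LM multimodal setup (assumptions: toy sizes,
# randomly initialized stand-ins for the pretrained image encoder and LM).
import torch
import torch.nn as nn

d_img, d_model, n_img_tokens = 512, 768, 4  # hypothetical dimensions

image_encoder = nn.Linear(3 * 224 * 224, d_img)  # stand-in for a pretrained vision backbone
lm_embed = nn.Embedding(50257, d_model)          # stand-in for the LM's token embeddings
lm_block = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)

# Only the projection into the embedding space is trained; LM and encoder stay frozen.
projector = nn.Linear(d_img, n_img_tokens * d_model)
for module in (image_encoder, lm_embed, lm_block):
    for p in module.parameters():
        p.requires_grad_(False)

image = torch.randn(1, 3 * 224 * 224)
text_ids = torch.randint(0, 50257, (1, 8))

# Cast image features into the transformer embedding space as "soft prompt" tokens.
img_feats = image_encoder(image)                                  # (1, d_img)
soft_prompts = projector(img_feats).view(1, n_img_tokens, d_model)
inputs = torch.cat([soft_prompts, lm_embed(text_ids)], dim=1)

# Record the feed-forward (MLP) activations, where one would probe for individual
# units that respond selectively to the image-derived prompt tokens.
acts = {}
lm_block.linear1.register_forward_hook(lambda m, i, o: acts.update(mlp=o.detach()))

hidden = lm_block(inputs)
print(hidden.shape)       # (1, n_img_tokens + 8, d_model)
print(acts["mlp"].shape)  # (1, n_img_tokens + 8, dim_feedforward)
```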
Dear All!
I completely forgot to write earlier, but now that summer has arrived, we agreed to hold meetings only when there are volunteer presenters. Let us know under this issue if any of you would like to present something at some point. If there are no volunteers, that week's journal club is automatically canceled; I won't send separate notifications for that.
Best wishes and a restful summer! Bea