Multimodal Neurons in Pretrained Text-Only Transformers
https://arxiv.org/pdf/2308.01544.pdf
"In 1688, William Molyneux posed a philosophical riddle to John Locke that has remained relevant to vision science for centuries: would a blind person, immediately upon gain- ing sight, visually recognize objects previously known only through another modality, such as touch [24, 30]? A pos- itive answer to the Molyneux Problem would suggest the existence a priori of ‘amodal’ representations of objects, common across modalities. In 2011, vision neuroscien- tists first answered this question in human subjects—no, im- mediate visual recognition is not possible—but crossmodal recognition capabilities are learned rapidly, within days af- ter sight-restoring surgery [15]. More recently, language- only artificial neural networks have shown impressive per- formance on crossmodal tasks when augmented with addi- tional modalities such as vision, using techniques that leave pretrained transformer weights frozen [40, 7, 25, 28, 18]."
Image prompts cast into the transformer embedding space do not encode interpretable semantics. Translation between modalities occurs inside the transformer.
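To make the setup concrete, below is a minimal sketch (toy dimensions, randomly initialized stand-in modules, hypothetical names such as `projector`; not the paper's actual code) of the kind of frozen-LM pipeline the quoted passage refers to: image features are linearly projected into the transformer's input embedding space and prepended as soft prompts, while the language model's weights stay frozen, so any image-to-language translation has to happen inside the transformer itself. The forward hook shows where one could read out MLP activations when looking for such "multimodal neurons".

```python
# Minimal sketch of a frozen-LM multimodal setup (assumptions: toy sizes,
# randomly initialized stand-ins for the pretrained image encoder and LM).
import torch
import torch.nn as nn

d_img, d_model, n_img_tokens = 512, 768, 4  # hypothetical dimensions

image_encoder = nn.Linear(3 * 224 * 224, d_img)  # stand-in for a pretrained vision backbone
lm_embed = nn.Embedding(50257, d_model)          # stand-in for the LM's token embeddings
lm_block = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)

# Only the projection into the embedding space is trained; LM and encoder stay frozen.
projector = nn.Linear(d_img, n_img_tokens * d_model)
for module in (image_encoder, lm_embed, lm_block):
    for p in module.parameters():
        p.requires_grad_(False)

image = torch.randn(1, 3 * 224 * 224)
text_ids = torch.randint(0, 50257, (1, 8))

# Cast image features into the transformer embedding space as "soft prompt" tokens.
img_feats = image_encoder(image)                                  # (1, d_img)
soft_prompts = projector(img_feats).view(1, n_img_tokens, d_model)
inputs = torch.cat([soft_prompts, lm_embed(text_ids)], dim=1)

# Record the feed-forward (MLP) activations, where one would probe for individual
# units that respond selectively to the image-derived prompt tokens.
acts = {}
lm_block.linear1.register_forward_hook(lambda m, i, o: acts.update(mlp=o.detach()))

hidden = lm_block(inputs)
print(hidden.shape)       # (1, n_img_tokens + 8, d_model)
print(acts["mlp"].shape)  # (1, n_img_tokens + 8, dim_feedforward)
```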
Dear All!
I completely forgot to write earlier, but now that summer has arrived, we agreed to hold meetings only when there are volunteer presenters. Let us know under this issue if any of you would like to present something at some point. If there are no volunteers, that week's journal club is automatically canceled; I won't send separate notifications for that.
Best wishes and a restful summer! Bea