dandelin / ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"
Apache License 2.0

How to use the modal-type embedding in the output of the encoder? #67

Open rginjapan opened 2 years ago

rginjapan commented 2 years ago

Sorry, my question is: how can I use the modal-type embedding to tell which features in the output belong to which modality? Thanks in advance!
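A minimal sketch of one way to approach this, under stated assumptions rather than as a confirmed answer from the repository. It assumes the modal-type embedding is only added to the inputs before the encoder, so it cannot be read back from the outputs, and that the encoder input is the text token sequence followed by the image patch sequence. A transformer encoder keeps token positions, so under those assumptions the output can be split by index. The names `split_by_modality`, `modality_ids`, and `text_len` are hypothetical and are not part of the ViLT code.

```python
def split_by_modality(encoder_output, text_len):
    """Split one sample's encoder output (a sequence of token features) by position.

    Assumption: the first `text_len` positions are text tokens and all
    remaining positions are image patches.
    """
    text_feats = encoder_output[:text_len]
    image_feats = encoder_output[text_len:]
    return text_feats, image_feats


def modality_ids(seq_len, text_len, text_id=0, image_id=1):
    """Build a per-position modality label list (hypothetical ids: 0 = text, 1 = image)."""
    return [text_id] * text_len + [image_id] * (seq_len - text_len)


# Toy output for one sample: 3 text tokens followed by 2 image patches, hidden size 2.
toy_output = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]

text_feats, image_feats = split_by_modality(toy_output, text_len=3)
labels = modality_ids(len(toy_output), text_len=3)

print(len(text_feats), len(image_feats))
print(labels)
```

The same slicing applies along the sequence dimension of a batched tensor, provided `text_len` is the padded text length used when the two sequences were concatenated.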