How to use the modal-type embedding in the output of the encoder?

dandelin / ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"

Apache License 2.0

1.36k stars 209 forks

Open rginjapan opened 2 years ago

rginjapan commented 2 years ago

Sorry, my question is: how can I use the modal-type embedding to tell which features in the encoder output belong to which modality? Thanks in advance!
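To make the question concrete, here is a minimal sketch of one likely reading. It assumes the encoder input is the concatenation of text tokens followed by image tokens, as the ViLT paper describes, and that the transformer preserves token order. Under that assumption the modal-type embedding is only added at the input, and the output features can be separated by position rather than by inspecting the embedding. The names `TEXT`, `IMAGE`, `modality_ids`, and `split_by_modality` are hypothetical and are not taken from the repository.

```python
from typing import List, Sequence, Tuple

# Hypothetical ids that would index a two-row modal-type embedding table.
TEXT, IMAGE = 0, 1


def modality_ids(text_len: int, image_len: int) -> List[int]:
    """Per-position modality ids for a [text; image] concatenated sequence."""
    return [TEXT] * text_len + [IMAGE] * image_len


def split_by_modality(hidden: Sequence, text_len: int) -> Tuple[list, list]:
    """Split per-token output features into (text, image) parts by position.

    Assumes the first `text_len` positions are text tokens and the rest are
    image tokens, i.e. the same order used when building the input.
    """
    return list(hidden[:text_len]), list(hidden[text_len:])


# Toy output for one sample: 2 text tokens followed by 3 image tokens,
# each with a 2-dimensional feature vector.
hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
ids = modality_ids(text_len=2, image_len=3)
text_feats, image_feats = split_by_modality(hidden, text_len=2)
```

With batched tensors the same idea would be a slice along the sequence dimension at the text length. Any special tokens or padding in either segment would need to be accounted for when choosing the split point.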