In mmf/models/vilt.py (L158~L164), the text embedding and image embedding are concatenated and fed forward through the encoder to get the hidden states. However, exactly the same process is already done in L276~L282, which is called from L155. So the inputs are actually forwarded through the encoder twice.
Expected behavior: remove the redundant forward code.
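To make the pattern concrete, here is a minimal sketch of the redundancy in plain Python. It is not the actual MMF code: `CountingEncoder`, `ToyModel`, `get_hidden_states`, `forward_redundant`, and `forward_fixed` are hypothetical names made up for illustration, and the "encoder" just doubles its inputs so the number of passes can be counted.

```python
class CountingEncoder:
    """Hypothetical stand-in for the encoder; counts how often it runs."""

    def __init__(self):
        self.calls = 0

    def __call__(self, embeddings):
        self.calls += 1
        # Placeholder "hidden states": double every value.
        return [2 * x for x in embeddings]


class ToyModel:
    """Hypothetical model mirroring the call structure described above."""

    def __init__(self):
        self.encoder = CountingEncoder()

    def get_hidden_states(self, text_emb, image_emb):
        # Helper that already concatenates the embeddings and encodes them.
        return self.encoder(text_emb + image_emb)

    def forward_redundant(self, text_emb, image_emb):
        # First encoder pass, via the helper.
        hidden = self.get_hidden_states(text_emb, image_emb)
        # Same concatenation and encoding repeated inline: second pass.
        hidden = self.encoder(text_emb + image_emb)
        return hidden

    def forward_fixed(self, text_emb, image_emb):
        # Reuse the helper's output so the encoder runs only once.
        return self.get_hidden_states(text_emb, image_emb)


redundant, fixed = ToyModel(), ToyModel()
out_redundant = redundant.forward_redundant([1.0, 2.0], [3.0])
out_fixed = fixed.forward_fixed([1.0, 2.0], [3.0])
print(redundant.encoder.calls, fixed.encoder.calls)
```

In this sketch both variants return the same hidden states, but the redundant one runs the encoder twice per call. If the same holds in vilt.py, dropping the duplicated block should leave the outputs unchanged while saving one encoder pass.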