Add ViT image and BERT text embedding encoders for the ViLT model.
Embedding encoders are processors for image and text data that take sample_lists as input.
Their outputs are concatenated and fed to the ViLT trunk.
Add ViTEncoder; its runtime import dependency is added in a future diff.
To be used in the future ViLT model diff.
Bump transformers to version 4.5.1 to enable the vit module.
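
For reviewers, a minimal sketch of what a runtime (deferred) import guard could look like, so that importing the encoders does not fail when an older transformers is installed. This is an illustration only, not the code in this diff: the helper names `load_optional_module` and `get_vit_module` are hypothetical, and the module path `transformers.models.vit.modeling_vit` is an assumption about where the vit module lives.

```python
import importlib
from types import ModuleType


def load_optional_module(module_name: str, install_hint: str) -> ModuleType:
    """Import `module_name` at call time instead of at file import time.

    Hypothetical helper: raises an ImportError that carries an install hint
    when the module cannot be imported.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"Could not import {module_name!r}. Try: {install_hint}"
        ) from exc


def get_vit_module() -> ModuleType:
    # Assumed module path for the ViT implementation in transformers;
    # the version in the hint is the one this diff bumps to.
    return load_optional_module(
        "transformers.models.vit.modeling_vit",
        "pip install 'transformers>=4.5.1'",
    )
```

With a guard of this shape, the encoder would call `get_vit_module()` inside its constructor or build step, so the dependency is only needed when a ViT encoder is actually instantiated.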
Diff # 2