facebookresearch / mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
https://mmf.sh/

[feat] Add ViLT image and text embeddings #1094

Closed Ryan-Qiyu-Jiang closed 3 years ago

Ryan-Qiyu-Jiang commented 3 years ago

Add ViT image and BERT text embedding encoders for the ViLT model. The embedding encoders process image and text data and take sample_lists as input. Their outputs are concatenated and fed to the ViLT trunk.
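To illustrate the data flow described above, here is a minimal pure-Python sketch, not the code from this diff. The helpers `embed_text`, `embed_image`, and `build_trunk_input`, the sample keys, and the text-before-image ordering are all hypothetical; a real implementation would operate on tensors and concatenate along the sequence dimension.

```python
from typing import Dict, List

Vector = List[float]
Sequence = List[Vector]  # one embedding vector per text token or image patch


def embed_text(token_ids: List[int], hidden_size: int) -> Sequence:
    """Hypothetical stand-in for a text embedding encoder."""
    return [[float(t)] * hidden_size for t in token_ids]


def embed_image(patches: List[float], hidden_size: int) -> Sequence:
    """Hypothetical stand-in for an image (patch) embedding encoder."""
    return [[p] * hidden_size for p in patches]


def build_trunk_input(sample: Dict[str, list], hidden_size: int = 4) -> Sequence:
    """Embed each modality, then join the two sequences for the trunk."""
    text = embed_text(sample["input_ids"], hidden_size)
    image = embed_image(sample["image_patches"], hidden_size)
    # Concatenate along the sequence axis (assumed order: text, then image).
    return text + image
```

Both encoders must emit vectors of the same hidden size so that the joined sequence can be consumed by a single transformer trunk.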

Add ViTEncoder, which imports its dependency at runtime; the dependency itself will be added in a future diff. ViTEncoder will be used by the ViLT model in a future diff.
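A runtime import of this kind is often written as a small guard that defers the import until the encoder is actually constructed. The sketch below shows the general pattern only; `lazy_import` and its `hint` parameter are hypothetical names and are not taken from this diff.

```python
import importlib
from types import ModuleType


def lazy_import(module_name: str, hint: str = "") -> ModuleType:
    """Import a module on first use and fail with a readable message.

    Hypothetical helper: it keeps the dependency optional until the
    code path that needs it is actually executed.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        message = f"'{module_name}' is required for this encoder. {hint}".strip()
        raise ImportError(message) from err
```

An encoder's constructor could call `lazy_import(...)` so that users who never build that encoder do not need the optional package installed.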

Bump the transformers version to 4.5.1 to enable the vit module.

Diff # 2

facebook-github-bot commented 3 years ago

@Ryan-Qiyu-Jiang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.