facebookresearch / mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
https://mmf.sh/

FLAVA code #1219


sameeravithana commented 2 years ago

The original FLAVA paper [1] cites MMF for the implementation. We want to check whether the FLAVA implementation is accessible in this codebase.

[1] Singh, Amanpreet, et al. "FLAVA: A Foundational Language And Vision Alignment Model." arXiv preprint arXiv:2112.04482 (2021).

apsdehal commented 2 years ago

Hi,

The FLAVA codebase is on track to be released via the TorchMultimodal library. I will follow up on this issue by the end of this week with further instructions.

PeterDykas commented 2 years ago

Why are there going to be two different repositories for multimodal models? What will the difference be between TorchMultimodal and MMF?

kartikayk commented 2 years ago

Thanks for the question! We will have more detailed communication around this, but here is a quick note. MMF currently supports text + image understanding tasks, with some initial support for video understanding models added recently. We have received feedback from the community that MMF is slowly becoming over-engineered, and the layers of inheritance are making it hard to use components outside of MMF. It's also getting harder to add support for new tasks (e.g., generation), keep up with recent trends like model scaling, and extend to new modalities (audio, for example).

As we rethink the multimodal ecosystem in PyTorch, we will look to evolve MMF into a library for text + image understanding (refactor the models to be PyTorch components, deprecate the trainers and config systems, etc.) and provide more general support for combining modalities and tasks through TorchMultimodal. Our goal is to provide a collection of examples in TorchMultimodal that bring together components and infrastructure from across the ecosystem, including MMF, for training multitask multimodal models at scale. As such, TorchMultimodal is designed with extensibility and composability in mind, which makes it easy to add new modalities (and tasks) or reuse components in other frameworks. The first example of this is the official release of FLAVA in TorchMultimodal; we don't plan on adding it to MMF.
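
For concreteness, here is a minimal sketch of what that composability could look like from a user's perspective: FLAVA consumed as a plain PyTorch module, with no MMF trainer or config stack involved. The `flava_model` builder, its import path, and the forward-call keywords are assumptions for illustration, not a confirmed TorchMultimodal API; check the TorchMultimodal repository for the released interface.

```python
# Hypothetical sketch only: names below are assumed, not a confirmed API.
import torch
from torchmultimodal.models.flava.model import flava_model  # assumed entry point

model = flava_model()  # assumed builder returning a standard nn.Module
model.eval()

# Dummy inputs; real preprocessing (image transforms, tokenization) is omitted.
image = torch.randn(1, 3, 224, 224)      # one RGB image
text = torch.randint(0, 30522, (1, 77))  # dummy token ids

with torch.no_grad():
    output = model(image=image, text=text)  # assumed keyword arguments
```

The point of the sketch is that the model is just an `nn.Module`, so it can be dropped into any training loop or framework rather than being tied to one trainer and config system.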

As I mentioned, we will share a more detailed communication around this soon!

PeterDykas commented 2 years ago

Thanks for the reply, that makes sense. Looking forward to the FLAVA implementation in TorchMultimodal.