facebookresearch / multimodal

TorchMultimodal is a PyTorch library for training state-of-the-art multimodal multi-task models at scale.
BSD 3-Clause "New" or "Revised" License

Incremental addition of the new modality #390

Open averkij opened 1 year ago

averkij commented 1 year ago

🚀 The feature, motivation and pitch

🤗 Hello! Thank you for your work!

I see model configurations in this repo that work with certain modalities, and that is great.

I have a question though: what if I have a pretrained encoder for another modality (e.g. audio), along with data for training (audio-text pairs and audio-image pairs)?

Alternatives

No response

Additional context

It would be great if a user who has N pretrained encoders for arbitrary modalities could pass them to some fuser model and train it to solve cross-modal tasks, or add a new modality to an existing model.

ebsmothers commented 1 year ago

Hi @averkij thanks for using the library. Can you share more specifics of the task you're working on? That way we can hopefully give more detailed and informative answers.

How can I train a model that will be able to solve tasks with my new modality?

I guess you are talking about co-learning (or something similar)? But again, if you can provide more specifics, that would be helpful.

In other words, which components should I use to fuse a new modality with the other ones? Should I implement a new model, or can I use existing components as fusers?

For fusing different modalities, we provide some generic fusion components, which can be found here.

It would be great if a user who has N pretrained encoders for arbitrary modalities could pass them to some fuser model and train it to solve cross-modal tasks, or add a new modality to an existing model.

With the fusion modules above, you should hopefully be able to do this without much trouble. They all take in Dict[str, Tensor], so you just need to put each encoder's outputs into a dict and then pass it to the fusion module. You can also see late_fusion.py, which provides a general way to set up this type of architecture.
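For illustration, here is a minimal sketch of that pattern. The ConcatFusionModule below is a stand-in written just for this example (not a class from the library), but it follows the same Dict[str, Tensor] interface:

```python
import torch
from torch import nn, Tensor
from typing import Dict

class ConcatFusionModule(nn.Module):
    """Toy fusion: concatenate per-modality embeddings and project.

    Illustrative stand-in with the Dict[str, Tensor] interface described
    above; it is not one of the library's fusion components.
    """

    def __init__(self, encoder_dims: Dict[str, int], out_dim: int):
        super().__init__()
        # Fix a key ordering so concatenation is deterministic across calls.
        self.keys = sorted(encoder_dims)
        self.proj = nn.Linear(sum(encoder_dims[k] for k in self.keys), out_dim)

    def forward(self, embeddings: Dict[str, Tensor]) -> Tensor:
        fused = torch.cat([embeddings[k] for k in self.keys], dim=-1)
        return self.proj(fused)

# Put each (pretrained) encoder's output into a dict keyed by modality...
embeddings = {
    "text": torch.randn(4, 512),   # e.g. output of a text encoder
    "audio": torch.randn(4, 256),  # e.g. output of an audio encoder
}
# ...then pass the dict to the fusion module.
fusion = ConcatFusionModule({"text": 512, "audio": 256}, out_dim=128)
fused = fusion(embeddings)  # shape: (4, 128)
```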

averkij commented 1 year ago

Hello, @ebsmothers. Thank you for the reply. Let me be more specific.

I have three unimodal encoders for different modalities (text, image, audio), which translate data into sequences. I also have datasets for different tasks across these three modalities. I want to build and train one model that will be able to solve such tasks (image captioning, ASR, audio classification, image generation, etc.).
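For concreteness, here is roughly the structure I have in mind (all class and module names are placeholders I made up, not library components):

```python
from torch import nn, Tensor
from typing import Dict

class MultimodalMultiTaskModel(nn.Module):
    """Rough sketch: N pretrained unimodal encoders, a shared fusion step,
    and one head per task. Names are placeholders, not library classes."""

    def __init__(self, encoders: Dict[str, nn.Module], fusion: nn.Module,
                 task_heads: Dict[str, nn.Module]):
        super().__init__()
        # e.g. {"text": ..., "image": ..., "audio": ...}
        self.encoders = nn.ModuleDict(encoders)
        self.fusion = fusion
        self.task_heads = nn.ModuleDict(task_heads)

    def forward(self, inputs: Dict[str, Tensor], task: str) -> Tensor:
        # Encode only the modalities present in this batch; the fusion
        # module would need to handle whichever subset shows up
        # (e.g. audio+text for ASR, image+text for captioning).
        embeddings = {k: self.encoders[k](v) for k, v in inputs.items()}
        fused = self.fusion(embeddings)
        return self.task_heads[task](fused)
```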