EleutherAI / project-menu

See the issue board for the current status of active and prospective projects!

[Project] Flaminglet - Multimodal adapters for extending capabilities of PLMs #50

Closed TheodoreGalanos closed 1 year ago

TheodoreGalanos commented 2 years ago

Flaminglet :)

The idea is simple: train tiny Flamingo-style models on top of NeoX and evaluate their performance on vision-language tasks. Image generation would probably be out of scope, but tasks like captioning, VQA, etc. should be possible. This is essentially a multimodal adapter approach, where we insert cross-attention layers between a visual backbone and a LM, on the LM side.
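For reference, here is a rough sketch of what one such inserted layer could look like. This is plain PyTorch rather than anything from the NeoX codebase, and all names, dimensions, and details are illustrative; only the overall gated cross-attention idea follows Flamingo:

```python
import torch
import torch.nn as nn

class GatedCrossAttentionAdapter(nn.Module):
    """Minimal Flamingo-style adapter: text tokens attend to visual features,
    with tanh gates initialized to zero so the frozen LM is unchanged at init."""

    def __init__(self, d_model: int, n_heads: int, d_visual: int):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=d_model, num_heads=n_heads,
            kdim=d_visual, vdim=d_visual, batch_first=True,
        )
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model),
        )
        self.norm_attn = nn.LayerNorm(d_model)
        self.norm_ff = nn.LayerNorm(d_model)
        # Zero-initialized gates -> the adapter starts as an identity function.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ff_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden, visual_feats):
        # text_hidden: (batch, seq_len, d_model); visual_feats: (batch, n_vis, d_visual)
        attn_out, _ = self.cross_attn(self.norm_attn(text_hidden), visual_feats, visual_feats)
        x = text_hidden + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ff_gate) * self.ff(self.norm_ff(x))
        return x

# Toy forward pass with made-up shapes.
adapter = GatedCrossAttentionAdapter(d_model=512, n_heads=8, d_visual=768)
out = adapter(torch.randn(2, 16, 512), torch.randn(2, 64, 768))
```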

The output would be a new paper, along with newly pretrained models that can handle novel tasks.

Major milestones would be to:

1. Adjust the NeoX codebase to allow for Flamingo-like training (see the sketch after this list).
2. Identify potential ablations and interesting additions to the model architecture.
3. Add multimodal benchmarks to the eval harness for evaluation (perhaps including new multimodal reasoning benchmarks like Winoground).
4. Train a bunch of models.
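As a rough illustration of what milestone (1) implies, assuming we follow Flamingo and keep both the LM and the vision encoder frozen while only the new cross-attention adapters are trained. The function and placeholder modules below are hypothetical and not part of the NeoX codebase:

```python
import torch
import torch.nn as nn

def freeze_for_adapter_training(language_model: nn.Module,
                                vision_encoder: nn.Module,
                                adapters: nn.Module,
                                lr: float = 1e-4):
    """Freeze the pretrained LM and visual backbone; only the newly inserted
    cross-attention adapters receive gradients (Flamingo-style training)."""
    for p in language_model.parameters():
        p.requires_grad = False
    for p in vision_encoder.parameters():
        p.requires_grad = False
    for p in adapters.parameters():
        p.requires_grad = True
    # The optimizer only ever sees the adapter parameters.
    return torch.optim.AdamW(adapters.parameters(), lr=lr)

# Toy example with stand-in modules; in practice these would be the NeoX LM,
# a vision backbone (e.g. a CLIP ViT), and the gated cross-attention blocks.
lm = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
vit = nn.Linear(768, 512)
adapter_stack = nn.ModuleList([nn.Linear(512, 512)])
optimizer = freeze_for_adapter_training(lm, vit, adapter_stack)
```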

Ehm, I'll probably need help with everything :) This is just an idea right now, and I have little exposure to the NeoX architecture / Flamingo implementation.

We would need compute for finetuning a series of models, including ablations. I'm not certain what the requirements would be exactly, but it would certainly be less than finetuning a full NeoX model. If we followed Flamingo, the 'adapter' modules could be around 850M parameters total, with roughly 1/2 and 1/4 of that number for possible 'every N layers' ablations. Even smaller ratios could be ablated as well.
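To make the 'every N layers' knob concrete, here is a toy back-of-the-envelope calculation. The dimensions and layer counts below are placeholders, not sized to match any particular NeoX checkpoint or the 850M figure above, and it ignores biases, layer norms, and any Perceiver-style resampler:

```python
def adapter_params(d_model: int, d_visual: int, n_lm_layers: int, every_n: int) -> int:
    # One gated cross-attention block: attention projections (query from d_model,
    # key/value from d_visual, output back to d_model) plus a 4x feed-forward.
    attn = 2 * d_model * d_model + 2 * d_visual * d_model
    ff = 2 * (d_model * 4 * d_model)
    per_block = attn + ff
    n_blocks = n_lm_layers // every_n
    return per_block * n_blocks

# Placeholder model sizes, just to show how the total scales with N.
for every_n in (1, 2, 4):
    total = adapter_params(d_model=2048, d_visual=1024, n_lm_layers=24, every_n=every_n)
    print(f"every {every_n} layers -> ~{total / 1e6:.0f}M adapter params")
```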