ManifoldRG / MultiNet

List of models for profiling #53

Closed pranavguru closed 4 weeks ago

pranavguru commented 1 month ago

Identify the final set of SoTA open-source and closed-source models to profile on MultiNet. 3 categories of models to look at:

harshsikka commented 1 month ago

Some proposed initial models after some basic thought. Higher-priority ones are bolded.

- JAT
- OpenVLA
- gpt-4o-2024-05-13
- claude-3-5-sonnet-20240620
- gemini-1.5-pro-api-0514
- gemma-2-27b-it
- gemini-1.5-flash-api-0514
- meta-llama-3-70b-instruct
- mistral-large-2402
- RT-2-X
- Octo

Note: some of these models are closed source and sit behind APIs. Some, like RT-2-X, may not be available at all. Others may not offer fine-tuning, or may require us to reach out and request special access.

devjwsong commented 1 month ago

Got it. We can start researching those models first.

Also, I found a recent survey paper on generalist multi-modal models: https://arxiv.org/pdf/2406.05496. I think we can also consider the models listed on page 17 if this benchmark is meant to cover a wider range of modalities.

snat-s commented 1 month ago

I think it's a really good idea to show all the other modalities, and I'm down to do that. It's a really solid list of models. Here is the list as a markdown table, for anyone who doesn't want to click the link:

| Model | Modalities | Access | Fine-tuning | Cost (inference, input tokens only) |
| --- | --- | --- | --- | --- |
| JAT | Text, Image, Action, Reward | https://github.com/huggingface/jat | https://github.com/huggingface/jat | 0.17M (based on together.ai pricing) |
| OpenVLA | Text, Image, Action | https://github.com/openvla/openvla | https://github.com/openvla/openvla | 0.34M (based on together.ai pricing) |
| gpt-4o-2024-05-13 | Text, (Audio), Image, (Video) | https://platform.openai.com/docs/models/gpt-4o | Additional request needed. (https://share.hsforms.com/1PsPBTQDRTM6NUZp9J08oCg4sk30) | $8.50M |
| claude-3-5-sonnet-20240620 | Text, Image | https://docs.anthropic.com/en/docs/about-claude/models | - | $5.10M |
| gemini-1.5-pro-api-0514 | Text, Image, Audio, Video | https://ai.google.dev/gemini-api/docs/models/gemini#gemini-1.5-pro | - | $0.60M → Expected to be lower with context caching. |
| gemma-2-27b-it | Text | https://ai.google.dev/gemma/docs/model_card_2 | https://ai.google.dev/gemma/docs/lora_tuning | $5.07 per hour. (Nvidia A100 80GB @ GCP) |
| gemini-1.5-flash-api-0514 | Text, Image, Audio, Video | https://ai.google.dev/gemini-api/docs/models/gemini#gemini-1.5-flash | (Coming soon…?) | $0.60M → Expected to be lower with context caching. |
| meta-llama-3-70b-instruct | Text | https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct | https://llama.meta.com/docs/how-to-guides/fine-tuning/ | $1.53M (https://artificialanalysis.ai/models/llama-3-instruct-70b) |
| mistral-large-2402 | Text | https://docs.mistral.ai/getting-started/models/ | https://docs.mistral.ai/capabilities/finetuning/ | $6.8M |
| RT-2-X | Text, Image, Action | - | - | - |
| Octo | Text, Image, Action | https://github.com/octo-models/octo | https://github.com/octo-models/octo | 0.17M (based on together.ai pricing) |
| Uni-Perceiver | Text, Image, Video | https://github.com/fundamentalvision/Uni-Perceiver | https://github.com/fundamentalvision/Uni-Perceiver/blob/main/data/finetuning.md | 0.0272M (based on together.ai pricing) |
| Unified-IO 2 | Text, Image, Video, Audio, Action | https://github.com/allenai/unified-io-2 | https://github.com/allenai/unified-io-2 | 0.34M (based on together.ai pricing) |
| GATO | Text, Image, Video, Audio, Action | - | - | - |
| OFA+ | Text, Image, Video, Audio | - | - | - |
| mPLUG-2 | Text, Image, Video, Audio | https://github.com/X-PLUG/mPLUG-2 | https://github.com/X-PLUG/mPLUG-2 (A few tasks...?) | 0.17M (based on together.ai pricing) |
| Meta-Transformer | Text, Image, Video, Audio, Point cloud, Graph, Time series, Table | https://github.com/invictus717/MetaTransformer | https://github.com/invictus717/MetaTransformer | 0.0272M (based on together.ai pricing) |
| NEXT-GPT | Text, Image, Video, Audio | https://github.com/NExT-GPT/NExT-GPT | https://github.com/NExT-GPT/NExT-GPT | 0.34M (based on together.ai pricing) |
| OneLLM | Text, Image, Audio, Video, Point cloud | https://github.com/csuhan/OneLLM | https://github.com/csuhan/OneLLM | 0.34M (based on together.ai pricing) |

devjwsong commented 1 month ago

@snat-s Thanks. I added more columns and contents for several models I've researched so far. The cost part might not be accurate though. I've not researched the models with empty cells yet.

Let me know if the table format is fine or needs to be adjusted.
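
If it helps with shortlisting, here is a minimal Python sketch of how rows from the table could be filtered by required modalities. It is only an illustration: `ModelEntry`, `MODELS`, and `shortlist` are hypothetical names that are not part of MultiNet, the data is a hand-copied subset of the table, and a `-` in the Access column is assumed to mean "no known public access".

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class ModelEntry:
    """One table row (hypothetical structure, for illustration only)."""
    name: str
    modalities: FrozenSet[str]
    access: Optional[str]  # None where the table shows "-"


# Hand-copied subset of the table; not the full list.
MODELS = [
    ModelEntry("JAT", frozenset({"Text", "Image", "Action", "Reward"}),
               "https://github.com/huggingface/jat"),
    ModelEntry("OpenVLA", frozenset({"Text", "Image", "Action"}),
               "https://github.com/openvla/openvla"),
    ModelEntry("claude-3-5-sonnet-20240620", frozenset({"Text", "Image"}),
               "https://docs.anthropic.com/en/docs/about-claude/models"),
    ModelEntry("RT-2-X", frozenset({"Text", "Image", "Action"}), None),
    ModelEntry("Octo", frozenset({"Text", "Image", "Action"}),
               "https://github.com/octo-models/octo"),
    ModelEntry("GATO", frozenset({"Text", "Image", "Video", "Audio", "Action"}),
               None),
]


def shortlist(models: Iterable[ModelEntry],
              required: Iterable[str],
              need_access: bool = True) -> List[str]:
    """Return names of models covering every required modality.

    When need_access is True, rows without an access link are skipped.
    """
    required_set = frozenset(required)
    return [
        m.name
        for m in models
        if required_set <= m.modalities
        and (m.access is not None or not need_access)
    ]


if __name__ == "__main__":
    # Models that take images and emit actions, and that we can actually get.
    print(shortlist(MODELS, {"Image", "Action"}))
```

Passing `need_access=False` also keeps rows such as RT-2-X and GATO, which is useful for seeing which models would need an access request.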

pranavguru commented 4 weeks ago

Closing because this is the same as #65.