List of models for profiling

pranavguru commented 1 month ago

Identify the final set of state-of-the-art (SoTA) open-source and closed-source models to profile on MultiNet. There are three categories of models to look at:

LLM
VLM
VLA

harshsikka commented 1 month ago

Some proposed initial models after some basic thought. Higher-priority ones are bolded.

JAT
OpenVLA
gpt-4o-2024-05-13
claude-3-5-sonnet-20240620
gemini-1.5-pro-api-0514
gemma-2-27b-it
gemini-1.5-flash-api-0514
meta-llama-3-70b-instruct
mistral-large-2402
RT-2-X
Octo

Note: some of these models are closed source and sit behind APIs. Some, like RT-2-X, may not be available at all. Others may not offer fine-tuning, or may require us to reach out and request special access.

devjwsong commented 1 month ago

Got it. We can start researching those models first.

Also, I found a recent survey paper on generalist multi-modal models: https://arxiv.org/pdf/2406.05496. I think we can also consider the models listed on page 17 if this benchmark is meant to cover a wider range of modalities.

snat-s commented 1 month ago

I think it's a really good idea to show all the other modalities, and I'm down to do that; it's a really solid list of models. Here is the list as a table, in case anyone's lazy and doesn't want to click the link:

| Model | Modalities | Access | Fine-tuning | Cost (inference, input tokens only) |
| --- | --- | --- | --- | --- |
| JAT | Text, Image, Action, Reward | https://github.com/huggingface/jat | https://github.com/huggingface/jat | 0.17M (based on together.ai pricing) |
| OpenVLA | Text, Image, Action | https://github.com/openvla/openvla | https://github.com/openvla/openvla | 0.34M (based on together.ai pricing) |
| gpt-4o-2024-05-13 | Text, (Audio), Image, (Video) | https://platform.openai.com/docs/models/gpt-4o | Additional request needed. (https://share.hsforms.com/1PsPBTQDRTM6NUZp9J08oCg4sk30) | $8.50M |
| claude-3-5-sonnet-20240620 | Text, Image | https://docs.anthropic.com/en/docs/about-claude/models | - | $5.10M |
| gemini-1.5-pro-api-0514 | Text, Image, Audio, Video | https://ai.google.dev/gemini-api/docs/models/gemini#gemini-1.5-pro | - | $0.60M → Expected to be lower with context caching. |
| gemma-2-27b-it | Text | https://ai.google.dev/gemma/docs/model_card_2 | https://ai.google.dev/gemma/docs/lora_tuning | $5.07 per hour. (Nvidia A100 80GB @ GCP) |
| gemini-1.5-flash-api-0514 | Text, Image, Audio, Video | https://ai.google.dev/gemini-api/docs/models/gemini#gemini-1.5-flash | (Coming soon…?) | $0.60M → Expected to be lower with context caching. |
| meta-llama-3-70b-instruct | Text | https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct | https://llama.meta.com/docs/how-to-guides/fine-tuning/ | $1.53M (https://artificialanalysis.ai/models/llama-3-instruct-70b) |
| mistral-large-2402 | Text | https://docs.mistral.ai/getting-started/models/ | https://docs.mistral.ai/capabilities/finetuning/ | $6.8M |
| RT-2-X | Text, Image, Action | - | - | - |
| Octo | Text, Image, Action | https://github.com/octo-models/octo | https://github.com/octo-models/octo | 0.17M (based on together.ai pricing) |
| Uni-Perceiver | Text, Image, Video | https://github.com/fundamentalvision/Uni-Perceiver | https://github.com/fundamentalvision/Uni-Perceiver/blob/main/data/finetuning.md | 0.0272M (based on together.ai pricing) |
| Unified-IO 2 | Text, Image, Video, Audio, Action | https://github.com/allenai/unified-io-2 | https://github.com/allenai/unified-io-2 | 0.34M (based on together.ai pricing) |
| GATO | Text, Image, Video, Audio, Action | - | - | - |
| OFA+ | Text, Image, Video, Audio | - | - | - |
| mPLUG-2 | Text, Image, Video, Audio | https://github.com/X-PLUG/mPLUG-2 | https://github.com/X-PLUG/mPLUG-2 (A few tasks...?) | 0.17M (based on together.ai pricing) |
| Meta-Transformer | Text, Image, Video, Audio, Point cloud, Graph, Time series, Table | https://github.com/invictus717/MetaTransformer | https://github.com/invictus717/MetaTransformer | 0.0272M (based on together.ai pricing) |
| NEXT-GPT | Text, Image, Video, Audio | https://github.com/NExT-GPT/NExT-GPT | https://github.com/NExT-GPT/NExT-GPT | 0.34M (based on together.ai pricing) |
| OneLLM | Text, Image, Audio, Video, Point cloud | https://github.com/csuhan/OneLLM | https://github.com/csuhan/OneLLM | 0.34M (based on together.ai pricing) |
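
One way to map the table back onto the three categories from the top of the thread (LLM, VLM, VLA) is a simple rule on the Modalities column. This is only a sketch under an assumed heuristic, not a definition anyone here has agreed on: anything with Action counts as a VLA, anything else with a non-text modality counts as a VLM, and text-only models count as LLMs. The names `MODELS`, `categorize`, and `group_by_category` are hypothetical; the modality sets are copied from a few rows of the table.

```python
# Hypothetical helper: bucket models into LLM / VLM / VLA from their modalities.
# The rule below is an assumption for illustration, not an agreed definition.

MODELS = {
    "JAT": {"Text", "Image", "Action", "Reward"},
    "OpenVLA": {"Text", "Image", "Action"},
    "claude-3-5-sonnet-20240620": {"Text", "Image"},
    "gemma-2-27b-it": {"Text"},
    "meta-llama-3-70b-instruct": {"Text"},
    "Octo": {"Text", "Image", "Action"},
}


def categorize(modalities):
    """Return 'VLA', 'VLM', or 'LLM' for a set of modality names."""
    if "Action" in modalities:
        return "VLA"
    if modalities - {"Text"}:
        # At least one non-text modality (image, audio, video, ...).
        return "VLM"
    return "LLM"


def group_by_category(models):
    """Group model names by the category derived from their modalities."""
    groups = {"LLM": [], "VLM": [], "VLA": []}
    for name, modalities in models.items():
        groups[categorize(modalities)].append(name)
    return groups


if __name__ == "__main__":
    for category, names in group_by_category(MODELS).items():
        print(category, names)
```

The rule is deliberately coarse: a model such as Unified-IO 2 lands in VLA simply because Action appears in its row, so the buckets would still need a manual pass.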

devjwsong commented 1 month ago

@snat-s Thanks. I added more columns and contents for several models I've researched so far. The cost part might not be accurate though. I've not researched the models with empty cells yet.

Let me know if the table format is fine or needs to be adjusted.
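
Since the cost column may not be accurate, one quick way to compare rows is to turn each price into an estimated bill for a fixed workload. The sketch below assumes that figures such as `$8.50M` mean US dollars per 1M input tokens, which the table does not state explicitly, and it uses a made-up workload of 50M input tokens. `PRICE_PER_1M` and `estimate_cost` are hypothetical names, and the prices are copied from the table as-is rather than verified.

```python
# Rough cost comparison. ASSUMPTION: each price is USD per 1M input tokens.
# Prices are copied from the table in this thread and are not verified.

PRICE_PER_1M = {
    "gpt-4o-2024-05-13": 8.50,
    "claude-3-5-sonnet-20240620": 5.10,
    "gemini-1.5-pro-api-0514": 0.60,
    "meta-llama-3-70b-instruct": 1.53,
    "mistral-large-2402": 6.80,
}


def estimate_cost(price_per_1m_tokens, input_tokens):
    """Estimated USD cost of sending `input_tokens` tokens at the given price."""
    return price_per_1m_tokens * input_tokens / 1_000_000


if __name__ == "__main__":
    workload = 50_000_000  # hypothetical number of input tokens to profile
    # Print cheapest first so outliers in the table are easy to spot.
    for name, price in sorted(PRICE_PER_1M.items(), key=lambda kv: kv[1]):
        print(f"{name}: ${estimate_cost(price, workload):,.2f}")
```

Rows priced per GPU-hour (for example gemma-2-27b-it) do not fit this formula and would need a throughput estimate instead.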

pranavguru commented 4 weeks ago

Closing because same as #65

ManifoldRG / MultiNet

List of models for profiling #53