ManifoldRG / MultiNet

MIT License

Identify / Prioritize 4 models for Profiling on MultiNet #65

Closed by harshsikka 3 weeks ago

harshsikka commented 1 month ago

Building on #53 and the discussion in the 07/21/2024 meeting, we want to identify the 4 best-suited models to profile for MultiNet.

See #53 for additional context around models.


harshsikka commented 1 month ago

#52 also seems to be a part of this.

devjwsong commented 1 month ago

I agree that GPT-4o and Claude would be great starting points: they are publicly well known, multimodal (text + image), and closed-source models that are easy to use.

  1. For fine-tuning GPT-4o, we should request fine-tuning access as an organization. Do we have an organization account on the OpenAI API? If so, I can submit the request form.
  2. Unfortunately, Anthropic does not currently offer fine-tuning, and Gemini models cannot currently be fine-tuned either. We would need to contact the companies directly.
  3. If neither Claude nor Gemini supports fine-tuning, we should pick another model. I don't think pure LLMs would work well with the control data, so we should consider other multi-modal models. I will look for a more suitable model in the paper I shared in #53.

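If we do get organization access, the control data would first have to be converted into the chat-format JSONL that OpenAI's fine-tuning endpoint expects. A minimal sketch of one training example (the observation encoding and the LEFT/RIGHT action vocabulary are placeholders for illustration, not something decided in this thread):

```python
import json

# One hypothetical training example for chat-format fine-tuning.
# Each line of the uploaded .jsonl file is one such JSON object.
example = {
    "messages": [
        {"role": "system",
         "content": "You are a control policy. Reply with one action token."},
        {"role": "user",
         "content": "Observation: cart position 0.1, pole angle -0.05. "
                    "Choose: LEFT or RIGHT."},
        {"role": "assistant", "content": "RIGHT"},
    ]
}

# Serialize to a single JSONL line, ready to append to the training file.
jsonl_line = json.dumps(example)
print(jsonl_line)
```

The file of such lines would then be uploaded and referenced when creating the fine-tuning job; the exact model snapshot available for tuning depends on what OpenAI grants the organization.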
devjwsong commented 1 month ago

Some thoughts on one last model here:

  1. A pure LLM would not work well with image observations, so prioritizing a VLM would be better.
  2. Since JAT and the VLAs were trained on actions but GPT-4o was not, we could add one more model without action modalities:
    • mPLUG-2: Modality-specific encoders + Text decoder.
    • NExT-GPT: Modality-specific encoders + Modality-specific decoders. (eventually only using text...?)
    • OneLLM: Unified modality encoder + Text decoder.
  3. Or we can just use another action model:
    • Unified-IO 2: Modality-specific encoders + Modality-specific decoders.
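The image-observation point above is also why the closed-source chat APIs are candidates at all: they accept interleaved image + text messages. A hedged sketch of how an image observation could be packaged for such an API (the message schema follows OpenAI's multimodal chat format; the action vocabulary is invented for illustration):

```python
import base64
import json

def build_vlm_prompt(image_bytes: bytes, instruction: str) -> list:
    """Build an OpenAI-style multimodal chat message list: the image
    observation travels as a base64 data URL next to the text instruction."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }
    ]

# Dummy bytes stand in for a real PNG frame from the control environment.
messages = build_vlm_prompt(
    b"\x89PNG...", "Pick the next action: UP, DOWN, LEFT, RIGHT.")
print(json.dumps(messages)[:80])
```

A model without this kind of image input path (a text-only LLM) would need a separate vision encoder bolted on, which is the gap the mPLUG-2 / NExT-GPT / OneLLM options are meant to fill.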
pranavguru commented 3 weeks ago

JAT, OpenVLA, GPT-4o, maybe Octo