ManifoldRG / MultiNet

MIT License

Identify / Prioritize 4 models for Profiling on MultiNet #65

Closed by harshsikka 3 weeks ago

harshsikka commented 1 month ago

Building on #53 and the discussion in the 07/21/2024 meeting, we want to identify the 4 best-suited models to profile for MultiNet.

See #53 for additional context around models.


harshsikka commented 1 month ago

#52 also seems to be a part of this.

devjwsong commented 1 month ago

I agree that GPT-4o and Claude would be great starting points: they are publicly well known, multimodal (text + image), and closed-source models that are easy to use.

  1. For fine-tuning GPT-4o, we should request fine-tuning access as an organization. Do we have an organization account on the OpenAI API? If so, I can submit the request form.
  2. Unfortunately, Anthropic does not currently offer fine-tuning, and Gemini models cannot currently be fine-tuned either. We would need to contact the companies directly.
  3. If neither Claude nor Gemini supports fine-tuning, we should pick another model. I don't think pure LLMs would work well with the control data, so we should consider other multi-modal models. I will look for a more suitable model in the paper I shared in #53.

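If we do get organization access, the control data would first have to be converted into the chat-format JSONL that OpenAI's fine-tuning endpoint expects. A minimal sketch of one training example (the observation encoding and the LEFT/RIGHT action vocabulary are placeholders for illustration, not something decided in this thread):

```python
import json

# One hypothetical training example for chat-format fine-tuning.
# Each line of the uploaded .jsonl file is one such JSON object.
example = {
    "messages": [
        {"role": "system",
         "content": "You are a control policy. Reply with one action token."},
        {"role": "user",
         "content": "Observation: cart position 0.1, pole angle -0.05. "
                    "Choose: LEFT or RIGHT."},
        {"role": "assistant", "content": "RIGHT"},
    ]
}

# Serialize to a single JSONL line, ready to append to the training file.
jsonl_line = json.dumps(example)
print(jsonl_line)
```

The file of such lines would then be uploaded and referenced when creating the fine-tuning job; the exact model snapshot available for tuning depends on what OpenAI grants the organization.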
devjwsong commented 1 month ago

Some thoughts on one last model here:

  1. A pure LLM would not work well with image observations, so prioritizing a VLM would be better.
  2. Since JAT and the VLAs were trained on actions but GPT-4o was not, we could add one more model without action modalities:
    • mPLUG-2: Modality-specific encoders + Text decoder.
    • NExT-GPT: Modality-specific encoders + Modality-specific decoders. (eventually only using text...?)
    • OneLLM: Unified modality encoder + Text decoder.
  3. Or we can just use another action model:
    • Unified-IO 2: Modality-specific encoders + Modality-specific decoders.
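The image-observation point above is also why the closed-source chat APIs are candidates at all: they accept interleaved image + text messages. A hedged sketch of how an image observation could be packaged for such an API (the message schema follows OpenAI's multimodal chat format; the action vocabulary is invented for illustration):

```python
import base64
import json

def build_vlm_prompt(image_bytes: bytes, instruction: str) -> list:
    """Build an OpenAI-style multimodal chat message list: the image
    observation travels as a base64 data URL next to the text instruction."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }
    ]

# Dummy bytes stand in for a real PNG frame from the control environment.
messages = build_vlm_prompt(
    b"\x89PNG...", "Pick the next action: UP, DOWN, LEFT, RIGHT.")
print(json.dumps(messages)[:80])
```

A model without this kind of image input path (a text-only LLM) would need a separate vision encoder bolted on, which is the gap the mPLUG-2 / NExT-GPT / OneLLM options are meant to fill.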
pranavguru commented 3 weeks ago

JAT, OpenVLA, GPT-4o, maybe Octo