OpenGVLab / InternVL

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching the performance of GPT-4o.
https://internvl.readthedocs.io/en/latest/
MIT License

Reason for chosen LLM from NousResearch #407

Closed: royzhang12 closed this issue 2 months ago

royzhang12 commented 2 months ago

Hi @czczup,

Thanks for this wonderful work. I have noticed that you use two models from NousResearch as your LLM backbone: [Nous-Hermes-2-Yi-34B] and [Hermes-2-Theta-Llama-3-70B]. Would you mind sharing some insight into why you chose these two instead of other models such as Qwen2, which seems to perform better on some LLM benchmarks?

Best regards

czczup commented 2 months ago

Qwen2 is indeed very powerful. However, due to certain specific reasons, we are currently unable to open-source a super large multimodal model based on Qwen.

Therefore, we opted for NousResearch's Hermes-2-Theta-Llama-3-70B instead (mainly because it is a Llama-3 model). Based on my experiments, this model outperforms the official Llama3-70B. From the benchmarks, Hermes-2-Theta-Llama-3-70B primarily falls short compared to Qwen2 in scenarios involving the Chinese language.
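For reference, a minimal sketch of loading an InternVL checkpoint built on this backbone with transformers. The model ID `OpenGVLab/InternVL2-Llama3-76B` and the `trust_remote_code` loading path are assumptions for illustration, not details stated in this thread:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed model ID for the InternVL release built on Hermes-2-Theta-Llama-3-70B.
path = "OpenGVLab/InternVL2-Llama3-76B"

# InternVL checkpoints ship custom modeling code, so trust_remote_code is required.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,   # bf16 keeps the ~76B model within multi-GPU memory budgets
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval()

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
```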

royzhang12 commented 2 months ago

@czczup Thanks for the swift reply and sharing. Would you also mind sharing some experience about which LLM benchmarks and metrics you care about most when choosing LLM backbones?

VietDunghacker commented 2 months ago

@czczup Do you have any plans to adopt a new SOTA LLM backbone, such as Gemma or the recently released Llama 3.1, for InternVL?

czczup commented 2 months ago

> @czczup Thanks for the swift reply and sharing. Would you also mind sharing some experience about which LLM benchmarks and metrics you care about most when choosing LLM backbones?

MMLU and the LMSYS Chatbot Arena are the benchmarks I care about most.
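As an illustration of that kind of comparison, here is a rough sketch of scoring candidate backbones on MMLU with EleutherAI's lm-evaluation-harness. The harness, the specific model IDs, and the arguments shown are assumptions for this sketch, not details given in the thread:

```python
# pip install lm-eval
from lm_eval import evaluator

# Hypothetical comparison of two candidate backbones on 5-shot MMLU.
for backbone in ["NousResearch/Hermes-2-Theta-Llama-3-70B", "Qwen/Qwen2-72B-Instruct"]:
    results = evaluator.simple_evaluate(
        model="hf",                                    # standard Hugging Face model loader
        model_args=f"pretrained={backbone},dtype=bfloat16",
        tasks=["mmlu"],
        num_fewshot=5,
        batch_size=8,
    )
    print(backbone, results["results"]["mmlu"])
```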

czczup commented 2 months ago

> @czczup Do you have any plans to adopt a new SOTA LLM backbone, such as Gemma or the recently released Llama 3.1, for InternVL?

I am considering using Llama 3.1, but without further improvements in the training data (i.e., using the same data as the currently released models), the enhancement in user experience or dialogue might be quite limited. This is my main concern.