URL

https://arxiv.org/abs/2405.02246
Affiliations
- Hugo Laurençon, N/A
- Léo Tronchon, N/A
- Matthieu Cord, N/A
- Victor Sanh, N/A
Abstract
- The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance. To address this issue, we conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. We release the model (base, instructed, and chat) along with the datasets created for its training.
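Translation (by gpt-4o-mini)
The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Although the literature on this topic is abundant, we observe that important decisions about VLM design are often left unjustified. We argue that these unsupported decisions hinder progress in the field by making it hard to identify which choices improve model performance. To address this problem, we conducted extensive experiments on pre-trained models, architecture choices, data, and training methods. The consolidation of our findings includes the development of Idefics2, an efficient foundational VLM with 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across a range of multimodal benchmarks and is often on par with models four times its size. We release the model (base, instructed, and chat) together with the datasets created for its training.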
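Summary (by gpt-4o-mini)
The paper points out that unsupported decisions in the design of vision-language models (VLMs) make it hard to identify what improves performance. The authors run experiments on pre-trained models, architecture, data, and training methods, and develop Idefics2, a foundational VLM with 8 billion parameters. Idefics2 achieves state-of-the-art performance on multimodal benchmarks and performs on par with models four times its size. The model and datasets are released.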

AkihikoWatanabe / paper_notes

What matters when building vision-language models?, Hugo Laurençon+, N/A, arXiv'24 #1434
