AkihikoWatanabe / paper_notes

Paper notes, added occasionally
https://AkihikoWatanabe.github.io/paper_notes

Molmo, AI2, 2024.09 #1426

AkihikoWatanabe opened 1 day ago

AkihikoWatanabe commented 1 day ago

https://molmo.allenai.org/blog

AkihikoWatanabe commented 1 day ago

Molmo is a family of open state-of-the-art multimodal AI models. Our most powerful model closes the gap between open and proprietary systems across a wide range of academic benchmarks as well as human evaluations. Our smaller models outperform models 10x their size. While current multimodal models interpret multimodal data and express it in natural language, their full potential remains untapped. Molmo goes beyond. By learning to point at what it perceives, Molmo enables rich interactions with physical and virtual worlds, empowering the next generation of applications capable of acting and interacting with their environments.

Today's most advanced multimodal models remain proprietary. Research efforts aimed at building vision-language models (VLMs) utilizing open data lag significantly behind this state-of-the-art. Recent stronger open-weights models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed models into open ones. As a result, the community is still missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of state-of-the-art VLMs. Starting from a pre-trained vision encoder (CLIP) and language-only LLMs, the entire remainder of our VLM pipeline – weights, code, data, and evaluations – is open and free from VLM distillation. Our key innovation is a novel, highly-detailed image caption dataset collected entirely from human annotators using speech-based descriptions. To enable a wide array of capabilities, we also introduce a diverse dataset mixture for fine-tuning. This includes innovative 2D pointing data that enables Molmo to answer questions not just using natural language but also using non verbal cues. We believe this opens up important future directions for VLMs enabling agents to interact in virtual and physical worlds. The success of our approach relies on careful choices for the model architecture details, a well-tuned training pipeline, and most critically the quality of our newly collected datasets, all of which will be released. The best in class model within the Molmo family not only outperforms others in the class of open weight and data models, but also compares favorably against proprietary systems like GPT-4o, Claude 3.5 and Gemini 1.5. We will be releasing all of our model weights, captioning and fine-tuning data, and source code in the near future. Select model weights, inference code, and a public demo (using Molmo-7B-D model) are available starting today.
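As a rough reading aid for the recipe the abstract outlines (pre-trained CLIP-style vision encoder + language-only LLM, with pointing expressed through the model's output), here is a minimal PyTorch sketch. This is not the released Molmo code: the class names (`Connector`, `ToyVLM`), the dimensions, and the dummy modules are illustrative assumptions, and the pointing behaviour is assumed here to be realized as ordinary text output (coordinate strings in the generated answer) rather than a separate prediction head.

```python
import torch
import torch.nn as nn


class Connector(nn.Module):
    """Projects vision-encoder patch features into the LLM embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)


class ToyVLM(nn.Module):
    """Vision encoder -> connector -> decoder-only LLM over [visual; text] tokens."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a CLIP ViT returning patch features
        self.connector = Connector(vision_dim, llm_dim)
        self.llm = llm  # decoder-only transformer operating on input embeddings

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        patches = self.vision_encoder(images)    # (B, P, vision_dim)
        visual_tokens = self.connector(patches)  # (B, P, llm_dim)
        # Visual tokens are prepended to the text embeddings and the LLM predicts
        # text autoregressively; pointing answers are assumed to be plain text
        # (coordinate strings) in that output.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs)  # (B, P + T, vocab) next-token logits


if __name__ == "__main__":
    B, P, T, VISION_DIM, LLM_DIM, VOCAB = 2, 16, 8, 768, 1024, 32000

    class DummyEncoder(nn.Module):
        def forward(self, images: torch.Tensor) -> torch.Tensor:
            # Stand-in for a CLIP-style ViT: one feature vector per image patch.
            return torch.randn(images.shape[0], P, VISION_DIM)

    class DummyLLM(nn.Module):
        def __init__(self):
            super().__init__()
            self.head = nn.Linear(LLM_DIM, VOCAB)

        def forward(self, embeds: torch.Tensor) -> torch.Tensor:
            return self.head(embeds)

    model = ToyVLM(DummyEncoder(), DummyLLM(), VISION_DIM, LLM_DIM)
    logits = model(torch.randn(B, 3, 224, 224), torch.randn(B, T, LLM_DIM))
    print(logits.shape)  # torch.Size([2, 24, 32000])
```

The sketch only illustrates the data flow; the abstract attributes Molmo's results less to architecture than to the newly collected human-annotated caption and pointing datasets and a well-tuned training pipeline.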

Translation (by gpt-4o-mini)

AkihikoWatanabe commented 1 day ago

Below are the benchmark results (VLM benchmarks). Note that the "11 benchmarks" referred to are VLM benchmarks.

(Images: benchmark result charts from the Molmo blog)