AkihikoWatanabe / paper_notes

Paper notes, added occasionally
https://AkihikoWatanabe.github.io/paper_notes

Molmo, AI2, 2024.09 #1426

AkihikoWatanabe opened 1 day ago

AkihikoWatanabe commented 1 day ago

https://molmo.allenai.org/blog

AkihikoWatanabe commented 1 day ago

Molmo is a family of open state-of-the-art multimodal AI models. Our most powerful model closes the gap between open and proprietary systems across a wide range of academic benchmarks as well as human evaluations. Our smaller models outperform models 10x their size. While current multimodal models interpret multimodal data and express it in natural language, their full potential remains untapped. Molmo goes beyond. By learning to point at what it perceives, Molmo enables rich interactions with physical and virtual worlds, empowering the next generation of applications capable of acting and interacting with their environments.

Today's most advanced multimodal models remain proprietary. Research efforts aimed at building vision-language models (VLMs) utilizing open data lag significantly behind this state-of-the-art. Recent stronger open-weights models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed models into open ones. As a result, the community is still missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of state-of-the-art VLMs. Starting from a pre-trained vision encoder (CLIP) and language-only LLMs, the entire remainder of our VLM pipeline – weights, code, data, and evaluations – is open and free from VLM distillation. Our key innovation is a novel, highly-detailed image caption dataset collected entirely from human annotators using speech-based descriptions. To enable a wide array of capabilities, we also introduce a diverse dataset mixture for fine-tuning. This includes innovative 2D pointing data that enables Molmo to answer questions not just using natural language but also using non verbal cues. We believe this opens up important future directions for VLMs enabling agents to interact in virtual and physical worlds. The success of our approach relies on careful choices for the model architecture details, a well-tuned training pipeline, and most critically the quality of our newly collected datasets, all of which will be released. The best in class model within the Molmo family not only outperforms others in the class of open weight and data models, but also compares favorably against proprietary systems like GPT-4o, Claude 3.5 and Gemini 1.5. We will be releasing all of our model weights, captioning and fine-tuning data, and source code in the near future. Select model weights, inference code, and a public demo (using Molmo-7B-D model) are available starting today.
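As a rough reading aid for the recipe the abstract outlines (pre-trained CLIP-style vision encoder + language-only LLM, with pointing expressed through the model's output), here is a minimal PyTorch sketch. This is not the released Molmo code: the class names (`Connector`, `ToyVLM`), the dimensions, and the dummy modules are illustrative assumptions, and the pointing behaviour is assumed here to be realized as ordinary text output (coordinate strings in the generated answer) rather than a separate prediction head.

```python
import torch
import torch.nn as nn


class Connector(nn.Module):
    """Projects vision-encoder patch features into the LLM embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)


class ToyVLM(nn.Module):
    """Vision encoder -> connector -> decoder-only LLM over [visual; text] tokens."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a CLIP ViT returning patch features
        self.connector = Connector(vision_dim, llm_dim)
        self.llm = llm  # decoder-only transformer operating on input embeddings

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        patches = self.vision_encoder(images)    # (B, P, vision_dim)
        visual_tokens = self.connector(patches)  # (B, P, llm_dim)
        # Visual tokens are prepended to the text embeddings and the LLM predicts
        # text autoregressively; pointing answers are assumed to be plain text
        # (coordinate strings) in that output.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs)  # (B, P + T, vocab) next-token logits


if __name__ == "__main__":
    B, P, T, VISION_DIM, LLM_DIM, VOCAB = 2, 16, 8, 768, 1024, 32000

    class DummyEncoder(nn.Module):
        def forward(self, images: torch.Tensor) -> torch.Tensor:
            # Stand-in for a CLIP-style ViT: one feature vector per image patch.
            return torch.randn(images.shape[0], P, VISION_DIM)

    class DummyLLM(nn.Module):
        def __init__(self):
            super().__init__()
            self.head = nn.Linear(LLM_DIM, VOCAB)

        def forward(self, embeds: torch.Tensor) -> torch.Tensor:
            return self.head(embeds)

    model = ToyVLM(DummyEncoder(), DummyLLM(), VISION_DIM, LLM_DIM)
    logits = model(torch.randn(B, 3, 224, 224), torch.randn(B, T, LLM_DIM))
    print(logits.shape)  # torch.Size([2, 24, 32000])
```

The sketch only illustrates the data flow; the abstract attributes Molmo's results less to architecture than to the newly collected human-annotated caption and pointing datasets and a well-tuned training pipeline.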

Translation (by gpt-4o-mini)

AkihikoWatanabe commented 1 day ago

Below are the benchmark results (VLM benchmarks). Note that the "11 benchmarks" referred to are VLM benchmarks.

(Images: benchmark result charts from the Molmo blog)