Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available.
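To make the described connector concrete, here is a minimal sketch (assuming PyTorch) of the kind of MLP vision-language projector the abstract refers to: it maps patch features from the CLIP ViT-L/336px vision encoder into the language model's embedding space. The layer count, hidden sizes, and activation shown are illustrative assumptions, not the exact released configuration.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Sketch of an MLP cross-modal connector (dimensions are assumptions)."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120):
        super().__init__()
        # Two-layer MLP with a GELU nonlinearity in place of a single
        # linear projection between the vision encoder and the LLM.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim) from the frozen
        # CLIP vision tower; the output lives in the LLM token-embedding space.
        return self.proj(vision_features)

# Usage sketch: project dummy CLIP features into the language-model space.
feats = torch.randn(1, 576, 1024)   # 576 patches for a 336px image with 14px patches
tokens = MLPProjector()(feats)      # -> (1, 576, 5120)
```

The appeal of this design, as the abstract emphasizes, is its simplicity: the connector is just a small fully-connected module, so the data and compute budget (1.2M samples, roughly one day on an 8-A100 node) stays modest.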