Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available.
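To make the described connector concrete, here is a minimal sketch (assuming PyTorch) of the kind of MLP vision-language projector the abstract refers to: it maps patch features from the CLIP ViT-L/336px vision encoder into the language model's embedding space. The layer count, hidden sizes, and activation shown are illustrative assumptions, not the exact released configuration.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Sketch of an MLP cross-modal connector (dimensions are assumptions)."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120):
        super().__init__()
        # Two-layer MLP with a GELU nonlinearity in place of a single
        # linear projection between the vision encoder and the LLM.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim) from the frozen
        # CLIP vision tower; the output lives in the LLM token-embedding space.
        return self.proj(vision_features)

# Usage sketch: project dummy CLIP features into the language-model space.
feats = torch.randn(1, 576, 1024)   # 576 patches for a 336px image with 14px patches
tokens = MLPProjector()(feats)      # -> (1, 576, 5120)
```

The appeal of this design, as the abstract emphasizes, is its simplicity: the connector is just a small fully-connected module, so the data and compute budget (1.2M samples, roughly one day on an 8-A100 node) stays modest.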