BradyFU / Awesome-Multimodal-Large-Language-Models

:sparkles::sparkles: Latest Advances on Multimodal Large Language Models

Awesome-Multimodal-Large-Language-Models

Our MLLM works

🔥🔥🔥 A Survey on Multimodal Large Language Models
Project Page [This Page] | Paper

The first comprehensive survey for Multimodal Large Language Models (MLLMs). :sparkles:

Add WeChat ID (wmd_ustc) to join our MLLM communication group! :star2:


🔥🔥🔥 Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM

[🍎 Project Page] [📖 arXiv Paper] [🌟 GitHub]

The VITA team proposes Freeze-Omni, a speech-to-speech dialogue model that combines low latency with high intelligence and is trained on top of a frozen LLM. 🌟

Freeze-Omni stays smart because it is built on a frozen text-modality LLM. It therefore retains the original intelligence of the LLM backbone and avoids the forgetting that fine-tuning for speech-modality integration would otherwise induce. ✨


🔥🔥🔥 VITA: Towards Open-Source Interactive Omni Multimodal LLM

[🍎 Project Page] [📖 arXiv Paper] [🌟 GitHub] [🤗 Hugging Face] [💬 WeChat (微信)]


🔥🔥🔥 Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Project Page | Paper | GitHub | Dataset | Leaderboard

We are very proud to launch Video-MME, the first-ever comprehensive evaluation benchmark of MLLMs in Video Analysis! 🌟

It covers short (< 2 min), medium (4\~15 min), and long (30\~60 min) videos, with durations ranging from 11 seconds to 1 hour. All data are newly collected and annotated by humans rather than drawn from any existing video dataset. ✨


🔥🔥🔥 MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Paper | Download | Eval Tool | :black_nib: Citation

A representative evaluation benchmark for MLLMs. :sparkles:


🔥🔥🔥 Woodpecker: Hallucination Correction for Multimodal Large Language Models
Paper | GitHub

This is the first work to correct hallucination in multimodal large language models. :sparkles:


Table of Contents

Awesome Papers

Multimodal Instruction Tuning

| Title | Venue | Date | Code | Demo |
|:--|:--:|:--:|:--:|:--:|
| LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | arXiv | 2024-10-22 | GitHub | Demo |
| Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate | arXiv | 2024-10-09 | GitHub | - |
| Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models | arXiv | 2024-09-25 | Hugging Face | Demo |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | arXiv | 2024-09-18 | GitHub | Demo |
| LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture | arXiv | 2024-09-04 | GitHub | - |
| EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders | arXiv | 2024-08-28 | GitHub | Demo |
| mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | arXiv | 2024-08-09 | GitHub | - |
| VITA: Towards Open-Source Interactive Omni Multimodal LLM | arXiv | 2024-08-09 | GitHub | - |
| LLaVA-OneVision: Easy Visual Task Transfer | arXiv | 2024-08-06 | GitHub | Demo |
| MiniCPM-V: A GPT-4V Level MLLM on Your Phone | arXiv | 2024-08-03 | GitHub | Demo |
| VILA^2: VILA Augmented VILA | arXiv | 2024-07-24 | - | - |
| SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | arXiv | 2024-07-22 | - | - |
| EVLM: An Efficient Vision-Language Model for Visual Understanding | arXiv | 2024-07-19 | - | - |
| IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model | arXiv | 2024-07-10 | GitHub | - |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | arXiv | 2024-07-03 | GitHub | Demo |
| OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding | arXiv | 2024-06-27 | GitHub | Local Demo |
| Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | arXiv | 2024-06-24 | GitHub | Local Demo |
| Long Context Transfer from Language to Vision | arXiv | 2024-06-24 | GitHub | Local Demo |
| video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models | ICML | 2024-06-22 | GitHub | - |
| Unveiling Encoder-Free Vision-Language Models | arXiv | 2024-06-17 | GitHub | Local Demo |
| RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics | CoRL | 2024-06-15 | GitHub | Demo |
| Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models | arXiv | 2024-06-12 | GitHub | - |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | arXiv | 2024-06-11 | GitHub | Local Demo |
| Parrot: Multilingual Visual Instruction Tuning | arXiv | 2024-06-04 | GitHub | - |
| Ovis: Structural Embedding Alignment for Multimodal Large Language Model | arXiv | 2024-05-31 | GitHub | - |
| Matryoshka Query Transformer for Large Vision-Language Models | arXiv | 2024-05-29 | GitHub | Demo |
| ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | arXiv | 2024-05-24 | GitHub | - |
| Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models | arXiv | 2024-05-24 | GitHub | Demo |
| Libra: Building Decoupled Vision System on Large Language Models | ICML | 2024-05-16 | GitHub | Local Demo |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | arXiv | 2024-05-09 | GitHub | Local Demo |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | arXiv | 2024-04-25 | GitHub | Demo |
| Graphic Design with Large Multimodal Model | arXiv | 2024-04-22 | GitHub | - |
| InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD | arXiv | 2024-04-09 | GitHub | Demo |
| Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs | arXiv | 2024-04-08 | - | - |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | CVPR | 2024-04-08 | GitHub | - |
| TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model | ACM TKDD | 2024-03-28 | - | - |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | arXiv | 2024-03-27 | GitHub | Demo |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | arXiv | 2024-03-14 | - | - |
| MoAI: Mixture of All Intelligence for Large Language and Vision Models | arXiv | 2024-03-12 | GitHub | Local Demo |
| TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | arXiv | 2024-03-07 | GitHub | Demo |
| The All-Seeing Project V2: Towards General Relation Comprehension of the Open World | arXiv | 2024-02-29 | GitHub | - |
| GROUNDHOG: Grounding Large Language Models to Holistic Segmentation | CVPR | 2024-02-26 | Coming soon | Coming soon |
| AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | arXiv | 2024-02-19 | GitHub | - |
| Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning | arXiv | 2024-02-18 | GitHub | - |
| ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | arXiv | 2024-02-18 | GitHub | Demo |
| CoLLaVO: Crayon Large Language and Vision mOdel | arXiv | 2024-02-17 | GitHub | - |
| CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations | arXiv | 2024-02-06 | GitHub | - |
| MobileVLM V2: Faster and Stronger Baseline for Vision Language Model | arXiv | 2024-02-06 | GitHub | - |
| GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning | NeurIPS | 2024-02-03 | GitHub | - |
| Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study | arXiv | 2024-01-31 | Coming soon | - |
| LLaVA-NeXT: Improved reasoning, OCR, and world knowledge | Blog | 2024-01-30 | GitHub | Demo |
| MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | arXiv | 2024-01-29 | GitHub | Demo |
| InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | arXiv | 2024-01-29 | GitHub | Demo |
| Yi-VL | - | 2024-01-23 | GitHub | Local Demo |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | arXiv | 2024-01-22 | - | - |
| ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning | ACL | 2024-01-04 | GitHub | Local Demo |
| MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices | arXiv | 2023-12-28 | GitHub | - |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | CVPR | 2023-12-21 | GitHub | Demo |
| Osprey: Pixel Understanding with Visual Instruction Tuning | CVPR | 2023-12-15 | GitHub | Demo |
| CogAgent: A Visual Language Model for GUI Agents | arXiv | 2023-12-14 | GitHub | Coming soon |
| Pixel Aligned Language Models | arXiv | 2023-12-14 | Coming soon | - |
| See, Say, and Segment: Teaching LMMs to Overcome False Premises | arXiv | 2023-12-13 | Coming soon | - |
| Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | ECCV | 2023-12-11 | GitHub | Demo |
| Honeybee: Locality-enhanced Projector for Multimodal LLM | CVPR | 2023-12-11 | GitHub | - |
| Gemini: A Family of Highly Capable Multimodal Models | Google | 2023-12-06 | - | - |
| OneLLM: One Framework to Align All Modalities with Language | arXiv | 2023-12-06 | GitHub | Demo |
| Lenna: Language Enhanced Reasoning Detection Assistant | arXiv | 2023-12-05 | GitHub | - |
| VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding | arXiv | 2023-12-04 | - | - |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | arXiv | 2023-12-04 | GitHub | Local Demo |
| Making Large Multimodal Models Understand Arbitrary Visual Prompts | CVPR | 2023-12-01 | GitHub | Demo |
| Dolphins: Multimodal Language Model for Driving | arXiv | 2023-12-01 | GitHub | - |
| LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | arXiv | 2023-11-30 | GitHub | Coming soon |
| VTimeLLM: Empower LLM to Grasp Video Moments | arXiv | 2023-11-30 | GitHub | Local Demo |
| mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | arXiv | 2023-11-30 | GitHub | - |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | arXiv | 2023-11-28 | GitHub | Coming soon |
| LLMGA: Multimodal Large Language Model based Generation Assistant | arXiv | 2023-11-27 | GitHub | Demo |
| ChartLlama: A Multimodal LLM for Chart Understanding and Generation | arXiv | 2023-11-27 | GitHub | - |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | arXiv | 2023-11-21 | GitHub | Demo |
| LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge | arXiv | 2023-11-20 | GitHub | - |
| An Embodied Generalist Agent in 3D World | arXiv | 2023-11-18 | GitHub | Demo |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | arXiv | 2023-11-16 | GitHub | Demo |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | CVPR | 2023-11-14 | GitHub | - |
| To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | arXiv | 2023-11-13 | GitHub | - |
| SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | arXiv | 2023-11-13 | GitHub | Demo |
| Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | CVPR | 2023-11-11 | GitHub | Demo |
| LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | arXiv | 2023-11-09 | GitHub | Demo |
| NExT-Chat: An LMM for Chat, Detection and Segmentation | arXiv | 2023-11-08 | GitHub | Local Demo |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | arXiv | 2023-11-07 | GitHub | Demo |
| OtterHD: A High-Resolution Multi-modality Model | arXiv | 2023-11-07 | GitHub | - |
| CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding | arXiv | 2023-11-06 | Coming soon | - |
| GLaMM: Pixel Grounding Large Multimodal Model | CVPR | 2023-11-06 | GitHub | Demo |
| What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | arXiv | 2023-11-02 | GitHub | - |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | arXiv | 2023-10-14 | GitHub | Local Demo |
| SALMONN: Towards Generic Hearing Abilities for Large Language Models | ICLR | 2023-10-20 | GitHub | - |
| Ferret: Refer and Ground Anything Anywhere at Any Granularity | arXiv | 2023-10-11 | GitHub | - |
| CogVLM: Visual Expert For Large Language Models | arXiv | 2023-10-09 | GitHub | Demo |
| Improved Baselines with Visual Instruction Tuning | arXiv | 2023-10-05 | GitHub | Demo |
| LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | ICLR | 2023-10-03 | GitHub | Demo |
| Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs | arXiv | 2023-10-01 | GitHub | - |
| Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants | arXiv | 2023-10-01 | GitHub | Local Demo |
| AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model | arXiv | 2023-09-27 | - | - |
| InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | arXiv | 2023-09-26 | GitHub | Local Demo |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | ICLR | 2023-09-20 | GitHub | Coming soon |
| An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models | arXiv | 2023-09-18 | Coming soon | - |
| TextBind: Multi-turn Interleaved Multimodal Instruction-following | arXiv | 2023-09-14 | GitHub | Demo |
| NExT-GPT: Any-to-Any Multimodal LLM | arXiv | 2023-09-11 | GitHub | Demo |
| Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics | arXiv | 2023-09-13 | GitHub | - |
| ImageBind-LLM: Multi-modality Instruction Tuning | arXiv | 2023-09-07 | GitHub | Demo |
| Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning | arXiv | 2023-09-05 | - | - |
| PointLLM: Empowering Large Language Models to Understand Point Clouds | arXiv | 2023-08-31 | GitHub | Demo |
| ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | arXiv | 2023-08-31 | GitHub | Local Demo |
| MLLM-DataEngine: An Iterative Refinement Approach for MLLM | arXiv | 2023-08-25 | GitHub | - |
| Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models | arXiv | 2023-08-25 | GitHub | Demo |
| Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities | arXiv | 2023-08-24 | GitHub | Demo |
| Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages | ICLR | 2023-08-23 | GitHub | Demo |
| StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | arXiv | 2023-08-20 | GitHub | - |
| BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions | arXiv | 2023-08-19 | GitHub | Demo |
| Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions | arXiv | 2023-08-08 | GitHub | - |
| The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | ICLR | 2023-08-03 | GitHub | Demo |
| LISA: Reasoning Segmentation via Large Language Model | arXiv | 2023-08-01 | GitHub | Demo |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | arXiv | 2023-07-31 | GitHub | Local Demo |
| 3D-LLM: Injecting the 3D World into Large Language Models | arXiv | 2023-07-24 | GitHub | - |
| ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | arXiv | 2023-07-18 | - | Demo |
| BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | arXiv | 2023-07-17 | GitHub | Demo |
| SVIT: Scaling up Visual Instruction Tuning | arXiv | 2023-07-09 | GitHub | - |
| GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | arXiv | 2023-07-07 | GitHub | Demo |
| What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | arXiv | 2023-07-05 | GitHub | - |
| mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | arXiv | 2023-07-04 | GitHub | Demo |
| Visual Instruction Tuning with Polite Flamingo | arXiv | 2023-07-03 | GitHub | Demo |
| LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | arXiv | 2023-06-29 | GitHub | Demo |
| Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | arXiv | 2023-06-27 | GitHub | Demo |
| MotionGPT: Human Motion as a Foreign Language | arXiv | 2023-06-26 | GitHub | - |
| Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | arXiv | 2023-06-15 | GitHub | Coming soon |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | arXiv | 2023-06-11 | GitHub | Demo |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | arXiv | 2023-06-08 | GitHub | Demo |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | arXiv | 2023-06-08 | GitHub | Demo |
| M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | arXiv | 2023-06-07 | - | - |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | arXiv | 2023-06-05 | GitHub | Demo |
| LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | arXiv | 2023-06-01 | GitHub | - |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | arXiv | 2023-05-30 | GitHub | Demo |
| PandaGPT: One Model To Instruction-Follow Them All | arXiv | 2023-05-25 | GitHub | Demo |
| ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | arXiv | 2023-05-25 | GitHub | - |
| Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models | arXiv | 2023-05-24 | GitHub | Local Demo |
| DetGPT: Detect What You Need via Reasoning | arXiv | 2023-05-23 | GitHub | Demo |
| Pengi: An Audio Language Model for Audio Tasks | NeurIPS | 2023-05-19 | GitHub | - |
| VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks | arXiv | 2023-05-18 | GitHub | - |
| Listen, Think, and Understand | arXiv | 2023-05-18 | GitHub | Demo |
| VisualGLM-6B | - | 2023-05-17 | GitHub | Local Demo |
| PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | arXiv | 2023-05-17 | GitHub | - |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | arXiv | 2023-05-11 | GitHub | Local Demo |
| VideoChat: Chat-Centric Video Understanding | arXiv | 2023-05-10 | GitHub | Demo |
| MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | arXiv | 2023-05-08 | GitHub | Demo |
| X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | arXiv | 2023-05-07 | GitHub | - |
| LMEye: An Interactive Perception Network for Large Language Models | arXiv | 2023-05-05 | GitHub | Local Demo |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | arXiv | 2023-04-28 | GitHub | Demo |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | arXiv | 2023-04-27 | GitHub | Demo |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | arXiv | 2023-04-20 | GitHub | - |
| Visual Instruction Tuning | NeurIPS | 2023-04-17 | GitHub | Demo |
| LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention | ICLR | 2023-03-28 | GitHub | Demo |
| MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | ACL | 2022-12-21 | GitHub | - |

Multimodal Hallucination

| Title | Venue | Date | Code | Demo |
|:--|:--:|:--:|:--:|:--:|
| Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models | arXiv | 2024-10-04 | GitHub | - |
| Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations | arXiv | 2024-10-03 | GitHub | - |
| FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs | arXiv | 2024-09-20 | Link | - |
| Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation | arXiv | 2024-08-01 | - | - |
| Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs | ECCV | 2024-07-31 | GitHub | - |
| Evaluating and Analyzing Relationship Hallucinations in LVLMs | ICML | 2024-06-24 | GitHub | - |
| AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention | arXiv | 2024-06-18 | GitHub | - |
| CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models | arXiv | 2024-06-04 | Coming soon | - |
| VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap | arXiv | 2024-05-24 | Coming soon | - |
| Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback | arXiv | 2024-04-22 | - | - |
| Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding | arXiv | 2024-03-27 | - | - |
| What if...?: Counterfactual Inception to Mitigate Hallucination Effects in Large Multimodal Models | arXiv | 2024-03-20 | GitHub | - |
| Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization | arXiv | 2024-03-13 | - | - |
| Debiasing Multimodal Large Language Models | arXiv | 2024-03-08 | GitHub | - |
| HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding | arXiv | 2024-03-01 | GitHub | - |
| IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding | arXiv | 2024-02-28 | - | - |
| Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective | arXiv | 2024-02-22 | GitHub | - |
| Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models | arXiv | 2024-02-18 | GitHub | - |
| The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs | arXiv | 2024-02-06 | GitHub | - |
| Unified Hallucination Detection for Multimodal Large Language Models | arXiv | 2024-02-05 | GitHub | - |
| A Survey on Hallucination in Large Vision-Language Models | arXiv | 2024-02-01 | - | - |
| Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models | arXiv | 2024-01-18 | - | - |
| Hallucination Augmented Contrastive Learning for Multimodal Large Language Model | arXiv | 2023-12-12 | GitHub | - |
| MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations | arXiv | 2023-12-06 | GitHub | - |
| Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites | arXiv | 2023-12-04 | GitHub | - |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | arXiv | 2023-12-01 | GitHub | Demo |
| OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation | CVPR | 2023-11-29 | GitHub | - |
| Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding | CVPR | 2023-11-28 | GitHub | - |
| Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization | arXiv | 2023-11-28 | GitHub | Coming soon |
| Mitigating Hallucination in Visual Language Models with Visual Supervision | arXiv | 2023-11-27 | - | - |
| HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data | arXiv | 2023-11-22 | GitHub | - |
| An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation | arXiv | 2023-11-13 | GitHub | - |
| FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models | arXiv | 2023-11-02 | GitHub | - |
| Woodpecker: Hallucination Correction for Multimodal Large Language Models | arXiv | 2023-10-24 | GitHub | Demo |
| Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models | arXiv | 2023-10-09 | - | - |
| HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption | arXiv | 2023-10-03 | GitHub | - |
| Analyzing and Mitigating Object Hallucination in Large Vision-Language Models | ICLR | 2023-10-01 | GitHub | - |
| Aligning Large Multimodal Models with Factually Augmented RLHF | arXiv | 2023-09-25 | GitHub | Demo |
| Evaluation and Mitigation of Agnosia in Multimodal Large Language Models | arXiv | 2023-09-07 | - | - |
| CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning | arXiv | 2023-09-05 | - | - |
| Evaluation and Analysis of Hallucination in Large Vision-Language Models | arXiv | 2023-08-29 | GitHub | - |
| VIGC: Visual Instruction Generation and Correction | arXiv | 2023-08-24 | GitHub | Demo |
| Detecting and Preventing Hallucinations in Large Vision Language Models | arXiv | 2023-08-11 | - | - |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | ICLR | 2023-06-26 | GitHub | Demo |
| Evaluating Object Hallucination in Large Vision-Language Models | EMNLP | 2023-05-17 | GitHub | - |

Multimodal In-Context Learning

| Title | Venue | Date | Code | Demo |
|:--|:--:|:--:|:--:|:--:|
| Visual In-Context Learning for Large Vision-Language Models | arXiv | 2024-02-18 | - | - |
| Can MLLMs Perform Text-to-Image In-Context Learning? | arXiv | 2024-02-02 | GitHub | - |
| Generative Multimodal Models are In-Context Learners | CVPR | 2023-12-20 | GitHub | Demo |
| Hijacking Context in Large Multi-modal Models | arXiv | 2023-12-07 | - | - |
| Towards More Unified In-context Visual Understanding | arXiv | 2023-12-05 | - | - |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | arXiv | 2023-09-14 | GitHub | Demo |
| Link-Context Learning for Multimodal LLMs | arXiv | 2023-08-15 | GitHub | Demo |
| OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | arXiv | 2023-08-02 | GitHub | Demo |
| Med-Flamingo: a Multimodal Medical Few-shot Learner | arXiv | 2023-07-27 | GitHub | Local Demo |
| Generative Pretraining in Multimodality | ICLR | 2023-07-11 | GitHub | Demo |
| AVIS: Autonomous Visual Information Seeking with Large Language Models | arXiv | 2023-06-13 | - | - |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | arXiv | 2023-06-08 | GitHub | Demo |
| Exploring Diverse In-Context Configurations for Image Captioning | NeurIPS | 2023-05-24 | GitHub | - |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | arXiv | 2023-04-19 | GitHub | Demo |
| HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace | arXiv | 2023-03-30 | GitHub | Demo |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | arXiv | 2023-03-20 | GitHub | Demo |
| ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction | ICCV | 2023-03-09 | GitHub | - |
| Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering | CVPR | 2023-03-03 | GitHub | - |
| Visual Programming: Compositional visual reasoning without training | CVPR | 2022-11-18 | GitHub | Local Demo |
| An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | AAAI | 2022-06-28 | GitHub | - |
| Flamingo: a Visual Language Model for Few-Shot Learning | NeurIPS | 2022-04-29 | GitHub | Demo |
| Multimodal Few-Shot Learning with Frozen Language Models | NeurIPS | 2021-06-25 | - | - |

Multimodal Chain-of-Thought

| Title | Venue | Date | Code | Demo |
|:--|:--:|:--:|:--:|:--:|
| Cantor: Inspiring Multimodal Chain-of-Thought of MLLM | arXiv | 2024-04-24 | GitHub | Local Demo |
| Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models | arXiv | 2024-03-25 | GitHub | Local Demo |
| DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models | NeurIPS | 2023-10-25 | GitHub | - |
| Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | arXiv | 2023-06-27 | GitHub | Demo |
| Explainable Multimodal Emotion Reasoning | arXiv | 2023-06-27 | GitHub | - |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | arXiv | 2023-05-24 | GitHub | - |
| Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | arXiv | 2023-05-23 | - | - |
| T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering | arXiv | 2023-05-05 | - | - |
| Caption Anything: Interactive Image Description with Diverse Multimodal Controls | arXiv | 2023-05-04 | GitHub | Demo |
| Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings | arXiv | 2023-05-03 | Coming soon | - |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | arXiv | 2023-04-19 | GitHub | Demo |
| Chain of Thought Prompt Tuning in Vision Language Models | arXiv | 2023-04-16 | Coming soon | - |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | arXiv | 2023-03-20 | GitHub | Demo |
| Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models | arXiv | 2023-03-08 | GitHub | Demo |
| Multimodal Chain-of-Thought Reasoning in Language Models | arXiv | 2023-02-02 | GitHub | - |
| Visual Programming: Compositional visual reasoning without training | CVPR | 2022-11-18 | GitHub | Local Demo |
| Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | NeurIPS | 2022-09-20 | GitHub | - |

LLM-Aided Visual Reasoning

| Title | Venue | Date | Code | Demo |
|:--|:--:|:--:|:--:|:--:|
| Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models | arXiv | 2024-03-27 | GitHub | - |
| V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs | arXiv | 2023-12-21 | GitHub | Local Demo |
| LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing | arXiv | 2023-11-01 | GitHub | Demo |
| MM-VID: Advancing Video Understanding with GPT-4V(vision) | arXiv | 2023-10-30 | - | - |
| ControlLLM: Augment Language Models with Tools by Searching on Graphs | arXiv | 2023-10-26 | GitHub | - |
| Woodpecker: Hallucination Correction for Multimodal Large Language Models | arXiv | 2023-10-24 | GitHub | Demo |
| MindAgent: Emergent Gaming Interaction | arXiv | 2023-09-18 | GitHub | - |
| Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language | arXiv | 2023-06-28 | GitHub | Demo |
| Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models | arXiv | 2023-06-15 | - | - |
| AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn | arXiv | 2023-06-14 | GitHub | - |
| AVIS: Autonomous Visual Information Seeking with Large Language Models | arXiv | 2023-06-13 | - | - |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | arXiv | 2023-05-30 | GitHub | Demo |
| Mindstorms in Natural Language-Based Societies of Mind | arXiv | 2023-05-26 | - | - |
| LayoutGPT: Compositional Visual Planning and Generation with Large Language Models | arXiv | 2023-05-24 | GitHub | - |
| IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models | arXiv | 2023-05-24 | GitHub | Local Demo |
| Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | arXiv | 2023-05-10 | GitHub | - |
| Caption Anything: Interactive Image Description with Diverse Multimodal Controls | arXiv | 2023-05-04 | GitHub | Demo |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | arXiv | 2023-04-19 | GitHub | Demo |
| HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace | arXiv | 2023-03-30 | GitHub | Demo |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | arXiv | 2023-03-20 | GitHub | Demo |
| ViperGPT: Visual Inference via Python Execution for Reasoning | arXiv | 2023-03-14 | GitHub | Local Demo |
| ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions | arXiv | 2023-03-12 | GitHub | Local Demo |
| ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction | ICCV | 2023-03-09 | - | - |
| Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models | arXiv | 2023-03-08 | GitHub | Demo |
| Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners | CVPR | 2023-03-03 | GitHub | - |
| From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models | CVPR | 2022-12-21 | GitHub | Demo |
| SuS-X: Training-Free Name-Only Transfer of Vision-Language Models | arXiv | 2022-11-28 | GitHub | - |
| PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning | CVPR | 2022-11-21 | GitHub | - |
| Visual Programming: Compositional visual reasoning without training | CVPR | 2022-11-18 | GitHub | Local Demo |
| Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | arXiv | 2022-04-01 | GitHub | - |

Foundation Models

Title Venue Date Code Demo
Star
Emu3: Next-Token Prediction is All You Need
arXiv 2024-09-27 Github Local Demo
Llama 3.2: Revolutionizing edge AI and vision with open, customizable models Meta 2024-09-25 - Demo
Pixtral-12B Mistral 2024-09-17 - -
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
arXiv 2024-08-16 Github -
The Llama 3 Herd of Models arXiv 2024-07-31 - -
Chameleon: Mixed-Modal Early-Fusion Foundation Models arXiv 2024-05-16 - -
Hello GPT-4o OpenAI 2024-05-13 - -
The Claude 3 Model Family: Opus, Sonnet, Haiku Anthropic 2024-03-04 - -
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context Google 2024-02-15 - -
Gemini: A Family of Highly Capable Multimodal Models Google 2023-12-06 - -
Fuyu-8B: A Multimodal Architecture for AI Agents blog 2023-10-17 Huggingface Demo
Unified Model for Image, Video, Audio and Language Tasks
arXiv 2023-07-30 Github Demo
PaLI-3 Vision Language Models: Smaller, Faster, Stronger arXiv 2023-10-13 - -
GPT-4V(ision) System Card OpenAI 2023-09-25 - -
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
arXiv 2023-09-09 Github -
Multimodal Foundation Models: From Specialists to General-Purpose Assistants arXiv 2023-09-18 - -
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
NeurIPS 2023-07-13 Github -
Generative Pretraining in Multimodality
arXiv 2023-07-11 Github Demo
Kosmos-2: Grounding Multimodal Large Language Models to the World
arXiv 2023-06-26 Github Demo
Transfer Visual Prompt Generator across LLMs
arXiv 2023-05-02 Github Demo
GPT-4 Technical Report arXiv 2023-03-15 - -
PaLM-E: An Embodied Multimodal Language Model arXiv 2023-03-06 - Demo
Prismer: A Vision-Language Model with An Ensemble of Experts
arXiv 2023-03-04 Github Demo
Language Is Not All You Need: Aligning Perception with Language Models
arXiv 2023-02-27 Github -
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
arXiv 2023-01-30 Github Demo
VIMA: General Robot Manipulation with Multimodal Prompts
ICML 2022-10-06 Github Local Demo
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge
NeurIPS 2022-06-17 Github -
Write and Paint: Generative Vision-Language Models are Unified Modal Learners
ICLR 2022-06-15 Github -
Language Models are General-Purpose Interfaces
arXiv 2022-06-13 Github -

Evaluation

Title Venue Date Page
OmniBench: Towards The Future of Universal Omni-Language Models
arXiv 2024-09-23 Github
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
arXiv 2024-08-23 Github
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models
TPAMI 2023-10-17 Github
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation
arXiv 2024-06-29 Github
Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs
arXiv 2024-06-28 Github
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
arXiv 2024-06-26 Github
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
arXiv 2024-04-15 Github
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
arXiv 2024-05-31 Github
Benchmarking Large Multimodal Models against Common Corruptions
NAACL 2024-01-22 Github
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
arXiv 2024-01-11 Github
A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
arXiv 2023-12-19 Github
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
arXiv 2023-12-05 Github
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs
arXiv 2023-11-27 Github
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs
arXiv 2023-11-24 Github
MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V
arXiv 2023-11-23 Github
VLM-Eval: A General Evaluation on Video Large Language Models arXiv 2023-11-20 Coming soon
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges
arXiv 2023-11-06 Github
On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving
arXiv 2023-11-09 Github
Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the Lead arXiv 2023-11-05 -
A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging arXiv 2023-10-31 -
An Early Evaluation of GPT-4V(ision)
arXiv 2023-10-25 Github
Exploring OCR Capabilities of GPT-4V(ision) : A Quantitative and In-depth Evaluation
arXiv 2023-10-25 Github
HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
CVPR 2023-10-23 Github
MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models
ICLR 2023-10-03 Github
Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations
arXiv 2023-10-02 Github
Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning
arXiv 2023-10-01 Github
Can We Edit Multimodal Large Language Models?
arXiv 2023-10-12 Github
REVO-LION: Evaluating and Refining Vision-Language Instruction Tuning Datasets
arXiv 2023-10-10 Github
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) arXiv 2023-09-29 -
TouchStone: Evaluating Vision-Language Models by Language Models
arXiv 2023-08-31 Github
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
arXiv 2023-08-31 Github
SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs
arXiv 2023-08-07 Github
Tiny LVLM-eHub: Early Multimodal Experiments with Bard
arXiv 2023-08-07 Github
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
arXiv 2023-08-04 Github
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
CVPR 2023-07-30 Github
MMBench: Is Your Multi-modal Model an All-around Player?
arXiv 2023-07-12 Github
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
arXiv 2023-06-23 Github
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
arXiv 2023-06-15 Github
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
arXiv 2023-06-11 Github
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models
arXiv 2023-06-08 Github
On The Hidden Mystery of OCR in Large Multimodal Models
arXiv 2023-05-13 Github

Multimodal RLHF

Title Venue Date Code Demo
Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization arXiv 2024-10-09 - -
Silkie: Preference Distillation for Large Visual Language Models
arXiv 2023-12-17 Github -
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
arXiv 2023-12-01 Github Demo
Aligning Large Multimodal Models with Factually Augmented RLHF
arXiv 2023-09-25 Github Demo

Others

Title Venue Date Code Demo
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
arXiv 2024-02-03 Github -
VCoder: Versatile Vision Encoders for Multimodal Large Language Models
arXiv 2023-12-21 Github Local Demo
Prompt Highlighter: Interactive Control for Multi-Modal LLMs
arXiv 2023-12-07 Github -
Planting a SEED of Vision in Large Language Model
arXiv 2023-07-16 Github
Can Large Pre-trained Models Help Vision Models on Perception Tasks?
arXiv 2023-06-01 Github -
Contextual Object Detection with Multimodal Large Language Models
arXiv 2023-05-29 Github Demo
Generating Images with Multimodal Language Models
arXiv 2023-05-26 Github -
On Evaluating Adversarial Robustness of Large Vision-Language Models
arXiv 2023-05-26 Github -
Grounding Language Models to Images for Multimodal Inputs and Outputs
ICML 2023-01-31 Github Demo

Awesome Datasets

Datasets of Pre-Training for Alignment

Name Paper Type Modalities
ShareGPT4Video ShareGPT4Video: Improving Video Understanding and Generation with Better Captions Caption Video-Text
COYO-700M COYO-700M: Image-Text Pair Dataset Caption Image-Text
ShareGPT4V ShareGPT4V: Improving Large Multi-Modal Models with Better Captions Caption Image-Text
AS-1B The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World Hybrid Image-Text
InternVid InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation Caption Video-Text
MS-COCO Microsoft COCO: Common Objects in Context Caption Image-Text
SBU Captions Im2Text: Describing Images Using 1 Million Captioned Photographs Caption Image-Text
Conceptual Captions Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning Caption Image-Text
LAION-400M LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs Caption Image-Text
VG Captions Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations Caption Image-Text
Flickr30k Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models Caption Image-Text
AI-Caps AI Challenger : A Large-scale Dataset for Going Deeper in Image Understanding Caption Image-Text
Wukong Captions Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark Caption Image-Text
GRIT Kosmos-2: Grounding Multimodal Large Language Models to the World Caption Image-Text-Bounding-Box
Youku-mPLUG Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks Caption Video-Text
MSR-VTT MSR-VTT: A Large Video Description Dataset for Bridging Video and Language Caption Video-Text
Webvid10M Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval Caption Video-Text
WavCaps WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research Caption Audio-Text
AISHELL-1 AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline ASR Audio-Text
AISHELL-2 AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale ASR Audio-Text
VSDial-CN X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages ASR Image-Audio-Text

Datasets of Multimodal Instruction Tuning

Name Paper Link Notes
UNK-VQA UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models Link A dataset designed to teach models to refrain from answering unanswerable questions
VEGA VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models Link A dataset for enhancing model capabilities in comprehension of interleaved information
ALLaVA-4V ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model Link Vision and language caption and instruction dataset generated by GPT4V
IDK Visually Dehallucinative Instruction Generation: Know What You Don't Know Link Dehallucinative visual instruction for "I Know" hallucination
CAP2QA Visually Dehallucinative Instruction Generation Link Image-aligned visual instruction dataset
M3DBench M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts Link A large-scale 3D instruction tuning dataset
ViP-LLaVA-Instruct Making Large Multimodal Models Understand Arbitrary Visual Prompts Link A mixture of LLaVA-1.5 instruction data and the region-level visual prompting data
LVIS-Instruct4V To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning Link A visual instruction dataset via self-instruction from GPT-4V
ComVint What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning Link A synthetic instruction dataset for complex visual reasoning
SparklesDialogue ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models Link A machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions to augment the conversational competence of instruction-following LLMs across multiple images and dialogue turns.
StableLLaVA StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data Link A cheap and effective approach to collect visual instruction tuning data
M-HalDetect Detecting and Preventing Hallucinations in Large Vision Language Models Coming soon A dataset used to train and benchmark models for hallucination detection and prevention
MGVLID ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning - A high-quality instruction-tuning dataset including image-text and region-text pairs
BuboGPT BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs Link A high-quality instruction-tuning dataset including audio-text audio caption data and audio-image-text localization data
SVIT SVIT: Scaling up Visual Instruction Tuning Link A large-scale dataset with 4.2M informative visual instruction tuning data, including conversations, detailed descriptions, complex reasoning and referring QAs
mPLUG-DocOwl mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding Link An instruction tuning dataset featuring a wide range of visual-text understanding tasks including OCR-free document understanding
PF-1M Visual Instruction Tuning with Polite Flamingo Link A collection of 37 vision-language datasets with responses rewritten by Polite Flamingo.
ChartLlama ChartLlama: A Multimodal LLM for Chart Understanding and Generation Link A multi-modal instruction-tuning dataset for chart understanding and generation
LLaVAR LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding Link A visual instruction-tuning dataset for Text-rich Image Understanding
MotionGPT MotionGPT: Human Motion as a Foreign Language Link An instruction-tuning dataset including multiple human motion-related tasks
LRV-Instruction Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning Link Visual instruction tuning dataset for addressing the hallucination issue
Macaw-LLM Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration Link A large-scale multi-modal instruction dataset in terms of multi-turn dialogue
LAMM-Dataset LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark Link A comprehensive multi-modal instruction tuning dataset
Video-ChatGPT Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models Link A 100K high-quality video instruction dataset
MIMIC-IT MIMIC-IT: Multi-Modal In-Context Instruction Tuning Link Multimodal in-context instruction tuning
M3IT M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning Link Large-scale, broad-coverage multimodal instruction tuning dataset
LLaVA-Med LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day Coming soon A large-scale, broad-coverage biomedical instruction-following dataset
GPT4Tools GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction Link Tool-related instruction datasets
MULTIS ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst Coming soon Multimodal instruction tuning dataset covering 16 multimodal tasks
DetGPT DetGPT: Detect What You Need via Reasoning Link Instruction-tuning dataset with 5000 images and around 30000 query-answer pairs
PMC-VQA PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering Coming soon Large-scale medical visual question-answering dataset
VideoChat VideoChat: Chat-Centric Video Understanding Link Video-centric multimodal instruction dataset
X-LLM X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages Link Chinese multimodal instruction dataset
LMEye LMEye: An Interactive Perception Network for Large Language Models Link A multi-modal instruction-tuning dataset
cc-sbu-align MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models Link Multimodal aligned dataset for improving the model's usability and generation fluency
LLaVA-Instruct-150K Visual Instruction Tuning Link Multimodal instruction-following data generated by GPT
MultiInstruct MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning Link The first multimodal instruction tuning benchmark dataset

Datasets of In-Context Learning

Name Paper Link Notes
MIC MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning Link A manually constructed instruction tuning dataset including interleaved text-image inputs, inter-related multiple image inputs, and multimodal in-context learning inputs.
MIMIC-IT MIMIC-IT: Multi-Modal In-Context Instruction Tuning Link Multimodal in-context instruction dataset

Datasets of Multimodal Chain-of-Thought

Name Paper Link Notes
EMER Explainable Multimodal Emotion Reasoning Coming soon A benchmark dataset for explainable emotion reasoning task
EgoCOT EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought Coming soon Large-scale embodied planning dataset
VIP Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction Coming soon An inference-time dataset that can be used to evaluate VideoCOT
ScienceQA Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering Link Large-scale multi-choice dataset, featuring multimodal science questions and diverse domains

Datasets of Multimodal RLHF

Name Paper Link Notes
VLFeedback Silkie: Preference Distillation for Large Visual Language Models Link A vision-language feedback dataset annotated by AI

Benchmarks for Evaluation

Name Paper Link Notes
LiveXiv LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content Link A live benchmark based on arXiv papers
TemporalBench TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models Link A benchmark for evaluation of fine-grained temporal understanding
OmniBench OmniBench: Towards The Future of Universal Omni-Language Models Link A benchmark that evaluates models' capabilities of processing visual, acoustic, and textual inputs simultaneously
MME-RealWorld MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? Link A challenging benchmark that involves real-life scenarios
CharXiv CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs Link Chart understanding benchmark curated by human experts
Video-MME Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis Link A comprehensive evaluation benchmark of Multi-modal LLMs in video analysis
VL-ICL Bench VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning Link A benchmark for M-ICL evaluation, covering a wide spectrum of tasks
TempCompass TempCompass: Do Video LLMs Really Understand Videos? Link A benchmark to evaluate the temporal perception ability of Video LLMs
CoBSAT Can MLLMs Perform Text-to-Image In-Context Learning? Link A benchmark for text-to-image ICL
VQAv2-IDK Visually Dehallucinative Instruction Generation: Know What You Don't Know Link A benchmark for assessing "I Know" visual hallucination
Math-Vision Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset Link A diverse mathematical reasoning benchmark
CMMMU CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark Link A Chinese benchmark involving reasoning and knowledge across multiple disciplines
MMCBench Benchmarking Large Multimodal Models against Common Corruptions Link A benchmark for examining self-consistency under common corruptions
MMVP Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs Link A benchmark for assessing visual capabilities
TimeIT TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding Link A video instruction-tuning dataset with timestamp annotations, covering diverse time-sensitive video-understanding tasks.
ViP-Bench Making Large Multimodal Models Understand Arbitrary Visual Prompts Link A benchmark for visual prompts
M3DBench M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts Link A 3D-centric benchmark
Video-Bench Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models Link A benchmark for video-MLLM evaluation
Charting-New-Territories Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs Link A benchmark for evaluating geographic and geospatial capabilities
MLLM-Bench MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V Link GPT-4V evaluation with per-sample criteria
BenchLMM BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models Link A benchmark for assessment of the robustness against different image styles
MMC-Benchmark MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning Link A comprehensive human-annotated benchmark with distinct tasks evaluating reasoning capabilities over charts
MVBench MVBench: A Comprehensive Multi-modal Video Understanding Benchmark Link A comprehensive multimodal benchmark for video understanding
Bingo Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges Link A benchmark for hallucination evaluation that focuses on two common types
MagnifierBench OtterHD: A High-Resolution Multi-modality Model Link A benchmark designed to probe models' fine-grained perception ability
HallusionBench HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models Link An image-context reasoning benchmark for evaluation of hallucination
PCA-EVAL Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond Link A benchmark for evaluating multi-domain embodied decision-making.
MMHal-Bench Aligning Large Multimodal Models with Factually Augmented RLHF Link A benchmark for hallucination evaluation
MathVista MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models Link A benchmark that challenges both visual and math reasoning capabilities
SparklesEval ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models Link A GPT-assisted benchmark for quantitatively assessing a model's conversational competence across multiple images and dialogue turns based on three distinct criteria.
ISEKAI Link-Context Learning for Multimodal LLMs Link A benchmark consisting exclusively of unseen generated image-label pairs designed for link-context learning
M-HalDetect Detecting and Preventing Hallucinations in Large Vision Language Models Coming soon A dataset used to train and benchmark models for hallucination detection and prevention
I4 Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions Link A benchmark to comprehensively evaluate instruction-following ability on complicated interleaved vision-language instructions
SciGraphQA SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs Link A large-scale chart-visual question-answering dataset
MM-Vet MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities Link An evaluation benchmark that examines large multimodal models on complicated multimodal tasks
SEED-Bench SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension Link A benchmark for evaluation of generative comprehension in MLLMs
MMBench MMBench: Is Your Multi-modal Model an All-around Player? Link A systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models
Lynx What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? Link A comprehensive evaluation benchmark including both image and video tasks
GAVIE Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning Link A benchmark to evaluate hallucination and instruction-following ability
MME MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models Link A comprehensive MLLM Evaluation benchmark
LVLM-eHub LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models Link An evaluation platform for MLLMs
LAMM-Benchmark LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark Link A benchmark for evaluating the quantitative performance of MLLMs on various 2D/3D vision tasks
M3Exam M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models Link A multilingual, multimodal, multilevel benchmark for evaluating MLLMs
OwlEval mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality Link Dataset for evaluation on multiple capabilities

Others

Name Paper Link Notes
IMAD IMAD: IMage-Augmented multi-modal Dialogue Link Multimodal dialogue dataset
Video-ChatGPT Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models Link A quantitative evaluation framework for video-based dialogue models
CLEVR-ATVC Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation Link A synthetic multimodal fine-tuning dataset for learning to reject instructions
Fruit-ATVC Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation Link A manually photographed multimodal fine-tuning dataset for learning to reject instructions
InfoSeek Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? Link A VQA dataset that focuses on asking information-seeking questions
OVEN Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities Link A dataset that focuses on recognizing Wikipedia visual entities from images in the wild