
Multimodal & Large Language Models

Note: This paper list is only used to record papers I read from the daily arXiv feed for personal needs. I only subscribe to and cover the following subjects: Artificial Intelligence (cs.AI), Computation and Language (cs.CL), Computer Vision and Pattern Recognition (cs.CV), and Machine Learning (cs.LG). If you find that I missed some important or exciting work, it would be super helpful to let me know. Thanks!

Table of Contents

Survey

Position Paper

Structure

Event Extraction

Situation Recognition

Scene Graph

Attribute

Compositionality

Concept

Planning

Reasoning

Commonsense

Generation

Representation Learning

LLM Analysis

Calibration & Uncertainty

LLM Safety

LLM Evaluation

LLM Reasoning

Self-consistency

(with images)

LLM Application

LLM with Memory

Retrieval-augmented LLM

LLM with Human

LLM Foundation

Scaling Law

MoE

LLM Data Engineering

VLM Data Engineering

Alignment

Scalable Oversight & Superalignment

RL Foundation

Beyond Bandit

Agent

Autotelic Agent

Evaluation

VL Related Task

Interaction

Critique Modeling

MoE/Specialized

Vision-Language Foundation Model

First Generation: Uses region-based features; can be classified into one-stream and two-stream model architectures; before 2020.6.

Introduce image tags to learn image-text alignments.

Second Generation: Removes ROI features and object detectors for acceleration; moves to large-scale pretraining datasets; moves to unified architectures for understanding and generation tasks; mostly before 2022.6.

Special designs tailored to enhance position encoding & grounding.

Motivated to use unpaired image & text data to build a unified model for VL, vision, and language tasks, potentially bringing better performance.

Third Generation: Pursues one unified/general/generalist model covering more VL/NLP/CV tasks; becoming larger & stronger; 2022 -> now.

Generalist models

Fourth Generation: Relies on LLMs and instruction tuning.

Others

Vision-Language Model Application

Vision-Language Model Analysis & Evaluation

Multimodal Foundation Model

Image Generation

Diffusion

Document Understanding

Dataset

Table

Tool Learning

NLP

With Visual Tools

Instruction Tuning

In-context Learning

Learning from Feedback

Video Foundation Model

Key Frame Detection

Vision Model

Pretraining

Visual-augmented LM

Novel Techniques

Adaptation of Foundation Model

Prompting

Efficiency

Analysis

Grounding

VQA Task

VQA Dataset

Cognition

Knowledge

Social Good

Application

Benchmark & Evaluation

Dataset

Robustness

Hallucination & Factuality

Cognitive Neuroscience & Machine Learning

Theory of Mind

Cognitive Neuroscience

World Model

Resource