🔥🔥🔥 A paper list of recent works on token compression for ViT and VLM.
Don’t Look Twice: Faster Video Transformers with Run-Length Tokenization. [RLT;Video;NeurIPS 2024;Github] (see the RLT sketch below)
Inference Optimal VLMs Need Only One Visual Token but Larger Models. [QueCC;Github]
Video Token Merging for Long-form Video Understanding. [Learnable VTM;Video] (see the token-merging sketch below)
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding. [LongVU;Video;Github]
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction. [PyramidDrop;Github]
Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers. [Victor]
VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models. [VidCompress]
Retrieval Replace Reduction: An Effective Visual Token Reduction Method via Semantic Match. [TRSM]
AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity. [AVG-LLaVA;Github]
Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs. [TRIM]
TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Consideration. [TC-LLaVA;Video]
TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings. [TG-LLaVA]
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding. [mPLUG-DocOwl2;Github]
TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval. [TempMe;Video;Github]
Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information. [Recoverable Compression]
HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments. [HiRED;Github]
Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding. [Token-level;Github]
HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models. [HiRes-LLaVA]
TokenPacker: Efficient Visual Projector for Multimodal LLM. [TokenPacker;Github]
VoCo-LLaMA: Towards Vision Compression with Large Language Models. [VoCo-LLaMA;Github]
DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models. [DeCo;Github] (see the pooling sketch below)
Matryoshka Multimodal Models. [Matryoshka;M3;Github]
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites. [InternVL;Pixel-Shuffle;Github] (see the pixel-shuffle sketch below)
CATP: Cross-Attention Token Pruning for Accuracy Preserved Multimodal Model Inference. [CATP]
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models. [LLaVA-PruMerge;Github]
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Acceleration for VLLM Inference. [FastV;ECCV 2024;Github] (see the FastV sketch below)
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model. [LDP-v2;Github]
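
The sketches below illustrate, in minimal PyTorch, the core compression ideas behind a few of the entries above. They are rough approximations written from the papers' descriptions, not the authors' implementations; function names and default values are made up for illustration.

Run-Length Tokenization (RLT) removes video patches that are essentially unchanged from the previous frame, so a static region costs one token instead of one token per frame. This sketch thresholds the per-patch change with a hypothetical `tau` and omits RLT's run-length positional encoding:

```python
import torch

def run_length_tokenize(patches: torch.Tensor, tau: float = 0.1):
    """RLT-style static-patch removal (sketch).

    patches: (T, N, C) per-frame patch embeddings.
    Keeps all patches of frame 0, then only patches whose mean absolute
    change from the previous frame exceeds `tau` (hypothetical threshold).
    """
    T, N, C = patches.shape
    keep = torch.ones(T, N, dtype=torch.bool)
    diff = (patches[1:] - patches[:-1]).abs().mean(dim=-1)  # (T-1, N) per-patch change
    keep[1:] = diff > tau                                   # drop near-static repeats
    return patches[keep], keep  # (M, C) surviving tokens and the keep mask
```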
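Several entries (Video Token Merging, TempMe, LLaVA-PruMerge) build on similarity-based token merging. Below is a generic ToMe-style bipartite soft-matching step, not any single paper's exact procedure; `r` is how many tokens are merged away per call. ToMe proper also tracks merged-token sizes for size-weighted attention; that bookkeeping is omitted here.

```python
import torch

def bipartite_merge(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r most similar token pairs between alternating sets A and B.

    x: (B, N, C) tokens; returns (B, N - r, C).
    """
    B, N, C = x.shape
    a, b = x[:, ::2], x[:, 1::2]                 # split tokens into two sets
    a_n = a / a.norm(dim=-1, keepdim=True)
    b_n = b / b.norm(dim=-1, keepdim=True)
    sim = a_n @ b_n.transpose(-1, -2)            # (B, Na, Nb) cosine similarity
    best_sim, best_idx = sim.max(dim=-1)         # best partner in B for each A token
    order = best_sim.argsort(dim=-1, descending=True)
    src_idx, keep_idx = order[:, :r], order[:, r:]   # merge the r most similar A tokens
    kept_a = torch.gather(a, 1, keep_idx.unsqueeze(-1).expand(-1, -1, C))
    dst = torch.gather(best_idx, 1, src_idx)     # (B, r) merge targets in B
    src = torch.gather(a, 1, src_idx.unsqueeze(-1).expand(-1, -1, C))
    b = b.scatter_reduce(1, dst.unsqueeze(-1).expand(-1, -1, C), src, reduce="mean")
    return torch.cat([kept_a, b], dim=1)
```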
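DeCo argues that a learned query-based compressor is unnecessary: plain 2D adaptive average pooling over the ViT token grid compresses tokens while leaving semantic abstraction to the LLM. A minimal sketch (projection layers omitted; the 12 x 12 output size is just an example):

```python
import torch
import torch.nn.functional as F

def deco_pool(tokens: torch.Tensor, out_hw=(12, 12)) -> torch.Tensor:
    """Downsample (B, N, C) ViT tokens by adaptive average pooling.

    Assumes N is a perfect square; returns (B, out_h * out_w, C).
    """
    B, N, C = tokens.shape
    H = W = int(N ** 0.5)
    grid = tokens.transpose(1, 2).reshape(B, C, H, W)   # back to a 2D feature map
    pooled = F.adaptive_avg_pool2d(grid, out_hw)        # (B, C, out_h, out_w)
    return pooled.flatten(2).transpose(1, 2)
```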
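InternVL's pixel shuffle (strictly, an unshuffle) trades spatial token count for channel width: each r x r neighborhood of visual tokens is folded into one token with r^2 times the channels, e.g. 1024 tokens become 256 at r = 2. A sketch of the rearrangement:

```python
import torch

def pixel_shuffle_tokens(x: torch.Tensor, r: int = 2) -> torch.Tensor:
    """Fold each r x r neighborhood of ViT tokens into a single token.

    x: (B, N, C) with N a perfect square; returns (B, N / r^2, C * r^2).
    """
    B, N, C = x.shape
    H = W = int(N ** 0.5)
    x = x.view(B, H, W, C)
    x = x.view(B, H, W // r, C * r)            # pack r columns into channels
    x = x.permute(0, 2, 1, 3).contiguous()     # (B, W/r, H, C*r)
    x = x.view(B, W // r, H // r, C * r * r)   # pack r rows into channels
    x = x.permute(0, 2, 1, 3).contiguous()     # (B, H/r, W/r, C*r^2)
    return x.view(B, N // (r * r), C * r * r)
```

A projection MLP then maps the widened channels back to the LLM's hidden size.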
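FastV observes that image tokens receive very little attention after the first couple of LLM layers, so after layer K (K = 2 in the paper) it ranks visual tokens by the attention they receive and drops the lowest-ranked ones. A sketch of the rank-and-prune step; the tensor layout and the choice to average over heads and queries are assumptions:

```python
import torch

def fastv_prune(hidden: torch.Tensor, attn: torch.Tensor,
                vis_start: int, vis_len: int,
                keep_ratio: float = 0.5) -> torch.Tensor:
    """Drop low-attention visual tokens after layer K (FastV-style sketch).

    hidden: (B, S, C) hidden states entering layer K + 1.
    attn:   (B, heads, S, S) attention weights from layer K.
    """
    scores = attn.mean(dim=1).mean(dim=1)                   # attention each token receives
    vis_scores = scores[:, vis_start:vis_start + vis_len]   # (B, vis_len)
    k = max(1, int(vis_len * keep_ratio))
    keep = vis_scores.topk(k, dim=-1).indices.sort(dim=-1).values  # preserve order
    kept = torch.gather(hidden[:, vis_start:vis_start + vis_len], 1,
                        keep.unsqueeze(-1).expand(-1, -1, hidden.size(-1)))
    return torch.cat([hidden[:, :vis_start], kept,
                      hidden[:, vis_start + vis_len:]], dim=1)
```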