DefTruth / Awesome-LLM-Inference
A curated list of Awesome LLM Inference Papers with codes: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.
https://github.com/DefTruth/Awesome-LLM-Inference
GNU General Public License v3.0
2.59k stars · 174 forks
Issues (sorted by newest)
#79 · Bump up to v2.6 · DefTruth · closed 1 day ago · 0 comments
#78 · 🔥[LayerKV] Optimizing Large Language Model Serving with Layer-wise KV Cache Management · DefTruth · closed 1 day ago · 0 comments
#77 · 🔥[KV-COMPRESS] PAGED KV-CACHE COMPRESSION WITH VARIABLE COMPRESSION RATES PER ATTENTION HEAD · DefTruth · closed 1 day ago · 0 comments
#76 · 🔥🔥[Tensor Cores] Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores · DefTruth · closed 1 week ago · 0 comments
#75 · 🔥[AlignedKV] AlignedKV: Reducing Memory Access of KV-Cache with Precision-Aligned Quantization · DefTruth · closed 1 week ago · 0 comments
#74 · 🔥🔥[HiFloat8] Ascend HiFloat8 Format for Deep Learning · DefTruth · closed 1 week ago · 0 comments
#73 · [Low-bit] A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms · DefTruth · closed 1 week ago · 0 comments
#72 · 🔥🔥[INT-FLASHATTENTION] INT-FLASHATTENTION: ENABLING FLASH ATTENTION FOR INT8 QUANTIZATION · DefTruth · closed 1 week ago · 0 comments
#71 · fix typo · DefTruth · closed 1 week ago · 0 comments
#70 · 🔥[VPTQ] VPTQ: EXTREME LOW-BIT VECTOR POST-TRAINING QUANTIZATION FOR LARGE LANGUAGE MODELS · DefTruth · closed 1 week ago · 0 comments
#69 · Bump up to v2.5 · DefTruth · closed 1 week ago · 0 comments
#68 · 🔥🔥[CRITIPREFILL] CRITIPREFILL: A SEGMENT-WISE CRITICALITY-BASED APPROACH FOR PREFILLING ACCELERATION IN LLMS · DefTruth · closed 1 week ago · 0 comments
#67 · move RetrievalAttention -> long context · DefTruth · closed 1 week ago · 0 comments
#66 · Update codebase of paper "parallel speculative decoding with adaptive draft length" · smart-lty · closed 2 weeks ago · 1 comment
#65 · 🔥[InstInfer] InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference · DefTruth · closed 2 weeks ago · 0 comments
#64 · Bump up to v2.4 · DefTruth · closed 2 weeks ago · 0 comments
#63 · 🔥[Inf-MLLM] Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU · DefTruth · closed 2 weeks ago · 0 comments
#62 · 🔥[RetrievalAttention] Accelerating Long-Context LLM Inference via Vector Retrieval · DefTruth · closed 2 weeks ago · 0 comments
#61 · Bump up to v2.3 · DefTruth · closed 3 weeks ago · 0 comments
#60 · 🔥[SpMM] High Performance Unstructured SpMM Computation Using Tensor Cores · DefTruth · closed 4 weeks ago · 0 comments
#59 · 🔥[CHESS] CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification · DefTruth · closed 4 weeks ago · 0 comments
#58 · Bump up to v2.2 · DefTruth · closed 1 month ago · 0 comments
#57 · 🔥🔥[Context Distillation] Efficient LLM Context Distillation · DefTruth · closed 1 month ago · 0 comments
#56 · 🔥🔥[Prompt Compression] Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference · DefTruth · closed 1 month ago · 0 comments
#55 · 🔥[Speculative Decoding] Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation · DefTruth · closed 1 month ago · 0 comments
#54 · 🔥[SJF Scheduling] Efficient LLM Scheduling by Learning to Rank · DefTruth · closed 1 month ago · 0 comments
#53 · 🔥[Decentralized LLM] Decentralized LLM Inference over Edge Networks with Energy Harvesting · DefTruth · closed 1 month ago · 0 comments
#52 · 🔥[ACTIVATION SPARSITY] TRAINING-FREE ACTIVATION SPARSITY IN LARGE LANGUAGE MODELS · DefTruth · closed 1 month ago · 0 comments
#51 · Add NanoFlow code link · DefTruth · closed 1 month ago · 0 comments
#50 · Bump up to v2.1 · DefTruth · closed 1 month ago · 0 comments
#49 · 🔥🔥[FLA] FLA: A Triton-Based Library for Hardware-Efficient Implementa… · DefTruth · closed 1 month ago · 0 comments
#48 · 🔥[1-bit LLMs] Matmul or No Matmul in the Era of 1-bit LLMs · DefTruth · closed 1 month ago · 0 comments
#47 · 🔥🔥[MARLIN] MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models · DefTruth · closed 1 month ago · 0 comments
#46 · Add ABQ-LLM code link · DefTruth · closed 1 month ago · 0 comments
#45 · add code link [ABQ-LLM] · lswzjuer · closed 1 month ago · 2 comments
#44 · 🔥[MagicDec] MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding · DefTruth · closed 1 month ago · 0 comments
#43 · 🔥[NanoFlow] NanoFlow: Towards Optimal Large Language Model Serving Throughput · DefTruth · closed 1 month ago · 0 comments
#42 · 🔥[FocusLLM] FocusLLM: Scaling LLM's Context by Parallel Decoding · DefTruth · closed 1 month ago · 0 comments
#41 · 🔥[Speculative Decoding] Parallel Speculative Decoding with Adaptive Draft Length · DefTruth · closed 1 month ago · 0 comments
#40 · Update README.md · DefTruth · closed 1 month ago · 0 comments
#39 · Bump up to v2.0 · DefTruth · closed 1 month ago · 0 comments
#38 · [Token Recycling] Turning Trash into Treasure: Accelerating Inference… · DefTruth · closed 1 month ago · 0 comments
#37 · 🔥[ABQ-LLM] Arbitrary-Bit Quantized Inference Acceleration for Large Language Models · DefTruth · closed 1 month ago · 0 comments
#36 · Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference · DefTruth · closed 1 month ago · 0 comments
#35 · KOALA: Enhancing Speculative Decoding for LLM via Multi-Layer Draft Heads with Adversarial Learning · DefTruth · closed 1 month ago · 0 comments
#34 · 🔥🔥[Eigen Attention] Attention in Low-Rank Space for KV Cache Compression · DefTruth · closed 1 month ago · 0 comments
#33 · 🔥🔥[LUT TENSOR CORE] Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration · DefTruth · closed 1 month ago · 0 comments
#32 · Bump up to v1.9 · DefTruth · closed 1 month ago · 0 comments
#31 · 🔥🔥[500xCompressor] 500xCompressor: Generalized Prompt Compression for… · DefTruth · closed 1 month ago · 0 comments
#30 · 🔥[Automatic Inference Engine Tuning] Towards SLO-Optimized LLM Servin… · DefTruth · closed 1 month ago · 0 comments