DefTruth / Awesome-LLM-Inference
A curated list of Awesome LLM Inference Papers with codes: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.
https://github.com/DefTruth/Awesome-LLM-Inference
GNU General Public License v3.0
2.59k stars · 174 forks
Issues (sorted by newest)
#79 · Bump up to v2.6 · DefTruth · closed 1 day ago · 0 comments
#78 · 🔥[LayerKV] Optimizing Large Language Model Serving with Layer-wise KV Cache Management · DefTruth · closed 1 day ago · 0 comments
#77 · 🔥[KV-COMPRESS] PAGED KV-CACHE COMPRESSION WITH VARIABLE COMPRESSION RATES PER ATTENTION HEAD · DefTruth · closed 1 day ago · 0 comments
#76 · 🔥🔥[Tensor Cores] Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores · DefTruth · closed 1 week ago · 0 comments
#75 · 🔥[AlignedKV] AlignedKV: Reducing Memory Access of KV-Cache with Precision-Aligned Quantization · DefTruth · closed 1 week ago · 0 comments
#74 · 🔥🔥[HiFloat8] Ascend HiFloat8 Format for Deep Learning · DefTruth · closed 1 week ago · 0 comments
#73 · [Low-bit] A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms · DefTruth · closed 1 week ago · 0 comments
#72 · 🔥🔥[INT-FLASHATTENTION] INT-FLASHATTENTION: ENABLING FLASH ATTENTION FOR INT8 QUANTIZATION · DefTruth · closed 1 week ago · 0 comments
#71 · fix typo · DefTruth · closed 1 week ago · 0 comments
#70 · 🔥[VPTQ] VPTQ: EXTREME LOW-BIT VECTOR POST-TRAINING QUANTIZATION FOR LARGE LANGUAGE MODELS · DefTruth · closed 1 week ago · 0 comments
#69 · Bump up to v2.5 · DefTruth · closed 1 week ago · 0 comments
#68 · 🔥🔥[CRITIPREFILL] CRITIPREFILL: A SEGMENT-WISE CRITICALITY-BASED APPROACH FOR PREFILLING ACCELERATION IN LLMS · DefTruth · closed 1 week ago · 0 comments
#67 · move RetrievalAttention -> long context · DefTruth · closed 1 week ago · 0 comments
#66 · Update codebase of paper "parallel speculative decoding with adaptive draft length" · smart-lty · closed 2 weeks ago · 1 comment
#65 · 🔥[InstInfer] InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference · DefTruth · closed 2 weeks ago · 0 comments
#64 · Bump up to v2.4 · DefTruth · closed 2 weeks ago · 0 comments
#63 · 🔥[Inf-MLLM] Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU · DefTruth · closed 2 weeks ago · 0 comments
#62 · 🔥[RetrievalAttention] Accelerating Long-Context LLM Inference via Vector Retrieval · DefTruth · closed 2 weeks ago · 0 comments
#61 · Bump up to v2.3 · DefTruth · closed 3 weeks ago · 0 comments
#60 · 🔥[SpMM] High Performance Unstructured SpMM Computation Using Tensor Cores · DefTruth · closed 4 weeks ago · 0 comments
#59 · 🔥[CHESS] CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification · DefTruth · closed 4 weeks ago · 0 comments
#58 · Bump up to v2.2 · DefTruth · closed 1 month ago · 0 comments
#57 · 🔥🔥[Context Distillation] Efficient LLM Context Distillation · DefTruth · closed 1 month ago · 0 comments
#56 · 🔥🔥[Prompt Compression] Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference · DefTruth · closed 1 month ago · 0 comments
#55 · 🔥[Speculative Decoding] Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation · DefTruth · closed 1 month ago · 0 comments
#54 · 🔥[SJF Scheduling] Efficient LLM Scheduling by Learning to Rank · DefTruth · closed 1 month ago · 0 comments
#53 · 🔥[Decentralized LLM] Decentralized LLM Inference over Edge Networks with Energy Harvesting · DefTruth · closed 1 month ago · 0 comments
#52 · 🔥[ACTIVATION SPARSITY] TRAINING-FREE ACTIVATION SPARSITY IN LARGE LANGUAGE MODELS · DefTruth · closed 1 month ago · 0 comments
#51 · Add NanoFlow code link · DefTruth · closed 1 month ago · 0 comments
#50 · Bump up to v2.1 · DefTruth · closed 1 month ago · 0 comments
#49 · 🔥🔥[FLA] FLA: A Triton-Based Library for Hardware-Efficient Implementa… · DefTruth · closed 1 month ago · 0 comments
#48 · 🔥[1-bit LLMs] Matmul or No Matmul in the Era of 1-bit LLMs · DefTruth · closed 1 month ago · 0 comments
#47 · 🔥🔥[MARLIN] MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models · DefTruth · closed 1 month ago · 0 comments
#46 · Add ABQ-LLM code link · DefTruth · closed 1 month ago · 0 comments
#45 · add code link [ABQ-LLM] · lswzjuer · closed 1 month ago · 2 comments
#44 · 🔥[MagicDec] MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding · DefTruth · closed 1 month ago · 0 comments
#43 · 🔥[NanoFlow] NanoFlow: Towards Optimal Large Language Model Serving Throughput · DefTruth · closed 1 month ago · 0 comments
#42 · 🔥[FocusLLM] FocusLLM: Scaling LLM's Context by Parallel Decoding · DefTruth · closed 1 month ago · 0 comments
#41 · 🔥[Speculative Decoding] Parallel Speculative Decoding with Adaptive Draft Length · DefTruth · closed 1 month ago · 0 comments
#40 · Update README.md · DefTruth · closed 1 month ago · 0 comments
#39 · Bump up to v2.0 · DefTruth · closed 1 month ago · 0 comments
#38 · [Token Recycling] Turning Trash into Treasure: Accelerating Inference… · DefTruth · closed 1 month ago · 0 comments
#37 · 🔥[ABQ-LLM] Arbitrary-Bit Quantized Inference Acceleration for Large Language Models · DefTruth · closed 1 month ago · 0 comments
#36 · Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference · DefTruth · closed 1 month ago · 0 comments
#35 · KOALA: Enhancing Speculative Decoding for LLM via Multi-Layer Draft Heads with Adversarial Learning · DefTruth · closed 1 month ago · 0 comments
#34 · 🔥🔥[Eigen Attention] Attention in Low-Rank Space for KV Cache Compression · DefTruth · closed 1 month ago · 0 comments
#33 · 🔥🔥[LUT TENSOR CORE] Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration · DefTruth · closed 1 month ago · 0 comments
#32 · Bump up to v1.9 · DefTruth · closed 1 month ago · 0 comments
#31 · 🔥🔥[500xCompressor] 500xCompressor: Generalized Prompt Compression for… · DefTruth · closed 1 month ago · 0 comments
#30 · 🔥[Automatic Inference Engine Tuning] Towards SLO-Optimized LLM Servin… · DefTruth · closed 1 month ago · 0 comments