
Layer-Condensed KV Cache for Efficient Inference of Large Language Models #63


Aidenzich commented 1 month ago


| Section | Details |
| --- | --- |
| Problem | LLMs consume significant memory due to the key-value (KV) cache, especially in models with many layers, which hinders real-world deployment. |
| Proposed Solution | Layer-Condensed KV Cache: computes and caches KVs for only a small number of layers by pairing the queries of all layers with the keys and values of the top layer, reducing memory use and improving throughput. KV computation and the corresponding parameters are dropped where they are no longer needed, further shrinking memory use and model size (see the sketch after this table). |
| Inspiration | Based on the idea that the top layer holds the most refined token representation. |
| Maintaining Performance | Keeps a few "warmup" layers with standard attention at both the top and bottom (a "sandwich" structure) to prevent performance degradation. |
| Training Challenges & Solutions | Sequential dependencies between tokens limit parallel training, but an approximate method restores parallelization. Gradient stopping bounds backpropagation depth to reduce memory consumption during training. Fast convergence of the KV pairs keeps the number of forward-propagation iterations small. |
| Handling Prompts | Prompt encoding requires iterative computation, but fast KV convergence keeps the added time small. |
| Experimental Results | Supports up to 32x larger batch sizes and achieves up to 26x higher throughput, with comparable performance on language modeling and downstream tasks. Can be integrated with techniques such as StreamingLLM for additional efficiency. |
| Limitations | Training time increases due to the iterative process, and throughput drops when the prompt is much longer than the generated text. |
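Below is a minimal, single-head PyTorch sketch of the cache layout summarized above, not the authors' implementation. The class name `SharedKVCacheDecoder`, the warmup-layer counts, and the toy dimensions are assumptions; layer norms, multi-head attention, and the iterative refinement used for prompt encoding are omitted for brevity. Non-warmup layers attend to a single KV cache projected from the top layer's hidden states of previous tokens, while warmup layers keep standard per-layer caches.

```python
# Illustrative sketch of a layer-condensed KV cache (assumed names/dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedKVCacheDecoder(nn.Module):
    """Toy decoder: non-warmup layers attend to K/V projected from the TOP
    layer's hidden states of previous tokens (one shared cache), while a few
    bottom/top "warmup" layers keep standard per-layer caches."""

    def __init__(self, d_model=64, n_layers=8, n_warmup_bottom=1, n_warmup_top=1):
        super().__init__()
        self.n_layers = n_layers
        self.warmup = set(range(n_warmup_bottom)) | set(
            range(n_layers - n_warmup_top, n_layers)
        )
        self.q_proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_layers)])
        self.k_proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_layers)])
        self.v_proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_layers)])
        self.ffn = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
             for _ in range(n_layers)]
        )
        # Shared projections applied to the TOP layer's output to build the
        # single condensed KV cache used by all non-warmup layers.
        self.shared_k = nn.Linear(d_model, d_model)
        self.shared_v = nn.Linear(d_model, d_model)

    @torch.no_grad()
    def decode_step(self, x_t, shared_cache, warmup_caches):
        """x_t: (1, d_model) embedding of the current token.
        shared_cache: {'k','v'} tensors of shape (t, d_model), or None.
        warmup_caches: {layer_idx: {'k': ..., 'v': ...}} for warmup layers."""
        h = x_t
        for layer in range(self.n_layers):
            q = self.q_proj[layer](h)                        # (1, d)
            k_new = self.k_proj[layer](h)
            v_new = self.v_proj[layer](h)
            if layer in self.warmup:
                # Standard attention: this layer keeps its own KV cache.
                cache = warmup_caches.setdefault(layer, {"k": k_new, "v": v_new})
                if cache["k"] is not k_new:                   # append after first token
                    cache["k"] = torch.cat([cache["k"], k_new], dim=0)
                    cache["v"] = torch.cat([cache["v"], v_new], dim=0)
                k, v = cache["k"], cache["v"]
            else:
                # Condensed cache: K/V of previous tokens come from the top layer;
                # the current token uses its own per-layer K/V (an approximation
                # standing in for the paper's iterative refinement).
                if shared_cache is None:
                    k, v = k_new, v_new
                else:
                    k = torch.cat([shared_cache["k"], k_new], dim=0)
                    v = torch.cat([shared_cache["v"], v_new], dim=0)
            attn = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
            h = h + attn @ v
            h = h + self.ffn[layer](h)
        # After the top layer, condense this token into the shared cache.
        k_top, v_top = self.shared_k(h), self.shared_v(h)
        if shared_cache is None:
            shared_cache = {"k": k_top, "v": v_top}
        else:
            shared_cache["k"] = torch.cat([shared_cache["k"], k_top], dim=0)
            shared_cache["v"] = torch.cat([shared_cache["v"], v_top], dim=0)
        return h, shared_cache, warmup_caches


# Usage: decode a few toy tokens and inspect the shared cache.
model = SharedKVCacheDecoder()
shared, warm = None, {}
for _ in range(5):
    x = torch.randn(1, 64)
    h, shared, warm = model.decode_step(x, shared, warm)
print(shared["k"].shape)  # torch.Size([5, 64])
```

With this layout the cache grows by one shared entry per token plus per-layer entries only for the warmup layers, rather than one entry per layer, which is where the larger-batch and throughput gains described in the table come from.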