Section | Details
---|---
Problem | LLMs consume significant memory for the key-value (KV) cache, especially in models with many layers, which hinders real-world deployment (see the cache-sizing example after the table).
Proposed Solution | **Layer-Condensed KV Cache**: computes and caches KVs for only a small number of layers, reducing memory consumption and improving throughput.<br>Pairs the queries of all layers with the keys and values of the top layer (see the attention sketch after the table).<br>Omits KV computation, and drops the corresponding KV parameters, in the layers that no longer need them, shrinking both the memory footprint and the model size.
Inspiration | Builds on the idea that the top layer holds the most refined representation of each token.
Maintaining Performance | Keeps "warmup" layers with standard attention at both the top and the bottom (a "sandwich" structure) to prevent performance degradation.
Training Challenges & Solutions | Sequential dependencies between tokens limit parallel training, but an approximate iterative method restores parallelization.<br>Gradient stopping limits how far backpropagation unrolls, reducing memory consumption during training.<br>Fast convergence of the KV pairs keeps the number of forward-propagation iterations small.
Handling Prompts | Prompt encoding must be computed iteratively, but the fast convergence of the KV pairs keeps the added latency small (see the iterative-encoding sketch after the table).
Experimental Results | Supports up to 32x larger batch sizes.<br>Achieves up to 26x higher throughput.<br>Delivers comparable performance on language modeling and downstream tasks.<br>Can be combined with techniques such as StreamingLLM for further efficiency gains.
Limitations | Training time increases because of the iterative procedure.<br>Throughput gains shrink when the prompt is much longer than the generated text.
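To make the "queries of all layers attend to the top layer's KVs" idea concrete, here is a minimal PyTorch sketch of a single decoding step. It is an illustration under simplifying assumptions (single-head attention, toy FFN, no normalization), and all names (`TinyLCKVDecoder`, `n_warmup`, `decode_step`) are made up for the sketch rather than taken from the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLCKVDecoder(nn.Module):
    def __init__(self, n_layers=8, d=64, n_warmup=2):
        super().__init__()
        self.n_layers, self.d = n_layers, d
        self.q_proj = nn.ModuleList([nn.Linear(d, d) for _ in range(n_layers)])
        # Warmup layers (bottom and top of the "sandwich") keep their own K/V
        # projections; the condensed middle layers drop theirs entirely.
        warmup = list(range(n_warmup)) + list(range(n_layers - n_warmup, n_layers))
        self.kv_proj = nn.ModuleDict({str(i): nn.Linear(d, 2 * d) for i in warmup})
        # Single K/V projection applied to the top layer's output, shared by all condensed layers.
        self.top_kv_proj = nn.Linear(d, 2 * d)
        self.ffn = nn.ModuleList([nn.Linear(d, d) for _ in range(n_layers)])

    def decode_step(self, x_new, top_kv, warmup_kv):
        """Compute the hidden state for one new token.
        x_new:     (batch, d) embedding of the new token.
        top_kv:    (batch, t, 2d) cached K/V of previous tokens, from the top layer.
        warmup_kv: dict layer-idx -> (batch, t, 2d) per-layer cache for warmup layers.
        """
        h = x_new
        for i in range(self.n_layers):
            q = self.q_proj[i](h).unsqueeze(1)                  # (batch, 1, d)
            if str(i) in self.kv_proj:                          # warmup layer: standard attention
                kv_new = self.kv_proj[str(i)](h).unsqueeze(1)
                warmup_kv[i] = torch.cat([warmup_kv[i], kv_new], dim=1)
                kv = warmup_kv[i]
            else:                                               # condensed layer: reuse top-layer K/V;
                kv = top_kv                                     # the new token's own K/V is not yet known
            k, v = kv.chunk(2, dim=-1)
            attn = F.softmax(q @ k.transpose(1, 2) / self.d ** 0.5, dim=-1)
            h = h + (attn @ v).squeeze(1)                       # attention + residual
            h = h + self.ffn[i](h)                              # toy FFN + residual
        # The new token's K/V for every condensed layer is derived from the top
        # layer's output and appended to the single shared cache.
        top_kv = torch.cat([top_kv, self.top_kv_proj(h).unsqueeze(1)], dim=1)
        return h, top_kv, warmup_kv
```

Only the one shared top-layer cache, plus the small warmup-layer caches, grows with sequence length and batch size, which is where the memory saving comes from.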
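A rough back-of-the-envelope sizing shows why caching KVs for only the top layer (plus a couple of warmup layers) frees enough memory for much larger batches. The model configuration below is an assumed, roughly 7B-class setup, not numbers from the paper.

```python
# Back-of-the-envelope KV-cache sizing. The configuration is an illustrative
# assumption (32 layers, 32 heads, head dim 128, fp16), not a figure from the paper.
def kv_cache_bytes(n_cached_layers, batch, seq_len, n_heads=32, head_dim=128, bytes_per_elem=2):
    # Factor of 2 accounts for storing both keys and values.
    return 2 * n_cached_layers * batch * seq_len * n_heads * head_dim * bytes_per_elem

standard = kv_cache_bytes(n_cached_layers=32, batch=8, seq_len=4096)       # every layer cached
condensed = kv_cache_bytes(n_cached_layers=1 + 2, batch=8, seq_len=4096)   # top layer + 2 warmup layers

print(f"standard : {standard / 2**30:.1f} GiB")
print(f"condensed: {condensed / 2**30:.1f} GiB ({standard / condensed:.1f}x smaller)")
```

Shrinking the per-sequence cache in this way is what allows the much larger batch sizes, and hence the higher throughput, reported above.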
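Because the KVs consumed by the condensed layers come from the top layer, a prompt cannot be encoded in a single forward pass; the summary above describes iterating until the KV pairs converge. The sketch below shows that loop, with `forward_all_layers` as a hypothetical callable (prompt ids plus the current top-layer KV guess in, hidden states plus a refreshed top-layer KV out), not an API from the paper's code.

```python
import torch

def encode_prompt(forward_all_layers, prompt_ids, d_model, n_iters=4):
    """Iteratively encode a prompt when the condensed layers consume top-layer K/V."""
    batch, seq_len = prompt_ids.shape
    # Start from a zero guess for the top-layer K/V; per the summary above, the
    # K/V pairs converge quickly, so a handful of iterations suffices.
    top_kv = torch.zeros(batch, seq_len, 2 * d_model)
    hidden = None
    for _ in range(n_iters):
        hidden, top_kv = forward_all_layers(prompt_ids, top_kv)
    return hidden, top_kv
```

Training uses a similar iteration, and stopping gradients through all but the final pass(es), for example by detaching `top_kv` before the last forward, is one way to realize the gradient-stopping trick noted in the table.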