
Layer-Condensed KV Cache for Efficient Inference of Large Language Models #63


Aidenzich commented 1 month ago


| Section | Details |
| --- | --- |
| Problem | LLMs consume significant memory due to the key-value (KV) cache, especially in models with many layers, which hinders real-world deployment. |
| Proposed Solution | Layer-Condensed KV Cache: computes and caches KVs for only a small number of layers by pairing the queries of all layers with the keys and values of the top layer, reducing memory use and improving throughput. KV computation and the corresponding parameters are dropped where they are no longer needed, further shrinking memory use and model size (see the sketch after this table). |
| Inspiration | Based on the idea that the top layer holds the most refined token representation. |
| Maintaining Performance | Keeps a few "warmup" layers with standard attention at both the top and bottom (a "sandwich" structure) to prevent performance degradation. |
| Training Challenges & Solutions | Sequential dependencies between tokens limit parallel training, but an approximate method restores parallelization. Gradient stopping bounds backpropagation depth to reduce memory consumption during training. Fast convergence of the KV pairs keeps the number of forward-propagation iterations small. |
| Handling Prompts | Prompt encoding requires iterative computation, but fast KV convergence keeps the added time small. |
| Experimental Results | Supports up to 32x larger batch sizes and achieves up to 26x higher throughput, with comparable performance on language modeling and downstream tasks. Can be integrated with techniques such as StreamingLLM for additional efficiency. |
| Limitations | Training time increases due to the iterative process, and throughput drops when the prompt is much longer than the generated text. |
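Below is a minimal, single-head PyTorch sketch of the cache layout summarized above, not the authors' implementation. The class name `SharedKVCacheDecoder`, the warmup-layer counts, and the toy dimensions are assumptions; layer norms, multi-head attention, and the iterative refinement used for prompt encoding are omitted for brevity. Non-warmup layers attend to a single KV cache projected from the top layer's hidden states of previous tokens, while warmup layers keep standard per-layer caches.

```python
# Illustrative sketch of a layer-condensed KV cache (assumed names/dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedKVCacheDecoder(nn.Module):
    """Toy decoder: non-warmup layers attend to K/V projected from the TOP
    layer's hidden states of previous tokens (one shared cache), while a few
    bottom/top "warmup" layers keep standard per-layer caches."""

    def __init__(self, d_model=64, n_layers=8, n_warmup_bottom=1, n_warmup_top=1):
        super().__init__()
        self.n_layers = n_layers
        self.warmup = set(range(n_warmup_bottom)) | set(
            range(n_layers - n_warmup_top, n_layers)
        )
        self.q_proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_layers)])
        self.k_proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_layers)])
        self.v_proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_layers)])
        self.ffn = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
             for _ in range(n_layers)]
        )
        # Shared projections applied to the TOP layer's output to build the
        # single condensed KV cache used by all non-warmup layers.
        self.shared_k = nn.Linear(d_model, d_model)
        self.shared_v = nn.Linear(d_model, d_model)

    @torch.no_grad()
    def decode_step(self, x_t, shared_cache, warmup_caches):
        """x_t: (1, d_model) embedding of the current token.
        shared_cache: {'k','v'} tensors of shape (t, d_model), or None.
        warmup_caches: {layer_idx: {'k': ..., 'v': ...}} for warmup layers."""
        h = x_t
        for layer in range(self.n_layers):
            q = self.q_proj[layer](h)                        # (1, d)
            k_new = self.k_proj[layer](h)
            v_new = self.v_proj[layer](h)
            if layer in self.warmup:
                # Standard attention: this layer keeps its own KV cache.
                cache = warmup_caches.setdefault(layer, {"k": k_new, "v": v_new})
                if cache["k"] is not k_new:                   # append after first token
                    cache["k"] = torch.cat([cache["k"], k_new], dim=0)
                    cache["v"] = torch.cat([cache["v"], v_new], dim=0)
                k, v = cache["k"], cache["v"]
            else:
                # Condensed cache: K/V of previous tokens come from the top layer;
                # the current token uses its own per-layer K/V (an approximation
                # standing in for the paper's iterative refinement).
                if shared_cache is None:
                    k, v = k_new, v_new
                else:
                    k = torch.cat([shared_cache["k"], k_new], dim=0)
                    v = torch.cat([shared_cache["v"], v_new], dim=0)
            attn = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
            h = h + attn @ v
            h = h + self.ffn[layer](h)
        # After the top layer, condense this token into the shared cache.
        k_top, v_top = self.shared_k(h), self.shared_v(h)
        if shared_cache is None:
            shared_cache = {"k": k_top, "v": v_top}
        else:
            shared_cache["k"] = torch.cat([shared_cache["k"], k_top], dim=0)
            shared_cache["v"] = torch.cat([shared_cache["v"], v_top], dim=0)
        return h, shared_cache, warmup_caches


# Usage: decode a few toy tokens and inspect the shared cache.
model = SharedKVCacheDecoder()
shared, warm = None, {}
for _ in range(5):
    x = torch.randn(1, 64)
    h, shared, warm = model.decode_step(x, shared, warm)
print(shared["k"].shape)  # torch.Size([5, 64])
```

With this layout the cache grows by one shared entry per token plus per-layer entries only for the warmup layers, rather than one entry per layer, which is where the larger-batch and throughput gains described in the table come from.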