
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding #64

Open Aidenzich opened 3 weeks ago

Aidenzich commented 3 weeks ago

What is early exit loss?

| Aspect | Early Exit Loss | Difference from Traditional Cross-Entropy |
|---|---|---|
| Formula | $J(X, Y, t) = \sum_{l=0}^{L-1} \varepsilon(t, l)\, J_{CE}\big(g(x_{l+1}), Y\big)$ | Traditional cross-entropy computes the loss only after the final layer $L$, while early exit loss computes and aggregates losses at multiple layers ($0$ to $L-1$). |
| Components | $\varepsilon(t, l)$: normalized per-layer loss scale.<br>$J_{CE}(g(x_{l+1}), Y)$: cross-entropy loss computed at each layer $l$.<br>$C(t, l)$: binary curriculum function controlling when early exit loss is applied to a layer.<br>$e(l)$: scale that increases across layers, giving higher weight to later layers.<br>$e_{scale}$: hyperparameter controlling the scaling of the early exit loss. | Traditional cross-entropy uses no per-layer scaling or curriculum learning; it relies solely on the output of the final layer. |
| Purpose | Train models that can exit early at different layers during inference while maintaining reasonable accuracy, by applying a weighted cross-entropy loss at each layer during training. | Optimizes the model's performance based only on the final layer's output. |
| Curriculum | Two curricula are used to prevent performance degradation (see the sketch below):<br>- Rotational: enables early exit loss at every $R$-th layer in a rotating fashion.<br>- Gradual: enables early exit loss gradually, starting from the last layer ($L-1$) and moving toward the first layer ($0$). | Not applicable to traditional cross-entropy. |
| Impact on training | Applying early exit loss to all layers at all times slows down training and can reduce overall accuracy, so the curricula are essential for mitigating these negative effects. | No such considerations exist, since computation involves only the last layer. |
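Below is a minimal PyTorch sketch of how the weighted per-layer loss and a rotational curriculum could fit together. The function names, the linear choice of `e(l)`, and the normalization `ε(t, l) = C(t, l) · e(l) / Σ_i C(t, i) · e(i)` are illustrative assumptions, not the paper's exact recipe; `hidden_states` stands for the per-layer hidden states from one forward pass and `lm_head` for a shared unembedding head `g`.

```python
import torch.nn.functional as F

def rotational_curriculum(t: int, l: int, num_layers: int, R: int = 4) -> float:
    """Binary curriculum C(t, l): enable early exit loss on every R-th layer,
    rotating which layers are active as training iteration t advances.
    The last layer is always kept on (assumption: its loss is the usual LM loss)."""
    return 1.0 if (l % R) == (t % R) or l == num_layers - 1 else 0.0

def early_exit_loss(hidden_states, lm_head, targets, t, e_scale=0.2, R=4):
    """Aggregate per-layer cross-entropy losses, weighted by a normalized,
    depth-increasing scale. A sketch of J(X, Y, t) under the assumptions above."""
    L = len(hidden_states)                       # number of transformer layers
    # e(l): scale that grows with depth so later layers get higher weight (assumed linear)
    e = [e_scale * (l + 1) for l in range(L)]
    # C(t, l): binary curriculum deciding which layers contribute at iteration t
    C = [rotational_curriculum(t, l, L, R) for l in range(L)]
    # Normalize so that the active layers' weights sum to 1
    Z = sum(c * s for c, s in zip(C, e)) or 1.0

    loss = 0.0
    for l, h in enumerate(hidden_states):
        if C[l] == 0.0:
            continue                             # layer disabled by the curriculum
        logits = lm_head(h)                      # g(x_{l+1}): shared unembedding head
        ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        loss = loss + (C[l] * e[l] / Z) * ce     # ε(t, l) * J_CE
    return loss
```

Reusing the model's own LM head for every exit point keeps the parameter count unchanged; only the training objective changes.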
Aidenzich commented 3 weeks ago

What are the most closely related references for this paper?

| Title | Venue/Journal | Relevance |
|---|---|---|
| Attention is all you need | Advances in Neural Information Processing Systems, volume 30 | Foundational work on Transformer models. |
| BERxiT: Early exiting for BERT with better fine-tuning and extension to regression | Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume | Relevant to early exiting strategies in Transformer models. |
| Learning to skip for language modeling | arXiv | Directly related to skipping layers in language modeling. |
| Draft & verify: Lossless large language model acceleration via self-speculative decoding | | Explores self-speculative decoding for LLM acceleration, a core concept of the paper. |
| Accelerating training of transformer-based language models with progressive layer dropping | Advances in Neural Information Processing Systems, volume 33 | Discusses layer dropping in Transformers, relevant to the layer-skipping concept. |
| SkipDecode: Autoregressive skip decoding with batching and caching for efficient LLM inference | | Highly relevant as it discusses skip decoding for efficient LLM inference. |
| Jump to conclusions: Short-cutting transformers with linear transformations | | Focuses on short-cutting Transformers, which connects to the layer-skipping idea. |
| Depth-adaptive transformer | ICLR | Relevant to adapting Transformer depth, similar to dynamically skipping layers. |
| Reducing transformer depth on demand with structured dropout | ICLR | Deals with reducing Transformer depth, a related concept. |
| SPEED: Speculative pipelined execution for efficient decoding | | Relevant due to its focus on speculative decoding for efficient inference. |
| Fast inference from transformers via speculative decoding | ICML | Discusses fast inference using speculative decoding, a key aspect of the paper. |
| Layer-wise pruning of transformer attention heads for efficient language modeling | 2021 18th International SoC Design Conference (ISOCC) | Focuses on pruning attention heads for efficiency, which relates to optimizing Transformer layers. |
| Confident adaptive language modeling | Advances in Neural Information Processing Systems, 2022 | Relevant to adaptive computation in language models. |