
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding #64

Open Aidenzich opened 3 weeks ago

Aidenzich commented 3 weeks ago

What is early exit loss?

| Aspect | Early Exit Loss | Difference from Traditional Cross-Entropy |
|---|---|---|
| Formula | $J(X, Y, t) = \sum_{l=0}^{L-1} \varepsilon(t, l)\, J_{CE}\big(g(x_{l+1}), Y\big)$ | Traditional cross-entropy computes the loss only after the final layer $L$, while early exit loss computes and aggregates losses at multiple layers ($0$ to $L-1$). |
| Components | $\varepsilon(t, l)$: normalized per-layer loss scale.<br>$J_{CE}(g(x_{l+1}), Y)$: cross-entropy loss computed at each layer $l$.<br>$C(t, l)$: binary curriculum function controlling when early exit loss is applied to a layer.<br>$e(l)$: scale that increases across layers, giving higher weight to later layers.<br>$e_{scale}$: hyperparameter controlling the scaling of the early exit loss. | Traditional cross-entropy uses no per-layer scaling or curriculum learning; it relies solely on the output of the final layer. |
| Purpose | Train models that can exit early at different layers during inference while maintaining reasonable accuracy, by applying a weighted cross-entropy loss at each layer during training. | Optimizes the model's performance based only on the final layer's output. |
| Curriculum | Two curricula are used to prevent performance degradation (see the sketch below):<br>- Rotational: enables early exit loss at every $R$-th layer in a rotating fashion.<br>- Gradual: enables early exit loss gradually, starting from the last layer ($L-1$) and moving toward the first layer ($0$). | Not applicable to traditional cross-entropy. |
| Impact on training | Applying early exit loss to all layers at all times slows down training and can reduce overall accuracy, so the curricula are essential for mitigating these negative effects. | No such considerations exist, since computation involves only the last layer. |
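Below is a minimal PyTorch sketch of how the weighted per-layer loss and a rotational curriculum could fit together. The function names, the linear choice of `e(l)`, and the normalization `ε(t, l) = C(t, l) · e(l) / Σ_i C(t, i) · e(i)` are illustrative assumptions, not the paper's exact recipe; `hidden_states` stands for the per-layer hidden states from one forward pass and `lm_head` for a shared unembedding head `g`.

```python
import torch.nn.functional as F

def rotational_curriculum(t: int, l: int, num_layers: int, R: int = 4) -> float:
    """Binary curriculum C(t, l): enable early exit loss on every R-th layer,
    rotating which layers are active as training iteration t advances.
    The last layer is always kept on (assumption: its loss is the usual LM loss)."""
    return 1.0 if (l % R) == (t % R) or l == num_layers - 1 else 0.0

def early_exit_loss(hidden_states, lm_head, targets, t, e_scale=0.2, R=4):
    """Aggregate per-layer cross-entropy losses, weighted by a normalized,
    depth-increasing scale. A sketch of J(X, Y, t) under the assumptions above."""
    L = len(hidden_states)                       # number of transformer layers
    # e(l): scale that grows with depth so later layers get higher weight (assumed linear)
    e = [e_scale * (l + 1) for l in range(L)]
    # C(t, l): binary curriculum deciding which layers contribute at iteration t
    C = [rotational_curriculum(t, l, L, R) for l in range(L)]
    # Normalize so that the active layers' weights sum to 1
    Z = sum(c * s for c, s in zip(C, e)) or 1.0

    loss = 0.0
    for l, h in enumerate(hidden_states):
        if C[l] == 0.0:
            continue                             # layer disabled by the curriculum
        logits = lm_head(h)                      # g(x_{l+1}): shared unembedding head
        ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        loss = loss + (C[l] * e[l] / Z) * ce     # ε(t, l) * J_CE
    return loss
```

Reusing the model's own LM head for every exit point keeps the parameter count unchanged; only the training objective changes.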
Aidenzich commented 3 weeks ago

What are the most closely related references for this paper?

| Title | Venue/Journal | Relevance |
|---|---|---|
| Attention is all you need | Advances in Neural Information Processing Systems, volume 30 | Foundational work on Transformer models. |
| BERxiT: Early exiting for BERT with better fine-tuning and extension to regression | Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume | Relevant to early exiting strategies in Transformer models. |
| Learning to skip for language modeling | arXiv | Directly related to skipping layers in language modeling. |
| Draft & verify: Lossless large language model acceleration via self-speculative decoding | | Explores self-speculative decoding for LLM acceleration, a core concept of the paper. |
| Accelerating training of transformer-based language models with progressive layer dropping | Advances in Neural Information Processing Systems, volume 33 | Discusses layer dropping in Transformers, relevant to the layer-skipping concept. |
| SkipDecode: Autoregressive skip decoding with batching and caching for efficient LLM inference | | Highly relevant as it discusses skip decoding for efficient LLM inference. |
| Jump to conclusions: Short-cutting transformers with linear transformations | | Focuses on short-cutting Transformers, which connects to the layer-skipping idea. |
| Depth-adaptive transformer | ICLR | Relevant to adapting Transformer depth, similar to dynamically skipping layers. |
| Reducing transformer depth on demand with structured dropout | ICLR | Deals with reducing Transformer depth, a related concept. |
| SPEED: Speculative pipelined execution for efficient decoding | | Relevant due to its focus on speculative decoding for efficient inference. |
| Fast inference from transformers via speculative decoding | ICML | Discusses fast inference using speculative decoding, a key aspect of the paper. |
| Layer-wise pruning of transformer attention heads for efficient language modeling | 2021 18th International SoC Design Conference (ISOCC) | Focuses on pruning attention heads for efficiency, which relates to optimizing Transformer layers. |
| Confident adaptive language modeling | Advances in Neural Information Processing Systems, 2022 | Relevant to adaptive computation in language models. |