Early Exit Loss | Description | Difference from Traditional Cross-Entropy
---|---|---
Formula | J(X, Y, t) = Σ_{l=0}^{L-1} ε(t, l) · J_CE(g(x_{l+1}), Y) (see the sketch below the table) | Traditional cross-entropy computes the loss only after the final layer L, while early exit loss computes and aggregates losses at multiple layers (0 to L-1).
Components | - ε(t, l): normalized per-layer loss scale. - J_CE(g(x_{l+1}), Y): cross-entropy loss computed at each layer l. - C(t, l): binary curriculum function controlling when early exit loss is applied to a layer. - e(l): scale that increases across layers, giving higher weight to later layers. - e_scale: hyperparameter controlling the scaling of the early exit loss. | Traditional cross-entropy uses no per-layer scaling or curriculum learning; it depends solely on the output of the final layer.
Purpose | To train models that can exit early at different layers during inference while maintaining reasonable accuracy, by applying a weighted cross-entropy loss at each layer during training. | Aims to optimize the model's performance based only on the final layer's output.
Curriculum | Two curricula are used to prevent performance degradation: - Rotational: enables early exit loss at every R-th layer in a rotating fashion. - Gradual: gradually enables early exit loss from the last layer (L-1) toward the first layer (0). | Not applicable to traditional cross-entropy.
Impact on Training | Applying early exit loss to all layers at all times slows down training and can reduce overall accuracy, so the curricula are essential for mitigating these negative effects. | No such considerations exist for traditional cross-entropy, as it only involves computation at the last layer.
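To make the formula row and the two curricula concrete, here is a minimal PyTorch sketch. It is an illustration, not the paper's reference implementation: ε(t, l) is taken to be C(t, l)·e(l) / Σ_i C(t, i)·e(i), which is one natural reading of "normalized per-layer loss scale", and the exact rotational/gradual schedules, the linearly increasing e(l), and the helper names `curriculum` / `early_exit_loss` are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def curriculum(t: int, l: int, L: int, mode: str = "rotational",
               R: int = 4, total_steps: int = 10_000) -> float:
    """Binary curriculum C(t, l): 1.0 if layer l's early exit loss is enabled
    at training step t, else 0.0. The exact schedules below are assumptions;
    the summary above only describes the two schemes qualitatively."""
    if mode == "rotational":
        # Enable every R-th layer, rotating with the step; keep the last
        # layer always on so the standard LM loss is never dropped.
        return 1.0 if (l == L - 1 or l % R == t % R) else 0.0
    if mode == "gradual":
        # Turn layers on from the last layer (L-1) back toward layer 0
        # as training progresses.
        num_enabled = 1 + int((L - 1) * min(t / total_steps, 1.0))
        return 1.0 if l >= L - num_enabled else 0.0
    return 0.0

def early_exit_loss(hidden_states, lm_head, targets, t,
                    e_scale: float = 1.0, mode: str = "rotational"):
    """J(X, Y, t) = sum_l eps(t, l) * J_CE(g(x_{l+1}), Y), with the assumed
    normalization eps(t, l) = C(t, l) * e(l) / sum_i C(t, i) * e(i)."""
    L = len(hidden_states)
    # e(l): per-layer scale that grows with depth, so later exits get a
    # larger weight; linear growth scaled by e_scale is an assumption.
    e = [e_scale * (l + 1) for l in range(L)]
    C = [curriculum(t, l, L, mode) for l in range(L)]
    Z = sum(c * s for c, s in zip(C, e))
    loss = hidden_states[-1].new_zeros(())
    for l, x in enumerate(hidden_states):
        if C[l] == 0.0:
            continue  # this layer's exit loss is disabled at step t
        logits = lm_head(x)                   # g(x_{l+1}): shared unembedding
        ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        loss = loss + (C[l] * e[l] / Z) * ce  # normalized per-layer weight
    return loss

# Usage sketch: hidden_states would be a Transformer's per-layer outputs,
# lm_head the model's shared output projection.
L, B, S, H, V = 8, 2, 16, 64, 1000
lm_head = nn.Linear(H, V, bias=False)
hidden_states = [torch.randn(B, S, H) for _ in range(L)]
targets = torch.randint(0, V, (B, S))
loss = early_exit_loss(hidden_states, lm_head, targets, t=123, mode="rotational")
loss.backward()
```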
Title | Venue/Journal | Relevance
---|---|---
Attention is all you need | Advances in Neural Information Processing Systems, volume 30 | Foundational work on Transformer models.
BERxiT: Early exiting for BERT with better fine-tuning and extension to regression | Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume | Relevant to early exiting strategies in Transformer models.
Learning to skip for language modeling | ArXiv | Directly related to skipping layers in language modeling.
Draft & verify: Lossless large language model acceleration via self-speculative decoding | | Explores self-speculative decoding for LLM acceleration, a core concept of the paper.
Accelerating training of transformer-based language models with progressive layer dropping | Advances in Neural Information Processing Systems, volume 33 | Discusses layer dropping in Transformers, relevant to the layer-skipping concept.
SkipDecode: Autoregressive skip decoding with batching and caching for efficient LLM inference | | Highly relevant as it discusses skip decoding for efficient LLM inference.
Jump to conclusions: Short-cutting transformers with linear transformations | | Focuses on short-cutting Transformers, which connects to the layer-skipping idea.
Depth-adaptive transformer | ICLR | Relevant to adapting Transformer depth, similar to dynamically skipping layers.
Reducing transformer depth on demand with structured dropout | ICLR | Deals with reducing Transformer depth, a related concept.
SPEED: Speculative pipelined execution for efficient decoding | | Relevant due to its focus on speculative decoding for efficient inference.
Fast inference from transformers via speculative decoding | ICML | Discusses fast inference using speculative decoding, a key aspect of the paper.
Layer-wise pruning of transformer attention heads for efficient language modeling | 2021 18th International SoC Design Conference (ISOCC) | Focuses on pruning attention heads for efficiency, which relates to optimizing Transformer layers.
Confident adaptive language modeling | Advances in Neural Information Processing Systems, 2022 | Relevant to adaptive computation in language models.