SNU - Natural Language Processing, Final Project
2023 - Autumn
The project aims to improve the generation speed of Large Language Models (LLMs) by using smaller models (drafters) to generate token drafts in parallel. These drafts are then accepted or replaced by the LLM. Two approaches are considered: using Knowledge Distillation to transfer the target model's distribution to smaller models and generating a pseudo-dataset to train non-neural models. Experiments will be conducted on summarization tasks using the CNN/DailyMail dataset, with the target model being FLAN-T5-XL and drafters including N-gram models, CNN, LSTM, and T5-small. Evaluation will compare acceptance rates between distilled and baseline models.
종윤 (@ArtemisDicoTiar), 재석 (@medduk9871), 원표 (@lenscloth), 재진 (@jjkim0807), Romain (@RomainStorai)
Large Language Models (LLMs) struggle with generation speed because they rely on auto-regressive generation (sequential, token-by-token decoding). This is mainly caused by a memory bottleneck. We can therefore run, in parallel, a condensed version of the target model (the original LLM we want to speed up) that drafts tokens. Each draft is either accepted by the system or replaced by the LLM's own generation. A single step then produces at least one token (no time lost) and at most $\gamma + 1$ tokens if we are lucky.
Our goal is to make the drafts as likely as possible to be accepted by the target model (so that each step tends to generate $\gamma + 1$ tokens), which accelerates generation. In this project, we therefore aim to infuse the target model's distribution into the drafters to increase the acceptance rate of the drafted tokens.
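To make the mechanism concrete, here is a minimal sketch of one speculative decoding step in the spirit of Leviathan et al. (2023). The helper methods next_token_probs and next_token_probs_batch are hypothetical stand-ins for however the drafter and target expose their next-token distributions; they are not the actual project API.

import torch

def speculative_step(target_model, drafter, prefix, gamma):
    # One speculative decoding step: the drafter proposes `gamma` tokens,
    # the target model verifies them, and each draft is accepted with
    # probability min(1, p_target / p_drafter).
    draft_tokens, q_dists = [], []
    ctx = prefix
    for _ in range(gamma):
        q = drafter.next_token_probs(ctx)                        # hypothetical helper
        tok = torch.multinomial(q, 1).item()
        draft_tokens.append(tok)
        q_dists.append(q)
        ctx = torch.cat([ctx, torch.tensor([tok])])

    # Hypothetical helper: target distributions for the gamma drafted positions
    # plus the position right after the last draft (gamma + 1 rows in total).
    p_dists = target_model.next_token_probs_batch(prefix, draft_tokens)

    output = []
    for i, tok in enumerate(draft_tokens):
        accept_prob = min(1.0, (p_dists[i][tok] / q_dists[i][tok]).item())
        if torch.rand(1).item() < accept_prob:
            output.append(tok)                                   # draft accepted
        else:
            # Draft rejected: resample from the adjusted residual distribution.
            residual = torch.clamp(p_dists[i] - q_dists[i], min=0)
            output.append(torch.multinomial(residual / residual.sum(), 1).item())
            break
    else:
        # All gamma drafts accepted: the target emits one bonus token for free.
        output.append(torch.multinomial(p_dists[gamma], 1).item())

    return output  # between 1 and gamma + 1 new tokens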
For each model, we will compare the parameter $\alpha = E(\beta)$ (with $\beta$ being the acceptance rate of the drafts) between a distilled model and a baseline. The baseline is a model fine-tuned on the dataset, while the distilled model is the same model after Knowledge Distillation.
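Operationally, $\alpha$ can be estimated by averaging the per-step acceptance fraction $\beta$ over an evaluation run. The sketch below assumes a simple log of how many drafts were accepted at each step; this format is illustrative, not the project's actual output.

def estimate_alpha(accepted_per_step, gamma):
    # accepted_per_step: number of drafted tokens accepted at each decoding
    # step (0..gamma). beta = accepted / gamma, alpha = E[beta].
    betas = [accepted / gamma for accepted in accepted_per_step]
    return sum(betas) / len(betas)

# Example: gamma = 5 and three decoding steps accepting 5, 3 and 4 drafts:
# estimate_alpha([5, 3, 4], gamma=5)  # -> 0.8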
Distillation Process: We use the Kullback-Leibler Divergence loss function.
$L_{KD} = D_{KL}\left(\log\left(\mathrm{softmax}(\mathrm{input}_{\mathrm{logits}} / T)\right) \,\middle\|\, \mathrm{softmax}(\mathrm{target}_{\mathrm{logits}} / T)\right)$
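A minimal PyTorch sketch of this loss, assuming drafter (input) and target (teacher) logits of shape (batch, vocab). The batchmean reduction and the $T^2$ scaling follow the standard Hinton-style recipe and are assumptions rather than the project's exact configuration.

import torch
import torch.nn.functional as F

def kd_loss(input_logits, target_logits, T=2.0):
    # KL divergence between the temperature-softened target (teacher) and
    # drafter (student) distributions, matching the formula above.
    student_log_probs = F.log_softmax(input_logits / T, dim=-1)
    teacher_probs = F.softmax(target_logits / T, dim=-1)
    # F.kl_div expects log-probabilities as `input` and probabilities as `target`.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (T ** 2)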
Summarization, CNN/DailyMail
$ pip install pipenv
$ pipenv install
$ python run.py train \
    --drafter <drafter model> \
    --exp_name <experiment name> \
    --loss <loss value> \
    --kd_loss <knowledge distillation loss> \
    --kd_temp <knowledge distillation temperature> \
    --distill_method <distillation method> \
    run
$ python run.py eval \
    --ckpt_path <path to checkpoint> \
    --device 0 \
    run
Since NgramModel isn't a deep learning model, the "training" process (building the n-grams) is different. To generate the pseudo-dataset, run
./FastLLM/scripts/generate_peudodataset_multigpus.sh
after modifying the parameters in the script. Then launch the "training":
$ python run.py train \
    --drafter ngram \
    --exp_name <experiment name> \
    --ngram_n <size of gram> \
    --pseudo_dataset_path <path to the generated dataset OR do not include to train on original> \
    run
Local checkpoints:
N-gram (original dataset): /data/romsto/ngrams_ckpts/dataset/
N-gram (pseudo-dataset): /data/romsto/ngrams_ckpts/pseudo_dataset/
CNN drafter: /data/jaeseok/data/nlp/models/cnn_cnn-drafter-2023.pt
LSTM drafter: /data/jaeseok/data/nlp/models/lstm_lstm-drafter-2023.pt
T5-small (baseline): /data/wppark/Workspace/FastLLM/t5small_baseline-drafter-2023.pt
T5-small (distilled, kd50): /data/wppark/Workspace/FastLLM/t5small_kd50-drafter-2023.pt
The NgramModel is a torch implementation of a naive n-gram model. It is used to compute the probability of the next token given the previous prefix. The model is built using a map of n-grams and their corresponding counts, which are used to calculate the probabilities.
The model uses Laplace smoothing to handle the case of unseen n-grams in the training data. The smoothing parameter can be adjusted during the model initialization.
The NgramModel class provides custom methods for saving and loading the model. The model is saved in a text format, which includes the n-gram counts and total counts. To load a saved model, pass the path of the saved model to the resume parameter of the constructor.
The model also uses backoff models for handling cases where the prefix length is less than n-1. The backoff model is an (n-1)-gram model that is used when the current n-gram model cannot provide a prediction.
Please note that the fit method is not a training method in the traditional sense, as the model is not trainable. It simply builds the n-gram counts based on the given data.
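For illustration, here is a condensed sketch of the ideas above (count building via fit, Laplace smoothing, and backoff to an (n-1)-gram model). The real NgramModel is a torch implementation with its own text save/load format, so the class below is a simplified stand-in rather than the project's code.

from collections import defaultdict

class NaiveNgram:
    # Illustrative n-gram counter with Laplace smoothing and backoff.

    def __init__(self, n, vocab_size, smoothing=1.0):
        self.n = n
        self.vocab_size = vocab_size
        self.smoothing = smoothing
        self.counts = defaultdict(lambda: defaultdict(int))  # prefix -> next token -> count
        self.totals = defaultdict(int)                       # prefix -> total count
        # Backoff to an (n-1)-gram model for prefixes shorter than n-1.
        self.backoff = NaiveNgram(n - 1, vocab_size, smoothing) if n > 1 else None

    def fit(self, sequences):
        # Not training in the usual sense: just accumulate n-gram counts.
        for seq in sequences:
            for i in range(len(seq) - self.n + 1):
                prefix = tuple(seq[i:i + self.n - 1])
                nxt = seq[i + self.n - 1]
                self.counts[prefix][nxt] += 1
                self.totals[prefix] += 1
        if self.backoff is not None:
            self.backoff.fit(sequences)

    def prob(self, prefix, token):
        # Laplace-smoothed P(token | prefix), backing off on short prefixes.
        prefix = tuple(prefix)
        if len(prefix) < self.n - 1 and self.backoff is not None:
            return self.backoff.prob(prefix, token)
        prefix = prefix[-(self.n - 1):] if self.n > 1 else ()
        count = self.counts[prefix][token] + self.smoothing
        total = self.totals[prefix] + self.smoothing * self.vocab_size
        return count / total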
@article{nallapati2016abstractive,
title={Abstractive text summarization using sequence-to-sequence rnns and beyond},
author={Nallapati, Ramesh and Zhou, Bowen and Gulcehre, Caglar and Xiang, Bing and others},
journal={arXiv preprint arXiv:1602.06023},
year={2016}
}
@inproceedings{leviathan2023fast,
title={Fast inference from transformers via speculative decoding},
author={Leviathan, Yaniv and Kalman, Matan and Matias, Yossi},
booktitle={International Conference on Machine Learning},
pages={19274--19286},
year={2023},
organization={PMLR}
}
@article{spector2023accelerating,
title={Accelerating llm inference with staged speculative decoding},
author={Spector, Benjamin and Re, Chris},
journal={arXiv preprint arXiv:2308.04623},
year={2023}
}
@article{zhang2023draft,
title={Draft \& Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding},
author={Zhang, Jun and Wang, Jue and Li, Huan and Shou, Lidan and Chen, Ke and Chen, Gang and Mehrotra, Sharad},
journal={arXiv preprint arXiv:2309.08168},
year={2023}
}
@article{kim2023big,
title={Big little transformer decoder},
author={Kim, Sehoon and Mangalam, Karttikeya and Malik, Jitendra and Mahoney, Michael W and Gholami, Amir and Keutzer, Kurt},
journal={arXiv preprint arXiv:2302.07863},
year={2023}
}
@misc{gu_knowledge_2023,
title = {Knowledge {Distillation} of {Large} {Language} {Models}},
url = {http://arxiv.org/abs/2306.08543},
doi = {10.48550/arXiv.2306.08543},
publisher = {arXiv},
author = {Gu, Yuxian and Dong, Li and Wei, Furu and Huang, Minlie},
year = {2023},
}
@misc{hinton_distilling_2015,
title = {Distilling the {Knowledge} in a {Neural} {Network}},
url = {http://arxiv.org/abs/1503.02531},
doi = {10.48550/arXiv.1503.02531},
publisher = {arXiv},
author = {Hinton, Geoffrey and Vinyals, Oriol and Dean, Jeff},
year = {2015},
}