This repository contains the PyTorch code for the paper LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models (NAACL 2024 Outstanding Paper award). The work was done by Chi Han, Qifan Wang, Hao Peng, Wenhan Xiong, Yu Chen, Heng Ji, Sinong Wang.
In this paper, the authors propose a simple method, called LM-Infinite, to improve the length generalization of large language models to an extreme length of 200M tokens, without any additional training or parameter updates.
Our motivation comes from first identifying three factors underlying the length generalization failure in LLMs: (a) Factor 1: Unseen distances between tokens cause attention logits to explode. (b) Factor 2: An unseen number of tokens can cause attention entropy to increase beyond the training range as the length increases. (c) Factor 3: The first few tokens occupy a distinct feature region and should not be discarded.
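Factor 2 can be illustrated with a toy calculation. The sketch below is not taken from the repository: the logit values and the helper names (`softmax`, `attention_entropy`, `toy_logits`) are assumptions chosen only to show that, for a fixed logit pattern, the entropy of a softmax attention distribution keeps growing as the number of attended tokens grows.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_entropy(logits):
    """Shannon entropy (in nats) of the attention distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def toy_logits(n):
    """Hypothetical pattern: one relevant key (logit 2.0), n-1 background keys (logit 0.0)."""
    return [2.0] + [0.0] * (n - 1)

# Entropy rises with context length even though the logit pattern is unchanged.
for n in (2048, 8192, 32768):
    print(n, round(attention_entropy(toy_logits(n)), 3))
```

A model trained on contexts of at most a few thousand tokens has only seen entropies from the lower end of this range, which is the situation Factor 2 describes.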
The key idea is to use (1) a $\Lambda$-shaped attention pattern, so that each token only attends to the nearest $L_{pretrain}$ tokens as well as a few starting tokens, and (2) a distance limit $L_{pretrain}$, so that the attention distance is capped at $L_{pretrain}$. The proposed method is compatible with multiple state-of-the-art language models, including but not limited to LLaMA, Llama-2, GPT-J, and the MPT-7B series. LM-Infinite is also computationally efficient, with only $O(n)$ time complexity.
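The two components can be sketched with a small dense mask. This is an illustration only, not the efficient implementation in models/lambda_attention.py; the names `lambda_mask`, `capped_distance`, `n_local`, and `n_global` are assumptions introduced here, and the tiny sizes are chosen so the pattern can be printed.

```python
def lambda_mask(n_tokens, n_local, n_global):
    """mask[i][j] is True when query position i may attend to key position j."""
    mask = []
    for i in range(n_tokens):
        row = []
        for j in range(n_tokens):
            causal = j <= i              # never attend to future tokens
            is_global = j < n_global     # the first few tokens stay visible
            is_local = i - j < n_local   # the nearest n_local tokens
            row.append(causal and (is_global or is_local))
        mask.append(row)
    return mask

def capped_distance(i, j, limit):
    """Relative distance used for position encoding, capped at `limit`."""
    return min(i - j, limit)

mask = lambda_mask(n_tokens=12, n_local=4, n_global=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Each query attends to at most n_local + n_global keys, so the total work
# grows linearly with the sequence length.
print(max(sum(row) for row in mask))
```

The printed pattern has a vertical bar on the left (the starting tokens) and a diagonal band (the local window), which together form the $\Lambda$ shape.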
We have implemented the LM-Infinite method as a drop-in replacement for HuggingFace Transformers. After you load a Transformers model that is a Llama, MPT, or GPT-J model, you can run the following code to enable LM-Infinite.
For a Llama model:
from models.llama import convert_llama_model
model = convert_llama_model(model, 4096, 10)
For an MPT model:
from models.mpt_7b import convert_mpt_model
model = convert_mpt_model(model, 4096, 10)
For a GPT-J model:
from models.gpt_j import convert_gpt_j_model
model = convert_gpt_j_model(model, 4096, 10)
Then, you can use the model as usual!
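If the model family is only known at run time, the three snippets above can be wrapped in a small dispatcher. This is a sketch under stated assumptions: `enable_lm_infinite` and `CONVERTERS` are hypothetical names, the keys are the `model_type` strings reported by the HuggingFace configs, and the two numeric arguments are assumed to be the local and global branch sizes (matching the 4096 and 10 used above).

```python
import importlib

# model.config.model_type -> (module, function) from the snippets above.
CONVERTERS = {
    "llama": ("models.llama", "convert_llama_model"),
    "mpt": ("models.mpt_7b", "convert_mpt_model"),
    "gptj": ("models.gpt_j", "convert_gpt_j_model"),
}

def enable_lm_infinite(model, local_branch=4096, global_branch=10):
    """Pick the matching converter and apply it to an already loaded model.

    Assumption: the converters take (model, local_branch, global_branch)
    positionally, as in the calls shown above.
    """
    model_type = model.config.model_type
    if model_type not in CONVERTERS:
        raise ValueError(f"No LM-Infinite converter for model type {model_type!r}")
    module_name, function_name = CONVERTERS[model_type]
    # Imported lazily so that only the needed model file is loaded.
    convert = getattr(importlib.import_module(module_name), function_name)
    return convert(model, local_branch, global_branch)
```

Run it from the repository root (or with PYTHONPATH=.) so that the `models` package can be imported.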
A detailed list of Python packages from an Anaconda perspective can be found in requirements.txt. Some packages were installed by conda and some by pip.
The commands I used to install the requirements in an Anaconda and pip environment are as follows:
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
conda install -c conda-forge sentencepiece einops cudatoolkit-dev tqdm ipython datasets evaluate rouge-score protobuf accelerate langchain openai
pip install transformers deepspeed
├── LICENSE
├── README.md
├── requirements.txt
├── configs
│ └── zero3_efficient_config.json # config for deepspeed acceleration
├── data
│ ├── generation_metrics.py
│ ├── get_data.py # dataset loading and preprocessing
│ ├── passkey_retrieval
│ │ ├── create_passkey_data.py
│ │ ├── create_passkey_data.sh
│ │ └── passkey_retrieval_accuracy.py
│ └── split_pile_file.py # split the Pile dataset into task-specific files
├── models
│ ├── constant.py # a constant function model
│ ├── get_llama2
│ │ ├── convert_llama_weights_to_hf.py # convert llama-2 weights to huggingface format
│ │ └── download_llama2.sh
│ ├── get_model.py
│ ├── gpt_j.py
│ ├── lambda_attention.py # efficient implementation of lambda attention
│ ├── llama.py
│ ├── model_base.py
│ └── mpt_7b.py
├── scripts
│ ├── combine_evaluate_generation.py
│ ├── combine_results.py
│ ├── eval_downstream_tasks.py # evaluate on passkey retrieval task
│ ├── eval_generation.py # evaluate generation metrics
│ └── eval_ppl_deepspeed.py # evaluate perplexity
├── utils
│ ├── arguments.py
│ └── utils.py
└── visualization
  ├── plot_nll.py
  ├── position_pca.py
  └── relative_attention_explosion.py
For datasets, you need to prepare a corpus dataset.
If you have downloaded the original Pile source (https://pile.eleuther.ai) to ${PILE_PATH}/test.jsonl.zst and ${PILE_PATH}/val.jsonl.zst, run the following commands to extract the compressed dataset.
cd ${PILE_PATH}
zstd -d ./test.jsonl.zst
zstd -d ./val.jsonl.zst
Then run the following commands to split the dataset into task-specific files.
cd ${REPOSITORY_ROOT}
mkdir -p ${PILE_PATH}/val
mkdir -p ${PILE_PATH}/test
python data/split_pile_file.py ${PILE_PATH}/val.jsonl ${PILE_PATH}/val
python data/split_pile_file.py ${PILE_PATH}/test.jsonl ${PILE_PATH}/test
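For orientation, the following sketch shows one way such a split can work. It is not the code of data/split_pile_file.py: it assumes the usual Pile record layout, where each JSONL line carries a `text` field and a `meta.pile_set_name` field, and the function name `group_by_subset` is hypothetical.

```python
import json
from collections import defaultdict

def group_by_subset(jsonl_lines):
    """Group document texts by the Pile subset named in each record's metadata."""
    groups = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        subset = record.get("meta", {}).get("pile_set_name", "unknown")
        groups[subset].append(record["text"])
    return dict(groups)

# Three made-up records standing in for lines of test.jsonl.
sample = [
    json.dumps({"text": "An abstract.", "meta": {"pile_set_name": "ArXiv"}}),
    json.dumps({"text": "def f(): pass", "meta": {"pile_set_name": "Github"}}),
    json.dumps({"text": "Another abstract.", "meta": {"pile_set_name": "ArXiv"}}),
]
print({name: len(texts) for name, texts in group_by_subset(sample).items()})
```

In the commands below, the subset name is what `--dataset_group ArXiv` selects.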
However, the official Pile does not seem to be available for download anymore, so you probably need to find another source (e.g., https://huggingface.co/datasets/arxiv_dataset or https://openwebtext2.readthedocs.io/en/latest/). Alternatively, you can use your own corpus. Both options require you to edit data/get_data.py.
For backbone models, the paper uses Llama-2, LLaMA, GPT-J, and MPT-7B. The last three models are available on the fly from the HuggingFace model hub, so no action is needed beforehand. The Llama-2 download key needs to be requested through the Meta AI request form. Then run the following command
bash models/get_llama2/download_llama2.sh
and follow the prompts to download the checkpoints to ${PATH_TO_LLAMA2_CHECKPOINTS}.
Then run
python models/get_llama2/convert_llama_weights_to_hf.py \
--input_dir ${PATH_TO_LLAMA2_CHECKPOINTS} \
--model_size 7B \
--output_dir ${PATH_TO_LLAMA2_CHECKPOINTS}/llama-2-7b-hf
to convert the llama-2-7b checkpoints to huggingface format.
The code requires a ${LOG_DIR} directory to store the logs and results. Please select a directory with enough space.
Evaluating the perplexity of the Llama-2 model on the ArXiv test set:
TRIAL=llama2-infinite-ArXiv
mkdir -p $LOG_DIR/$TRIAL
CUDA_VISIBLE_DEVICES=0
MASTER_PORT=$(shuf -i 29500-65535 -n 1)
DS_SKIP_CUDA_CHECK=1 PYTHONPATH=. deepspeed --include localhost:$CUDA_VISIBLE_DEVICES --master_port $MASTER_PORT scripts/eval_ppl_deepspeed.py \
--deepspeed_config configs/zero3_efficient_config.json \
--model ${PATH_TO_LLAMA2_CHECKPOINTS}/llama-2-7b-hf --tokenizer_path ${PATH_TO_LLAMA2_CHECKPOINTS} \
--use_lambda_attention --local_branch 4096 --global_branch 100 --limit_distance 4096 \
--dataset the_pile --dataset_group ArXiv --split test --dataset_dir ${PILE_PATH} \
--max_length 32770 \
--log_dir $LOG_DIR/$TRIAL
A brief explanation of the arguments:
- --model: the path or name of the model. Pass decapoda-research/llama-7b-hf to use LLaMA, mosaicml/mpt-7b to use MPT-7B, and EleutherAI/gpt-j-6b to use GPT-J-6B.
- --tokenizer_path: the path to the tokenizer. Remove this argument if not using Llama-2.
- --use_lambda_attention: use lambda attention. (Required for LM-Infinite)
- --local_branch: the local branch size. 2048 for LLaMA, MPT-7B and GPT-J. (Required for LM-Infinite)
- --global_branch: the global branch size. Values in the range 10-100 generally give similar results. (Required for LM-Infinite)
- --limit_distance: the distance limit. 2048 for LLaMA, MPT-7B and GPT-J. (Required for LM-Infinite)
- --dataset: the dataset name. See data/get_data.py to figure out how to use custom datasets.
If you want to evaluate vanilla models without LM-Infinite, simply remove the --use_lambda_attention --local_branch 4096 --global_branch 100 --limit_distance 4096 argument set.
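The per-model values mentioned above can be kept in one place when scripting several runs. The helper below is a convenience sketch, not part of the repository: `LM_INFINITE_PRESETS` and `lambda_attention_flags` are hypothetical names, the 2048 values come from the argument notes, and 4096 for Llama-2 comes from the example command.

```python
# Window sizes per backbone; keys are informal labels, not HuggingFace model names.
LM_INFINITE_PRESETS = {
    "llama-2": {"local_branch": 4096, "limit_distance": 4096},
    "llama": {"local_branch": 2048, "limit_distance": 2048},
    "mpt-7b": {"local_branch": 2048, "limit_distance": 2048},
    "gpt-j": {"local_branch": 2048, "limit_distance": 2048},
}

def lambda_attention_flags(model_family, global_branch=10):
    """Build the LM-Infinite argument set for one backbone as a list of CLI tokens."""
    preset = LM_INFINITE_PRESETS[model_family]
    return [
        "--use_lambda_attention",
        "--local_branch", str(preset["local_branch"]),
        "--global_branch", str(global_branch),
        "--limit_distance", str(preset["limit_distance"]),
    ]

print(" ".join(lambda_attention_flags("gpt-j")))
```

Returning an empty list instead would correspond to evaluating the vanilla model.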
If you only want to evaluate on a subset of the test set, you can use the --start_data_from argument to specify the starting index in the test set, and/or --max_data_num to specify the number of examples after that index.
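Read literally, the two options describe a slice of the test set. The sketch below spells that reading out; it assumes a zero-based start index and is not taken from the evaluation scripts, and `select_subset` is a hypothetical name.

```python
def select_subset(examples, start_data_from=0, max_data_num=None):
    """Keep at most max_data_num examples, beginning at index start_data_from."""
    end = None if max_data_num is None else start_data_from + max_data_num
    return examples[start_data_from:end]

documents = [f"doc-{i}" for i in range(10)]
print(select_subset(documents, start_data_from=3, max_data_num=2))
```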
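Evaluating the perplexity of the Llama-2 model on the ArXiv test set at extreme lengths (a streaming length of 200M tokens):
TRIAL=llama2-infinite-ArXiv-extreme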
CUDA_VISIBLE_DEVICES=0
MASTER_PORT=$(shuf -i 29500-65535 -n 1)
echo port: $MASTER_PORT
mkdir -p $LOG_DIR/$TRIAL
DS_SKIP_CUDA_CHECK=1 PYTHONPATH=. deepspeed --include localhost:$CUDA_VISIBLE_DEVICES --master_port $MASTER_PORT scripts/eval_infinite_ppl.py \
--deepspeed_config configs/zero3_efficient_config.json \
--model ${PATH_TO_LLAMA2_CHECKPOINTS}/llama-2-7b-hf --tokenizer_path ${PATH_TO_LLAMA2_CHECKPOINTS} \
--use_lambda_attention --local_branch 4096 --global_branch 10 --limit_distance 4096 \
--dataset the_pile --dataset_group ArXiv --split test --dataset_dir ${PILE_PATH} \
--streaming_length 200000000 --max_length 128000 --start_data_from 2300 \
--log_dir $LOG_DIR/$TRIAL
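For reference, perplexity is the exponential of the mean per-token negative log-likelihood (NLL). On very long streams the sum can be accumulated chunk by chunk, as in the sketch below. This uses the standard definition and is not the code of the evaluation scripts; the class name `RunningPerplexity` is an assumption.

```python
import math

class RunningPerplexity:
    """Accumulate token NLLs (in nats) across chunks and report exp(mean NLL)."""

    def __init__(self):
        self.total_nll = 0.0
        self.num_tokens = 0

    def update(self, token_nlls):
        """Add the per-token NLLs of one chunk."""
        self.total_nll += sum(token_nlls)
        self.num_tokens += len(token_nlls)

    def value(self):
        if self.num_tokens == 0:
            raise ValueError("no tokens seen yet")
        return math.exp(self.total_nll / self.num_tokens)

ppl = RunningPerplexity()
ppl.update([math.log(4.0)] * 3)   # first chunk
ppl.update([math.log(4.0)])       # second chunk
print(ppl.value())                # close to 4.0, up to floating-point rounding
```

Only two scalars are carried across chunks, so memory use does not grow with the stream length.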
Evaluating generation with the Llama-2 model on the ArXiv test set:
TRIAL=llama2-infinite-generate-ArXiv
mkdir -p $LOG_DIR/$TRIAL
CUDA_VISIBLE_DEVICES=0
MASTER_PORT=$(shuf -i 29500-65535 -n 1)
DS_SKIP_CUDA_CHECK=1 PYTHONPATH=. deepspeed --include localhost:$CUDA_VISIBLE_DEVICES --master_port $MASTER_PORT scripts/eval_generation.py \
--deepspeed_config configs/zero3_efficient_config.json \
--model ${PATH_TO_LLAMA2_CHECKPOINTS}/llama-2-7b-hf --tokenizer_path ${PATH_TO_LLAMA2_CHECKPOINTS} \
--use_lambda_attention --local_branch 4096 --global_branch 100 --limit_distance 4096 \
--dataset the_pile --dataset_group ArXiv --split test --dataset_dir ${PILE_PATH} \
--max_length 33000 \
--max_generation_length 100 --evaluate_metrics --evaluate_positions 4096 8192 12288 16384 \
--log_dir $LOG_DIR/$TRIAL
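One plausible reading of `--evaluate_positions` together with `--max_generation_length` is that each position is a prompt cutoff, and the next tokens of the document serve as the reference continuation. The sketch below implements that reading only; it is an assumption about the protocol, not a description of scripts/eval_generation.py, and `split_at_positions` is a hypothetical helper.

```python
def split_at_positions(token_ids, positions, max_generation_length):
    """Map each cutoff position to (prompt tokens, reference continuation tokens)."""
    pairs = {}
    for pos in positions:
        if pos + max_generation_length > len(token_ids):
            continue  # document too short for this cutoff
        prompt = token_ids[:pos]
        reference = token_ids[pos:pos + max_generation_length]
        pairs[pos] = (prompt, reference)
    return pairs

# Stand-in for a tokenized document of 33000 tokens.
token_ids = list(range(33000))
pairs = split_at_positions(token_ids, [4096, 8192, 12288, 16384], 100)
print({pos: (len(p), len(r)) for pos, (p, r) in pairs.items()})
```

Under this reading, the generated continuation at each position would be scored against the reference with the chosen text metrics.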
First, we need to prepare the passkey retrieval dataset.
for MAX_LENGTH in 2048 3072 4096 5120 6144 7168 8192 10240 12288 14336 16384; do
echo $MAX_LENGTH
python data/passkey_retrieval/create_passkey_data.py \
--token-length $MAX_LENGTH \
--dump-file-path ${PASSKEY_DATA}/${MAX_LENGTH} \
--tokenizer-path ${PATH_TO_LLAMA2_CHECKPOINTS} \
--num-samples 1000
done
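To make the task concrete, here is a minimal sketch of what a passkey retrieval example looks like: a random number hidden in repetitive filler text, followed by a question. The template strings, the word-based length budget, and the function name `make_passkey_sample` are illustrative assumptions; create_passkey_data.py measures length in tokens with the tokenizer and may use different wording.

```python
import random

FILLER = "The grass is green. The sky is blue. The sun is yellow. "
QUESTION = "What is the pass key? The pass key is"

def make_passkey_sample(target_words, rng):
    """Hide a random 5-digit passkey at a random depth in filler text."""
    passkey = str(rng.randint(10000, 99999))
    needle = f"The pass key is {passkey}. Remember it. "
    # Budget is counted in whitespace-separated words (an assumption, see above).
    budget = target_words - len(needle.split()) - len(QUESTION.split())
    n_filler = max(1, budget // len(FILLER.split()))
    chunks = [FILLER] * n_filler
    chunks.insert(rng.randint(0, n_filler), needle)
    return {"prompt": "".join(chunks) + QUESTION, "passkey": passkey}

sample = make_passkey_sample(target_words=200, rng=random.Random(0))
print(len(sample["prompt"].split()), sample["passkey"])
```

The model is expected to continue the prompt with the hidden number.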
Then, let us evaluate the passkey retrieval task.
CUDA_VISIBLE_DEVICES=0
for MAX_LENGTH in 6144 8192 10240 12288 16384; do
TRIAL=llama2-infinite-passkey-$MAX_LENGTH
mkdir -p $LOG_DIR/$TRIAL
MASTER_PORT=$(shuf -i 29500-65535 -n 1)
DS_SKIP_CUDA_CHECK=1 PYTHONPATH=. deepspeed --master_port $MASTER_PORT --include localhost:$CUDA_VISIBLE_DEVICES scripts/eval_downstream_tasks.py \
--deepspeed_config configs/zero3_efficient_config.json \
--model ${PATH_TO_LLAMA2_CHECKPOINTS}/llama-2-7b-hf --tokenizer_path ${PATH_TO_LLAMA2_CHECKPOINTS} \
--use_lambda_attention --local_branch 4096 --global_branch 10 --limit_distance 4096 --triangle_offset 0 \
--top_k_attention 5 --top_k_from_layer 4 \
--dataset passkey_retrieval --dataset_dir ${PASSKEY_DATA} --dataset_group ${MAX_LENGTH} \
--max_generation_length 7 --evaluate_metrics \
--log_dir $LOG_DIR/$TRIAL
done
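Retrieval accuracy can then be computed as the fraction of generations whose first number equals the hidden passkey. The sketch below shows this exact-match scoring under that assumption; it is not the code of data/passkey_retrieval/passkey_retrieval_accuracy.py, and both function names are hypothetical.

```python
import re

def extract_first_number(text):
    """Return the first run of digits in the generated text, or None."""
    match = re.search(r"\d+", text)
    return match.group(0) if match else None

def passkey_accuracy(generations, passkeys):
    """Fraction of generations whose first number matches the true passkey."""
    if len(generations) != len(passkeys):
        raise ValueError("generations and passkeys must have the same length")
    if not passkeys:
        return 0.0
    correct = sum(
        extract_first_number(generated) == key
        for generated, key in zip(generations, passkeys)
    )
    return correct / len(passkeys)

print(passkey_accuracy([" 12345.", " 99999", "no idea"], ["12345", "11111", "22222"]))
```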
Running the Qasper task:
CUDA_VISIBLE_DEVICES=0
DATASET=qasper
TRIAL=llama2-infinite-$DATASET
mkdir -p $LOG_DIR/$TRIAL
MASTER_PORT=$(shuf -i 29500-65535 -n 1)
echo port: $MASTER_PORT
DS_SKIP_CUDA_CHECK=1 PYTHONPATH=. deepspeed --include localhost:$CUDA_VISIBLE_DEVICES --master_port $MASTER_PORT scripts/eval_downstream_tasks.py \
--deepspeed_config configs/zero3_efficient_config_large.json \
--model ${PATH_TO_LLAMA2_CHECKPOINTS}/llama-2-7b-hf --tokenizer_path ${PATH_TO_LLAMA2_CHECKPOINTS} \
--use_lambda_attention --local_branch 4096 --global_branch 10 --limit_distance 4096 --triangle_offset 0 \
--top_k_attention 5 --top_k_from_layer 4 \
--dataset $DATASET --split test --evaluate_metrics \
--max_length 6144 --truncation_side center \
--log_dir $LOG_DIR/$TRIAL
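The `--truncation_side center` option is not explained here. A common meaning of center truncation is to keep the beginning and the end of an over-long input and drop the middle, and the sketch below implements that interpretation. Treat it as an assumption rather than the repository's behavior; `truncate_center` is a hypothetical name.

```python
def truncate_center(token_ids, max_length):
    """Keep the head and tail of the sequence, dropping tokens from the middle."""
    token_ids = list(token_ids)
    if len(token_ids) <= max_length:
        return token_ids
    head = (max_length + 1) // 2   # head gets the extra token when max_length is odd
    tail = max_length - head
    kept_tail = token_ids[-tail:] if tail > 0 else []
    return token_ids[:head] + kept_tail

print(truncate_center(range(10), 4))
```

For question answering over papers, this keeps both the opening context and the question at the end of the prompt.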
@inproceedings{han2024lm,
title={LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models},
author={Han, Chi and Wang, Qifan and Peng, Hao and Xiong, Wenhan and Chen, Yu and Ji, Heng and Wang, Sinong},
booktitle={Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)},
pages={3991--4008},
year={2024}
}