Assaf Ben-Kish,
Itamar Zimerman,
Shady Abu-Hussein,
Nadav Cohen,
Amir Globerson,
Lior Wolf,
Raja Giryes
We present DeciMamba (Decimating-Mamba), the first context extension method for Mamba.
In synthetic tasks, as well as in real-world long-range NLP tasks, DeciMamba is able to extrapolate to sequences that are orders of magnitude longer than the ones seen during training. It does so while requiring fewer computational resources and without requiring retraining.
git clone https://github.com/assafbk/DeciMamba.git
cd DeciMamba
To set up our environment, please run:
conda env create -f environment.yml
conda activate decimamba
Install Mamba:
pip install causal-conv1d==1.1.1
pip install mamba-ssm==1.1.1
For the Passkey Retrieval task, install the required submodule via:
git submodule init
git submodule update
To train or evaluate the Language Modeling task, the PG-19 dataset must first be tokenized. This can be done using the following script:
python ./custom_datasets/tokenize_pg19.py
To tokenize only the evaluation sets, add the `--eval_only` flag (training set tokenization takes significantly longer than the others). The tokenized dataset is saved in `./artifacts/ppl_test/pg19`. If a different location is desired, make sure to change the relevant paths in `./custom_datasets/pg19.py` and `./custom_datasets/tokenize_pg19.py`.
We uploaded the weights of the best DeciMamba and Mamba models for each task:

| Eval ID | Task | Model type | Checkpoint |
|---|---|---|---|
| 0 | Document Retrieval | DeciMamba-130m | 🤗 assafbk/decimamba-130m-squad-doc-ret |
| 1 | Document Retrieval | Mamba-130m | 🤗 assafbk/mamba-130m-squad-doc-ret |
| 2 | Passkey Retrieval | DeciMamba-130m | 🤗 assafbk/decimamba-130m-niah |
| 3 | Passkey Retrieval | Mamba-130m | 🤗 assafbk/mamba-130m-niah |
| 4 | Language Modeling | DeciMamba-130m | 🤗 assafbk/decimamba-130m-pg19 |
| 5 | Language Modeling | Mamba-130m | 🤗 assafbk/mamba-130m-pg19 |
| 6 | Passkey Retrieval - Save Data For Mamba Attn Maps | Mamba-130m | 🤗 assafbk/mamba-130m-niah |

To run the evaluation script:
python finetune_ssm.py --eval <eval_id> --device <device_id>
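Arguments: `<eval_id>` is an Eval ID from the table above (0 to 6), and `<device_id>` is the id of the CUDA device to run on.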
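
The evaluation runs can also be launched from a small Python wrapper. The following is a minimal sketch and not part of the repository: `EVAL_TASKS` and `build_eval_command` are hypothetical names introduced here, the device id is assumed to be a non-negative integer, and the only CLI flags used are the `--eval` and `--device` flags shown above.

```python
import sys

# Eval IDs copied from the checkpoint table above: id -> (task, model type).
EVAL_TASKS = {
    0: ("Document Retrieval", "DeciMamba-130m"),
    1: ("Document Retrieval", "Mamba-130m"),
    2: ("Passkey Retrieval", "DeciMamba-130m"),
    3: ("Passkey Retrieval", "Mamba-130m"),
    4: ("Language Modeling", "DeciMamba-130m"),
    5: ("Language Modeling", "Mamba-130m"),
    6: ("Passkey Retrieval - Save Data For Mamba Attn Maps", "Mamba-130m"),
}


def build_eval_command(eval_id: int, device_id: int) -> list:
    """Return the argv list for one evaluation run (hypothetical helper)."""
    if eval_id not in EVAL_TASKS:
        raise ValueError(
            f"unknown eval id {eval_id}; expected one of {sorted(EVAL_TASKS)}"
        )
    if device_id < 0:
        # Assumption: device ids are non-negative integers (e.g. 2 for cuda:2).
        raise ValueError("device_id must be a non-negative integer")
    return [sys.executable, "finetune_ssm.py",
            "--eval", str(eval_id), "--device", str(device_id)]
```

From the repository root, something like `subprocess.run(build_eval_command(2, 0), check=True)` would then start the DeciMamba passkey retrieval evaluation on device 0.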
To run the training script:
python finetune_ssm.py
All training metrics are displayed on the wandb web page.
The configuration file is `./configs/finetune_ssm_config.json`.
General configurations:
- `mamba_arch`: "deci" / "vanilla"
- `model_device`: "cuda:x" # x is the device id (e.g. 2)

Decimation configurations:
- `activate_decimation`: Initialized to true.
- `decimation_beta`: Initialized to 0.5.
- `decimating_layers`: Initialized to [12].
- `decimation_min_seq_len`: Minimal sequence length allowed after a decimation block. Initialized to 20.
- `decimation_max_p_L_base`: L_base. Initialized to 2000.

Note that decimation is automatically disabled when `mamba_arch != "deci"`.
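
The rule in the note above can be written down explicitly. This is a sketch under the assumption that the configuration is a flat mapping using the key names listed above; `decimation_is_active` is a hypothetical name, and the repository's actual check may differ.

```python
def decimation_is_active(config: dict) -> bool:
    """Decimation runs only if it is switched on AND the architecture is "deci"."""
    # activate_decimation is listed above as initialized to true.
    switched_on = bool(config.get("activate_decimation", True))
    # Any architecture other than "deci" disables decimation.
    return switched_on and config["mamba_arch"] == "deci"
```

For example, a configuration with `mamba_arch` set to "vanilla" is treated as non-decimating even if `activate_decimation` is left at true.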
Additional configurations:
- `output_dir`: Where to save model checkpoints (see the configuration file for the initial value).
- `cache_dir`: Hugging Face cache directory for all models (see the configuration file for the initial value).
- `activate_logging`: Enables wandb logging. Initialized to True.
- `save_steps`: Model saving interval. Initialized to 200 (note: the best model is always saved regardless of this value).
- `eval_steps`: Model evaluation interval. Initialized to 10.
- `max_step`: Maximum number of training steps. Initialized to 20000.

Check out `./configs/finetune_ssm_config.json` for more configurations.
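
Editing the configuration by hand is straightforward, but switching between setups repeatedly can be scripted. The helper below is a minimal sketch, assuming the configuration file is a flat JSON object whose top-level keys are the option names listed above; `update_config` is a hypothetical name and is not provided by the repository.

```python
import json
from pathlib import Path


def update_config(path, **overrides):
    """Load a JSON config, overwrite the given keys, and write it back."""
    path = Path(path)
    config = json.loads(path.read_text())
    unknown = sorted(set(overrides) - set(config))
    if unknown:
        # Refuse to add keys the file does not already define (catches typos).
        raise KeyError(f"keys not present in {path.name}: {unknown}")
    config.update(overrides)
    path.write_text(json.dumps(config, indent=4) + "\n")
    return config
```

A call such as `update_config("./configs/finetune_ssm_config.json", mamba_arch="vanilla", model_device="cuda:0")` would then switch to the vanilla Mamba architecture on device 0, provided both keys exist at the top level of the file.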
For the Document Retrieval task, in `./configs/finetune_ssm_config.json` set:
- `dataset`: "squad_retrieve"

Special configurations:
- `multidoc_num_noise_docs_train`: Number of documents to use in a training sample. Initialized to 11.
- `multidoc_num_noise_docs_eval`: Numbers of documents to use during evaluation (all are evaluated in a single evaluation step). Initialized to [10, 20, 40, 80, 120, 160, 200].

For the Passkey Retrieval task, first make sure that the additional submodule was cloned (see 'Additional Requirements - Passkey Retrieval' above).
Then, in `./configs/finetune_ssm_config.json` set:
- `dataset`: "niah_custom"

Special configurations:
- `niah_train_set_size`: Initialized to 6144.
- `niah_context_len_train`: Context length used during training. Initialized to 2000.
- `niah_needle_depths_eval`: Needle depths used during evaluation. Initialized to [0, 0.25, 0.5, 0.75, 1].
- `niah_context_lens_eval`: Context lengths used during evaluation. Initialized to [1000, 2000, 4000, 8000, 16000, 32000, 64000, 128000].

For the Language Modeling task, first make sure that the PG-19 dataset was tokenized (see 'Additional Requirements - Language Modeling' above).
Next, in `./configs/finetune_ssm_config.json` set:
- `dataset`: "ppl_test"

Special configurations:
- `ppl_test_context_len_train`: Context length used during training. Initialized to 2000.
- `ppl_test_context_lens_eval`: Context lengths used during evaluation. Initialized to [1000, 2000, 4000, 10000, 20000, 30000, 40000, 50000, 60000].
- `ppl_test_num_windows_per_context_len_eval`: Number of context windows evaluated per sample during evaluation. Initialized to 10.
- `ppl_test_pred_len`: Number of samples to evaluate per context length. Initialized to 100.

DeciMamba does not currently support batched inference; therefore, the batch size is set according to `grad_accum_steps`.
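
When each forward pass handles a single sequence, a larger batch is commonly emulated by accumulating gradients over several single-sample passes before each optimizer update, which is presumably what `grad_accum_steps` controls here. The pure-Python sketch below illustrates that generic technique on a toy one-parameter least-squares model; it does not reproduce the repository's training loop, and every name in it is hypothetical.

```python
def grad_single(w: float, x: float, y: float) -> float:
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2.0 * (w * x - y) * x


def accumulated_step(w, samples, grad_accum_steps, lr=0.01):
    """One optimizer update built from `grad_accum_steps` single-sample passes."""
    accumulated = 0.0
    for x, y in samples[:grad_accum_steps]:
        # Scale each per-sample gradient so the sum equals the batch mean.
        accumulated += grad_single(w, x, y) / grad_accum_steps
    return w - lr * accumulated


def batched_step(w, samples, lr=0.01):
    """Reference: one update from the mean gradient over the whole batch."""
    mean_grad = sum(grad_single(w, x, y) for x, y in samples) / len(samples)
    return w - lr * mean_grad
```

With four samples and `grad_accum_steps=4`, `accumulated_step` produces the same parameter update as `batched_step` on the full batch, up to floating-point rounding, which is why accumulation can stand in for a true batch dimension.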
We provide a simple automatic algorithm to find the first decimating layer, which doesn't require computing hidden attention. To use it, simply set `find_deci_layer` to true.
Short explanation: similar to Table 10 in the paper, we scan over candidates for the first decimating layer and select the one with the best performance. The model should perform quite well when using only this layer as a DeciMamba layer.
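
The scan described above amounts to an argmax over candidate layers. The following sketch is an illustration only: `find_first_decimating_layer` and the `evaluate` callback are hypothetical, higher scores are assumed to be better, and the repository's `find_deci_layer` option may be implemented differently.

```python
from typing import Callable, Iterable


def find_first_decimating_layer(
    candidate_layers: Iterable[int],
    evaluate: Callable[[list], float],
) -> tuple:
    """Try each layer as the sole decimating layer and keep the best one."""
    best_layer, best_score = None, float("-inf")
    for layer in candidate_layers:
        # Evaluate the model with decimation applied at this layer only.
        score = evaluate([layer])
        if score > best_score:
            best_layer, best_score = layer, score
    if best_layer is None:
        raise ValueError("candidate_layers is empty")
    return best_layer, best_score


# Made-up toy scores standing in for real evaluation runs.
toy_scores = {8: 0.41, 10: 0.63, 12: 0.88, 14: 0.79}
layer, score = find_first_decimating_layer(
    toy_scores, lambda layers: toy_scores[layers[0]]
)
```

In practice the `evaluate` callback would run a validation pass with `decimating_layers` set to the single candidate layer, so the cost of the scan grows linearly with the number of candidates.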
To calculate Mamba's hidden attention maps, first save the required data by running:
python finetune_ssm.py --eval 6 --device <device_id>
Then open `calculate_mamba_attn_maps.ipynb` and run the notebook.

We thank the authors of Mamba for their amazing work and contribution: https://github.com/state-spaces/mamba
Additionally, we thank the authors of BABILong, as we use a version of their code for the passkey retrieval task: https://github.com/booydar/babilong
If you find this work useful, please cite the following:
@misc{benkish2024decimambaexploringlengthextrapolation,
title={DeciMamba: Exploring the Length Extrapolation Potential of Mamba},
author={Assaf Ben-Kish and Itamar Zimerman and Shady Abu-Hussein and Nadav Cohen and Amir Globerson and Lior Wolf and Raja Giryes},
year={2024},
eprint={2406.14528},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2406.14528},
}