abertsch72 / unlimiformer

Public repo for the NeurIPS 2023 paper "Unlimiformer: Long-Range Transformers with Unlimited Length Input"
MIT License

Why is the inference so slow? #53

Closed: cckao closed this issue 7 months ago

cckao commented 10 months ago

Hi,

Unlimiformer is amazing and could really help me. However, inference is so slow that I believe I might be doing something wrong. Please help me. Thank you.

The task was pretty simple: I asked the LM to optimize the following Python code:

# bad_python_codes.py
total = 0
total += 0
total += 1
total += 2
total += 3
total += 4

I ran vanilla text generation with the following command, and model.generate(...) took 3 seconds to complete:

python run_generation.py \
--model_type llama \
--model_name_or_path /path/to/CodeLlama-13b-Instruct-hf \
--prefix "<s>[INST] <<SYS>>\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n<</SYS>>\n\n Optimize following Python codes: " \
--prompt bad_python_codes.py \
--suffix " [/INST]" \
--test_unlimiformer False \
--fp16 \
--length 10 \
--use_datastore False

When I enabled Unlimiformer, model.generate(...) took 1 minute and 20 seconds to complete:

python run_generation.py \
--model_type llama \
--model_name_or_path /path/to/CodeLlama-13b-Instruct-hf \
--prefix "<s>[INST] <<SYS>>\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n<</SYS>>\n\n Optimize following Python codes: " \
--prompt bad_python_codes.py \
--suffix " [/INST]" \
--test_unlimiformer True \
--fp16 \
--length 10 \
--layer_begin 0 \
--index_devices 1 \
--datastore_device 1 \
--use_datastore True

AshwinRamachandran2002 commented 10 months ago

The FAISS retrieval takes a lot of time; it is performed at every head and every layer.
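
To make that cost pattern concrete, here is a minimal, hedged sketch (toy numbers, not Unlimiformer's actual code): the number of k-NN queries grows as layers * heads * generated tokens, so even fast individual searches add up.

# A toy illustration (not Unlimiformer's code) of why per-head, per-layer
# retrieval is expensive: every generated token triggers one top-k search
# per (layer, head) pair. The numbers assume a 13B LLaMA-style model.
import numpy as np
import faiss

head_dim = 128                 # 5120 hidden size / 40 heads
n_layers, n_heads = 40, 40
n_keys, top_k = 4096, 16       # indexed input tokens; neighbors per search

index = faiss.IndexFlatIP(head_dim)
index.add(np.random.rand(n_keys, head_dim).astype("float32"))

new_tokens = 10                # matches --length 10 above
n_searches = new_tokens * n_layers * n_heads   # 16,000 searches total
queries = np.random.rand(n_searches, head_dim).astype("float32")
scores, ids = index.search(queries, top_k)
print(f"{n_searches} FAISS searches for {new_tokens} generated tokens")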

urialon commented 10 months ago

Hi @cckao and @AshwinRamachandran2002, thank you for your interest in our work!

Yes, running Unlimiformer is indeed slower. We found that using --layer_begin X with a value of X that is at least half the number of layers (that is, if the model has 40 layers, X should be at least 20) helps both speed and the quality of the output.

Additionally, if your input is not too long (<10k tokens), using --use_datastore False may speed things up a bit.
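
Putting both suggestions together, a hedged example of the adjusted command (an illustration only; CodeLlama-13B has 40 decoder layers, so --layer_begin 20 follows the rule of thumb above, and the prefix is abbreviated here):

python run_generation.py \
--model_type llama \
--model_name_or_path /path/to/CodeLlama-13b-Instruct-hf \
--prefix "..." \
--prompt bad_python_codes.py \
--suffix " [/INST]" \
--test_unlimiformer True \
--fp16 \
--length 10 \
--layer_begin 20 \
--use_datastore False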

Let us know if you have any questions!

Best,
Uri

cckao commented 10 months ago

Hi @urialon and @AshwinRamachandran2002,

Thanks for your comments. --use_datastore False speeds things up a lot.