abertsch72 / unlimiformer

Public repo for the NeurIPS 2023 paper "Unlimiformer: Long-Range Transformers with Unlimited Length Input"
MIT License

Hardware Requirement for Running Llama-2 inferences #61

Open shang-zhu opened 7 months ago

shang-zhu commented 7 months ago

Hi, I successfully ran inference with Llama-2-7b and Unlimiformer, but I ran into memory errors when jumping to larger models. What are the minimum GPU memory requirements for running the 13b and 70b models? Thank you!

abertsch72 commented 7 months ago

Thanks for your interest in our work!

The memory required depends on two things:

  1. The base memory needed for that model (as you'd expect!). I haven't personally tried the 70b model, but this NVIDIA guide gives numbers that look pretty reasonable to me:

> The file size of the model varies on how large the model is. Llama2-7B-Chat requires about 30GB of storage. Llama2-13B-Chat requires about 50GB of storage. Llama2-70B-Chat requires about 150GB of storage.

  2. The number of layers you apply Unlimiformer at. The good news here is that the additional cost from Unlimiformer doesn't depend on the model size (since we're only saving hidden states, and the models all have the same hidden dimension). You can calculate this for your input/use case by looking at the difference in GPU memory used between your 7b Llama+Unlimiformer setup and the base 7b model (see the sketch after this list).
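For reference, here's a minimal sketch (my own, not from the repo) of how you might measure that difference with PyTorch's memory stats; `run_base_7b` and `run_unlimiformer_7b` are hypothetical callables standing in for however you invoke generation in each setup:

```python
# Minimal sketch: measure Unlimiformer's memory overhead by running the
# same inference with and without it and comparing peak GPU memory.
import torch

def peak_memory_gb(run_inference) -> float:
    """Run a callable once and return peak GPU memory allocated, in GB."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    run_inference()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1e9

# Hypothetical usage -- run_base_7b / run_unlimiformer_7b are stand-ins
# for your own generation calls:
#   base_gb  = peak_memory_gb(run_base_7b)
#   unlim_gb = peak_memory_gb(run_unlimiformer_7b)
#   print(f"Unlimiformer overhead: {unlim_gb - base_gb:.1f} GB")
```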

As a general recipe, I'd guess (amount of memory for the model) + (2-3 GB per layer you'd like to apply Unlimiformer at) will get you pretty close to the amount needed, but this depends on how long your inputs are and whether you choose flat or trained indices.
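To make that recipe concrete, a back-of-envelope estimator might look like this (a sketch built only on the thread's own rough numbers: the storage figures quoted above as a proxy for base-model memory, plus 2-3 GB per Unlimiformer layer; none of these are guarantees):

```python
# Rough per-model memory figures (GB), echoing the NVIDIA guide quoted above.
BASE_MODEL_GB = {"llama2-7b": 30, "llama2-13b": 50, "llama2-70b": 150}

def estimate_memory_gb(model: str, unlimiformer_layers: int,
                       per_layer_gb: float = 2.5) -> float:
    """Estimate total GPU memory: base model + per-layer Unlimiformer cost."""
    return BASE_MODEL_GB[model] + unlimiformer_layers * per_layer_gb

# e.g. 13b with Unlimiformer at 4 layers: ~50 + 4 * 2.5 = ~60 GB
print(estimate_memory_gb("llama2-13b", unlimiformer_layers=4))
```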

zhangwenhao666 commented 5 months ago

I'd like to ask about inputs of around 100,000 tokens: using the Llama-2-13b model, how long would inference take on an H100 GPU?