Glaciohound / LM-Infinite

Implementation of NAACL 2024 Outstanding Paper "LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models"
https://arxiv.org/abs/2308.16137
MIT License

Improved GPU memory usage but slower inference speed? #8

Open ys-zong opened 6 months ago

ys-zong commented 6 months ago

Hi, thanks for the nice work! I tried to use the following code to enable LM-Infinite for Llama, following the README:

import torch
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', torch_dtype=torch.bfloat16, device_map="cuda", low_cpu_mem_usage=True)

from models.llama import convert_llama_model
model = convert_llama_model(model, 4096, 10)

and then do the inference as usual. The GPU memory usage is lower than with regular attention, but inference becomes much slower (roughly 10x). I'm using an A100 GPU, and I checked GPU utilization: it's very low, around 10%. I wonder if you have any idea why this happens? Many thanks.
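
For reference, a minimal sketch of how the timing comparison can be reproduced (the prompt, max_new_tokens, and tokenizer usage here are illustrative placeholders, not the exact script):

# Illustrative timing sketch; prompt and generation settings are placeholders.
import time
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')
inputs = tokenizer("<a long prompt here>", return_tensors="pt").to("cuda")

torch.cuda.synchronize()  # make sure pending GPU work is done before timing
start = time.time()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256)
torch.cuda.synchronize()  # wait for generation to finish before stopping the clock
new_tokens = output.shape[1] - inputs['input_ids'].shape[1]
print(f"Generated {new_tokens} tokens in {time.time() - start:.2f}s")

Running the same snippet with the unconverted model versus the converted one (convert_llama_model applied) is how the ~10x slowdown was observed.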