Hi, thanks for the nice work! I tried to use the following code to enable LM-Infinite for Llama, following the README:
import torch
from transformers import LlamaForCausalLM
from models.llama import convert_llama_model

model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', torch_dtype=torch.bfloat16, device_map="cuda", low_cpu_mem_usage=True)
# enable LM-Infinite attention (4096-token local window, 10 leading global tokens)
model = convert_llama_model(model, 4096, 10)
and then do inference as usual. The GPU memory usage is lower than with regular attention, but inference becomes much slower (roughly 10x). I'm using an A100 GPU, and I checked GPU utilization: it's very low, around 10%. I wonder if you have any idea why this happens? Many thanks.
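For reference, here is roughly how I measured the slowdown (a minimal sketch; the prompt, generation length, and the timed_generate helper are just illustrative, not part of the repo):

import time
import torch
from transformers import AutoTokenizer, LlamaForCausalLM
from models.llama import convert_llama_model

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")

def timed_generate(model, max_new_tokens=256):
    # synchronize so we time actual GPU work, not just async kernel launches
    torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    return time.time() - start

# baseline: regular attention
model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', torch_dtype=torch.bfloat16, device_map="cuda", low_cpu_mem_usage=True)
print("regular attention:", timed_generate(model))

# after converting to LM-Infinite
model = convert_llama_model(model, 4096, 10)
print("LM-Infinite:", timed_generate(model))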