Closed zhhvvv closed 3 months ago
Yes, the current implementation does not address efficiency at all. Our FlashAttention version will come soon!
@zhhvvv have you tried:
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, load_in_4bit=True, device_map="auto")
Hi! We just released the FlashAttention implementation with transformers==4.38.2. You can try it now.
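For anyone looking for the standard way to do this: recent transformers versions (4.36+) let you request FlashAttention-2 at load time. This is a sketch assuming the flash-attn package is installed and you are on a supported (Ampere-or-newer) GPU; `model_path` is a placeholder for your local checkpoint, and this may differ from the exact invocation this repo uses:

```python
import torch
from transformers import AutoModelForCausalLM

# model_path is a placeholder for your local Llama-2 checkpoint directory.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # supported since transformers 4.36
    device_map="auto",
)
```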
I use 2 GPUs. During inference (with llama-2-7B and the 10k questions in the demo), each GPU needs about 50GB of memory on top of the loaded model. But when I use 8 GPUs, inference still needs about 50GB of memory per card. Why is that?
Also, I cannot test the llama-2-70B model: even with 8 GPUs running inference simultaneously, the memory required per GPU is huge.
Are there any ways to optimize resource usage during inference? Have you tried testing your method on the 70B model with 10k-token QA?
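For scale, here is a back-of-the-envelope estimate of why long-context inference is so memory-hungry without FlashAttention. The shape constants are from the published Llama-2-7B config (32 layers, 32 heads, hidden size 4096, fp16); the rest is a hedged sketch, not numbers profiled from this repo:

```python
# Rough fp16 memory estimate for Llama-2-7B attention at a 10k-token context.
LAYERS, HEADS, HIDDEN, BYTES_FP16 = 32, 32, 4096, 2
SEQ = 10_000

# KV cache: keys + values for every layer, kept for the whole sequence.
kv_cache = 2 * LAYERS * HIDDEN * BYTES_FP16 * SEQ
print(f"KV cache: {kv_cache / 1e9:.1f} GB")  # 5.2 GB

# A naive (non-flash) attention kernel materializes a [heads, seq, seq]
# score matrix per layer, and softmax typically needs a second copy of it.
scores_per_layer = HEADS * SEQ * SEQ * BYTES_FP16
print(f"Score matrix per layer: {scores_per_layer / 1e9:.1f} GB")  # 6.4 GB
```

The quadratic score matrices (several GB per layer, plus temporary copies) dwarf the weights at 10k tokens, which is exactly the term FlashAttention removes by never materializing the full matrix.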