datamllab / LongLM

[ICML'24 Spotlight] LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning
https://arxiv.org/pdf/2401.01325.pdf
MIT License

Requires excessive computing resources during inference #9

Closed. zhhvvv closed this issue 3 months ago.

zhhvvv commented 5 months ago

I use 2 GPUs. During inference (with llama-2-7B and the 10k-token questions in the demo), each GPU needs about 50 GB of additional memory on top of the loaded model weights. But when I use 8 GPUs, inference still needs about 50 GB per card. Why is that?

Also, I cannot test the llama-2-70B model: even with 8 GPUs running inference together, the GPU memory required per card is huge.

Are there any ways to optimize resource usage during inference? Have you tried your method on the 70B model with 10k-token QA?

Mooler0410 commented 5 months ago

Yes, the current implementation does not address efficiency at all. Our FlashAttention version will come soon!

kungfu-eric commented 5 months ago

@zhhvvv have you tried: `model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, load_in_4bit=True, device_map="auto")`?
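
For reference, a minimal, self-contained version of that suggestion (the checkpoint name is just an example, and `load_in_4bit=True` needs the bitsandbytes package installed):

```python
# Sketch of the 4-bit loading suggestion above; the checkpoint is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"  # example checkpoint, adjust as needed

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,   # compute dtype for the non-quantized modules
    load_in_4bit=True,           # quantize weights to 4-bit via bitsandbytes
    device_map="auto",           # shard layers across the available GPUs
)
```

Quantizing the weights only shrinks the model itself; the long-context activations and KV cache still grow with sequence length, so it helps with the 70B weights but not by itself with the per-token inference memory.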

Mooler0410 commented 3 months ago

Hi! We just released the FlashAttention implementation, targeting transformers==4.38.2. You may try it.
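
For anyone landing here later, a minimal sketch of loading the base model with the FlashAttention-2 backend under transformers==4.38.2. This only covers the plain Hugging Face side; how Self-Extend is applied on top of the loaded model should follow the repo's README. The checkpoint name is a placeholder, and the flash-attn package plus fp16/bf16 weights on a supported GPU are assumed:

```python
# Minimal sketch: load Llama-2 with the FlashAttention-2 backend
# (attn_implementation is available in transformers >= 4.36).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"  # example checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # use FlashAttention-2 kernels
    device_map="auto",
)
```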