dengxiaotian123 closed this issue 8 months ago
Hello @dengxiaotian123, I tried to run cases with a length of 16k. My hardware is as follows:
GPU: 4 x RTX 6000 Ada, 48 GB each
When I tried running on a single GPU, the tensor could not be allocated: the KV cache is too large to fit, even after converting the model to int4 quantization.
In another attempt, I used the DeepSpeed inference engine for tensor parallelism. It ran, but the output is garbage at both 8192 and 16384 ctx_length!
I hope this helps! I am not sure what I am doing wrong; can anyone help me too?
Could you please provide your scripts so that we can help you with the issue? With the DeepSpeed inference engine and 4x A6000, 18k should be OK.
After reviewing the code sent to us, we believe the issue comes from the fact that Baichuan-7B has 40 layers while LLaMA-2-7B has 32. For long sequences, memory is mainly occupied by self-attention, which is proportional to the number of layers. Hence, at 16k, LLaMA-2-7B can be run on 2x 80GB A100 while Baichuan cannot.
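A back-of-envelope KV-cache estimate shows how the layer count alone changes the memory footprint. This is only a sketch: it assumes a hidden size of 4096 for both 7B models and fp16 (2-byte) cache entries, and uses the layer counts quoted above; it ignores attention activations and weights, so real usage is higher.

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, batch=1, bytes_per_elem=2):
    """Estimate KV-cache size: K and V each store one hidden_size vector
    per token per layer (fp16 = 2 bytes per element by default)."""
    return 2 * num_layers * batch * seq_len * hidden_size * bytes_per_elem

GiB = 1024 ** 3
baichuan = kv_cache_bytes(num_layers=40, hidden_size=4096, seq_len=16384) / GiB
llama2 = kv_cache_bytes(num_layers=32, hidden_size=4096, seq_len=16384) / GiB
print(f"Baichuan-7B (40 layers), 16k ctx: {baichuan:.1f} GiB")  # 10.0 GiB
print(f"LLaMA-2-7B  (32 layers), 16k ctx: {llama2:.1f} GiB")    # 8.0 GiB
```

Under these assumptions the 40-layer model needs 40/32 = 1.25x the KV-cache memory of the 32-layer one at the same context length, on top of the quadratic attention score matrices that also scale with layer count.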
Thanks a lot!
Has anyone successfully run cases with a length of 16k? What kind of machine resources support the experiment?