dengxiaotian123 closed this issue 8 months ago
Hello @dengxiaotian123, I tried to run cases with a length of 16k. My hardware is as follows:
GPU: 4 x RTX 6000 Ada, 48 GB each
When I tried running on a single GPU, the tensor could not be allocated: the KV cache is too large to fit, even after converting the model to int4 quantization.
In another attempt, I used the DeepSpeed inference engine for tensor parallelism. It ran, but the output is garbage at both 8192 and 16384 ctx_length!
I hope this helps! I am not sure what I am doing wrong; can anyone help me too?
Could you please provide your scripts so that we can help you with the issue? With the DeepSpeed inference engine and 4x A6000, 18k should be OK.
After reviewing the code sent to us, we believe the issue comes from the fact that Baichuan-7B has 40 layers while LLaMA-2-7B has 32. For long sequences, memory is mainly occupied by self-attention, which is proportional to the number of layers. Hence, at 16k, LLaMA-2-7B can be run on 2x 80GB A100 while Baichuan cannot.
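A back-of-envelope KV-cache estimate shows how the layer count alone changes the memory footprint. This is only a sketch: it assumes a hidden size of 4096 for both 7B models and fp16 (2-byte) cache entries, and uses the layer counts quoted above; it ignores attention activations and weights, so real usage is higher.

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, batch=1, bytes_per_elem=2):
    """Estimate KV-cache size: K and V each store one hidden_size vector
    per token per layer (fp16 = 2 bytes per element by default)."""
    return 2 * num_layers * batch * seq_len * hidden_size * bytes_per_elem

GiB = 1024 ** 3
baichuan = kv_cache_bytes(num_layers=40, hidden_size=4096, seq_len=16384) / GiB
llama2 = kv_cache_bytes(num_layers=32, hidden_size=4096, seq_len=16384) / GiB
print(f"Baichuan-7B (40 layers), 16k ctx: {baichuan:.1f} GiB")  # 10.0 GiB
print(f"LLaMA-2-7B  (32 layers), 16k ctx: {llama2:.1f} GiB")    # 8.0 GiB
```

Under these assumptions the 40-layer model needs 40/32 = 1.25x the KV-cache memory of the 32-layer one at the same context length, on top of the quadratic attention score matrices that also scale with layer count.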
Thanks a lot!
Has anyone successfully run cases with a length of 16k? What kind of machine resources support the experiment?