bulaikexiansheng opened 3 months ago

Thanks for your excellent work! But I ran into some problems while trying to use your framework.

I tried to run `offloading.py` and `offloading_TP.py` on a machine with 4x RTX 4090. As shown in the figure below, the progress bar has not updated for a long time, even though GPU utilization is close to 100%. The command I used:

```
CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=48 torchrun --nproc_per_node=2 test/offloading_TP.py --budget 12288 --prefill 130048 --dataset gs --target llama-7B-128K --on_chip 9 --gamma 16 --target /TriForce/models/Yarn-Llama-2-7b-128k
```

Is there something wrong?
Hello, thanks for your interest in our work! Nothing is wrong; this is expected. The pre-filling phase is quite long because of offloading and limited compute resources. We use iterative encoding to avoid OOM, which makes pre-filling much slower. One possible solution is to use a pre-filling acceleration technique such as MInference.
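For context, "iterative encoding" here means processing the prompt in chunks while carrying the KV cache forward, so peak activation memory stays bounded. A minimal sketch using the Hugging Face API follows; the model name, placeholder prompt, and chunk size are illustrative assumptions, not TriForce's actual implementation:

```python
# Minimal sketch of iterative (chunked) pre-filling to bound peak memory.
# Assumes a Hugging Face causal LM; model name and chunk size are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NousResearch/Yarn-Llama-2-7b-128k"  # assumption: any long-context causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

long_prompt = "some very long document ..."  # placeholder for a 100K+ token prompt
input_ids = tokenizer(long_prompt, return_tensors="pt").input_ids.to(model.device)

chunk_size = 4096  # smaller chunks lower peak activation memory but add iterations
past = None
with torch.no_grad():
    for start in range(0, input_ids.shape[1], chunk_size):
        chunk = input_ids[:, start:start + chunk_size]
        out = model(input_ids=chunk, past_key_values=past, use_cache=True)
        past = out.past_key_values  # KV cache keeps growing with the processed prefix
```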
Thanks for your reply! How long did it take you to test on the RTX 4090? Would it be faster on an A100-80G?
Maybe 10-20 minutes on 4090s. Yes: on an A100-80G you do not need to offload, since everything already fits in GPU memory. The expected pre-filling time on an A100 is ~2 minutes.
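A rough back-of-the-envelope (assuming Llama-2-7B geometry: 32 layers, hidden size 4096, fp16 KV cache) shows why a 128K context fits on an A100-80G but not on a 24 GB RTX 4090:

```python
# Back-of-the-envelope KV-cache size for a 128K context with Llama-2-7B
# geometry (32 layers, hidden size 4096, fp16). An estimate, not a measurement.
layers, hidden, bytes_fp16 = 32, 4096, 2
context_len = 128 * 1024

kv_per_token = 2 * layers * hidden * bytes_fp16      # K + V: 512 KiB per token
kv_total_gib = kv_per_token * context_len / 2**30    # ~64 GiB for the full context
weights_gb = 7e9 * bytes_fp16 / 1e9                  # ~14 GB of fp16 weights

print(f"KV cache ~{kv_total_gib:.0f} GiB + weights ~{weights_gb:.0f} GB")
# ~64 GiB + ~14 GB fits in 80 GB (A100) but far exceeds 24 GB (RTX 4090),
# which is why the 4090 setup must offload and pre-fill slowly.
```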
I noticed the `--prefill` command-line parameter and thought I could set it smaller just to run through the code. I set it to 104, but it reports the following error:
104 is probably too small; try 32768 (32K).
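For reference, that would mean rerunning the original command with only the prefill length changed (all other flags kept exactly as reported above):

```
CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=48 torchrun --nproc_per_node=2 test/offloading_TP.py --budget 12288 --prefill 32768 --dataset gs --target llama-7B-128K --on_chip 9 --gamma 16 --target /TriForce/models/Yarn-Llama-2-7b-128k
```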
Thanks, it works!

```
[Overall Latency]: 0.08335429636659987
[Overall Avg Accepted Tokens]: 11.336667760098464
```
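One way to read those numbers, assuming the latency is seconds per generated token and the accepted-token count is per speculation round with `--gamma 16` (both assumptions, since the log format is not documented here):

```python
# Interpreting the log above. Assumes [Overall Latency] is seconds per generated
# token and [Overall Avg Accepted Tokens] is per round with --gamma 16.
latency_s_per_token = 0.08335429636659987
throughput = 1 / latency_s_per_token        # ~12 tokens/s end-to-end

gamma, avg_accepted = 16, 11.336667760098464
acceptance = avg_accepted / gamma           # ~71% of drafted tokens accepted
print(f"{throughput:.1f} tok/s, {acceptance:.0%} acceptance per round")
```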