Infini-AI-Lab / TriForce

[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
https://infini-ai-lab.github.io/TriForce/

The progress bar does not update for a long time #9

Open bulaikexiansheng opened 3 months ago

bulaikexiansheng commented 3 months ago

Thanks for your excellent work! But I ran into some issues when trying to use your framework.

I tried to run offloading.py and offloading_TP.py on a machine with 4× RTX 4090 GPUs. As shown in the screenshot below, the progress bar has not updated for a long time, even though GPU utilization is close to 100%.

[screenshot: progress bar stalled during pre-filling]

The command i used:

CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=48 torchrun --nproc_per_node=2 test/offloading_TP.py --budget 12288 --prefill 130048 --dataset gs --target llama-7B-128K --on_chip 9 --gamma 16 --target /TriForce/models/Yarn-Llama-2-7b-128k

Is there something wrong?

preminstrel commented 3 months ago

Hello, thanks for your interest in our work! Nothing is wrong (this is expected). This is because the pre-filling phase is quite long due to offloading and limited computation resources. We are using iterative encoding to avoid OOM, which will make pre-filling much slower. One possible solution is to use pre-filling acceleration techniques like MInference.
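To make the idea concrete, here is a minimal sketch of iterative (chunked) encoding using a Hugging Face-style API. This is only an illustration of the technique, not TriForce's actual implementation; the model path is taken from the command above, and the chunk size is an arbitrary choice:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Path taken from the command above; any long-context causal LM would do.
model_path = "/TriForce/models/Yarn-Llama-2-7b-128k"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

@torch.no_grad()
def chunked_prefill(input_ids: torch.Tensor, chunk_size: int = 4096):
    """Encode a long prompt chunk by chunk, reusing the KV cache."""
    past_key_values = None
    for start in range(0, input_ids.shape[1], chunk_size):
        chunk = input_ids[:, start : start + chunk_size]
        out = model(chunk, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values  # cache grows by one chunk per step
    return past_key_values
```

Each forward pass only materializes activations for one chunk, which bounds peak memory but serializes the pre-fill into many small steps; that is why the progress bar can sit still for a long time while the GPUs stay near 100%.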

bulaikexiansheng commented 3 months ago

Hello, thanks for your interest in our work! Nothing is wrong (this is expected). This is because the pre-filling phase is quite long due to offloading and limited computation resources. You need to wait. One possible solution is to use pre-filling acceleration techniques like MInference.

Thanks for your reply! How long did it take you to test on the RTX 4090? Will it be faster if I run it on an A100-80G?

preminstrel commented 3 months ago

Maybe 10-20 minutes on 4090s. Yes, on an A100-80G you do not need to offload, since the model already fits in GPU memory. The expected pre-filling time is ~2 minutes on an A100.

bulaikexiansheng commented 3 months ago

I noticed the --prefill command-line parameter and thought I could set it smaller just to get the code running. I set it to 104, but it reports the following error: [screenshot of the error]

preminstrel commented 3 months ago

104 may be too small; try 32768 (32K).
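For intuition, here is a hypothetical sanity check (not from the TriForce codebase) of why a tiny --prefill fails: the command above allocates a retrieval KV cache --budget of 12288 tokens, which cannot be drawn from a prefilled context of only 104 tokens.

```python
# Hypothetical guard, not actual TriForce code: the retrieval KV cache is
# selected out of the prefilled context, so --prefill must exceed --budget,
# and chunk-based retrieval typically wants prefill aligned to a chunk size.
def check_prefill_args(prefill: int, budget: int, chunk_size: int = 8) -> None:
    if prefill <= budget:
        raise ValueError(
            f"--prefill ({prefill}) must be larger than the KV --budget ({budget})"
        )
    if prefill % chunk_size != 0:
        raise ValueError(
            f"--prefill ({prefill}) should be a multiple of chunk_size ({chunk_size})"
        )

check_prefill_args(prefill=32768, budget=12288)    # OK
# check_prefill_args(prefill=104, budget=12288)    # raises: prefill < budget
```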

bulaikexiansheng commented 3 months ago

104 may be too small; try 32768 (32K).

Thanks, it works!

[Overall Latency]: 0.08335429636659987
[Overall Avg Accepted Tokens]: 11.336667760098464
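For reference, assuming the reported latency is in seconds per generated token and using the gamma of 16 from the commands above, these numbers work out to roughly 12 tokens/s with about 71% of drafted tokens accepted per speculation round:

```python
latency = 0.08335429636659987      # assumed unit: seconds per generated token
avg_accepted = 11.336667760098464  # tokens accepted per speculation round

throughput = 1 / latency           # ~12.0 tokens/s
acceptance = avg_accepted / 16     # ~0.71 of the gamma=16 drafted tokens
print(f"{throughput:.1f} tok/s, {acceptance:.0%} of drafted tokens accepted")
```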