Fix offloading / VRAM budget bugs

SJTU-IPADS / PowerInfer

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs

MIT License


Fix offloading / VRAM budget bugs #85

Open hodlen opened 10 months ago

hodlen commented 10 months ago

After releasing online FFN offloading, we have found new issues in:

[x] Decoding bug: #77.
[x] Python module issue: #55, #78.
[ ] Inaccuracy when offloading under a VRAM budget: #26, #38.

Some users have also reported FFN offloading errors on social media that may need further investigation.

hodlen commented 10 months ago

We should also account for VRAM overhead at different batch sizes. As the batch size grows, CUDA OOM becomes more likely during the prompt phase.
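To illustrate the concern, here is a rough sketch of how batch-dependent scratch memory could be subtracted from a VRAM budget before deciding how much FFN weight to offload. This is not PowerInfer's actual accounting: the function names, the two-buffer model (one hidden-state buffer plus one FFN intermediate buffer per batch), and the 4-byte element size are all assumptions made for illustration.

```python
def prompt_phase_overhead_bytes(n_batch, n_embd, n_ff, bytes_per_elem=4):
    """Hypothetical estimate of per-batch scratch memory in the prompt phase.

    Assumes one hidden-state buffer (n_batch x n_embd) and one FFN
    intermediate buffer (n_batch x n_ff). Real overhead depends on the
    backend and is likely larger.
    """
    return n_batch * (n_embd + n_ff) * bytes_per_elem


def offload_budget_bytes(vram_budget_bytes, n_batch, n_embd, n_ff, bytes_per_elem=4):
    """VRAM left for offloaded weights after reserving the scratch estimate."""
    overhead = prompt_phase_overhead_bytes(n_batch, n_embd, n_ff, bytes_per_elem)
    # Clamp at zero: a budget smaller than the overhead leaves nothing to offload.
    return max(0, vram_budget_bytes - overhead)


if __name__ == "__main__":
    one_gib = 1024 ** 3
    # Illustrative dimensions only (n_embd=4096, n_ff=11008).
    for n_batch in (32, 128, 512):
        left = offload_budget_bytes(one_gib, n_batch, 4096, 11008)
        print(f"n_batch={n_batch}: {left} bytes left for weights")
```

Under this model the reserved overhead grows linearly with batch size, so a budget that fits at a small batch can run out at a larger one.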

qw1319 commented 4 months ago

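Has this issue been resolved? When I run it directly, I also see that gpu_offload does not load the weights ahead of time. First, it fails with an error saying the activation folder does not exist.

After I manually added a (fake) activation folder, running the Python step still fails with an error.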