SJTU-IPADS / PowerInfer

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
MIT License

Support setting VRAM budget for `examples/server` #106

Closed · hodlen closed this 10 months ago

hodlen commented 10 months ago

Also, I set the default batch size to 32 (instead of 512) to avoid CUDA OOM during the prompt phase and improve server stability.
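For context, a VRAM budget and a smaller batch size would be passed on the server's command line. The sketch below is illustrative, assuming the `server` binary follows PowerInfer's CLI conventions (`--vram-budget` in GiB, as documented for `main`, and llama.cpp's `-b` batch-size flag); the model path is a placeholder, not a real file.

```shell
# Hypothetical invocation; the model path is illustrative.
./build/bin/server \
  -m ./ReluLLaMA-7B/llama-7b-relu.powerinfer.gguf \
  --vram-budget 8 \
  -b 32
# --vram-budget caps GPU memory use (here ~8 GiB); -b 32 keeps
# prompt-phase batches small enough to avoid CUDA OOM.
```

With a smaller `-b`, long prompts are processed in more, smaller chunks, trading some prompt-phase throughput for a lower peak VRAM footprint.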