SJTU-IPADS / PowerInfer

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
MIT License

Will using only CPU be faster than llama.cpp? #140

Open liutt1312 opened 8 months ago

liutt1312 commented 8 months ago

Will using only CPU be faster than llama.cpp?

hodlen commented 8 months ago

Compared with llama.cpp in CPU-only mode, yes. PowerInfer can reduce end-to-end FLOPs by roughly 50%, depending on the model architecture and its sparsity, so a ~2x speedup in CPU decoding would be expected.
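To illustrate where the FLOP savings come from (this is a minimal sketch, not PowerInfer's actual implementation): with a ReLU-style FFN, neurons whose pre-activation is negative output exactly zero, so if a predictor identifies the active neurons ahead of time, only the corresponding rows of the up-projection and columns of the down-projection need to be computed. A NumPy sketch, assuming a perfect (oracle) activity predictor:

```python
import numpy as np

def ffn_dense(x, W1, W2):
    # Standard ReLU FFN: touches every hidden neuron.
    h = np.maximum(W1 @ x, 0.0)
    return W2 @ h

def ffn_sparse(x, W1, W2, active):
    # Only compute the neurons predicted to be active; inactive ReLU
    # outputs are zero and contribute nothing to the result.
    h_active = np.maximum(W1[active] @ x, 0.0)
    return W2[:, active] @ h_active

rng = np.random.default_rng(0)
d, d_ff = 64, 256
x = rng.standard_normal(d)
W1 = rng.standard_normal((d_ff, d))
W2 = rng.standard_normal((d, d_ff))

# Oracle predictor for illustration: the truly active neurons.
# (PowerInfer uses learned predictors instead of an oracle.)
active = np.flatnonzero(W1 @ x > 0)

dense = ffn_dense(x, W1, W2)
sparse = ffn_sparse(x, W1, W2, active)
assert np.allclose(dense, sparse)  # identical output, fewer FLOPs

print(f"fraction of neurons computed: {len(active) / d_ff:.2f}")
```

The matmul FLOPs scale with the number of hidden neurons touched, so computing only the active fraction cuts FFN cost proportionally; models with stronger activation sparsity (e.g. far fewer than half the neurons firing) see correspondingly larger savings.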