SJTU-IPADS / PowerInfer

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs

Optimize CUDA sparse operator with Tensor Core #97

Open hodlen opened 10 months ago

hodlen commented 10 months ago

Currently, PowerInfer uses CUDA cores for sparse operator computation, which is inefficient during the prompt (prefill) phase. To better support multi-batch serving, PowerInfer plans to use Tensor Cores to further optimize the sparse operators.
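
For context, a minimal sketch of the kind of Tensor Core building block such an optimization would likely rest on, using CUDA's WMMA API: each warp accumulates one 16x16 output tile via `mma_sync`, and a sparse operator could skip loading tiles whose neurons are predicted inactive. This is not PowerInfer's implementation; the kernel name, tiling scheme, and the skip-inactive comment are illustrative assumptions.

```cuda
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

// One warp computes one 16x16 tile of C = A * B with Tensor Cores.
// A: M x K (half, row-major), B: K x N (half, row-major), C: M x N (float).
__global__ void wmma_tile_gemm(const half *A, const half *B, float *C,
                               int M, int N, int K) {
    // Hypothetical tiling: warp index along M/N picks the output tile.
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;
    wmma::fill_fragment(cFrag, 0.0f);

    for (int k = 0; k < K; k += 16) {
        // In a sparse operator, tiles corresponding to predicted-inactive
        // neurons would be skipped here instead of being loaded and multiplied.
        const half *aTile = A + warpM * 16 * K + k;
        const half *bTile = B + k * N + warpN * 16;
        wmma::load_matrix_sync(aFrag, aTile, K);
        wmma::load_matrix_sync(bFrag, bTile, N);
        wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);  // Tensor Core MMA
    }

    float *cTile = C + warpM * 16 * N + warpN * 16;
    wmma::store_matrix_sync(cTile, cFrag, N, wmma::mem_row_major);
}
```

Compared with CUDA-core FMA loops, this path keeps the prompt phase (and batched decoding) compute-bound work on the Tensor Cores, which is where the expected speedup for multi-batch serving would come from.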