TL;DR
To achieve both high accuracy and low latency on mobile devices, they proposed a block-punched pruning method that divides each layer into blocks and learns a different pruning pattern for each block. This speeds up inference while maintaining accuracy. To further improve performance, the GPU computes the convolutional layers while the CPU computes the other layers.
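A minimal sketch of the block-wise pruning idea, for illustration only. It assumes a layer is viewed as a 2-D weight matrix (rows = filters, columns = flattened input-channel x kernel positions) tiled into blocks of `block_rows` x `block_cols`. Inside each block, whole columns are zeroed, so every filter in that block drops weights at the same positions, while different blocks can keep different positions. The helper name `block_punched_prune`, the block sizes, `keep_ratio`, and the selection rule (keep the columns with the largest L2 norm) are all assumptions. The paper learns the patterns during training, which this magnitude rule does not reproduce.

```python
from typing import List

Matrix = List[List[float]]


def block_punched_prune(weights: Matrix, block_rows: int, block_cols: int,
                        keep_ratio: float) -> Matrix:
    """Return a pruned copy of `weights` (hypothetical helper).

    Within each block_rows x block_cols tile, rank the tile's columns by
    squared L2 norm and zero every column outside the top `keep_ratio`
    fraction (at least one column is always kept). Magnitude ranking is an
    assumed stand-in for the learned patterns described in the paper.
    """
    n_rows, n_cols = len(weights), len(weights[0])
    pruned = [row[:] for row in weights]  # leave the input untouched
    for r0 in range(0, n_rows, block_rows):
        rows = range(r0, min(r0 + block_rows, n_rows))
        for c0 in range(0, n_cols, block_cols):
            cols = list(range(c0, min(c0 + block_cols, n_cols)))
            # Score each column of this tile over the filters in the tile only.
            scores = {c: sum(weights[r][c] ** 2 for r in rows) for c in cols}
            n_keep = max(1, round(keep_ratio * len(cols)))
            keep = set(sorted(cols, key=lambda c: scores[c], reverse=True)[:n_keep])
            # "Punch" the same positions out of every filter in the tile.
            for c in cols:
                if c not in keep:
                    for r in rows:
                        pruned[r][c] = 0.0
    return pruned


# Usage: 4 filters x 4 positions, two row-blocks of 2 filters each.
w = [
    [0.9, 0.1, 0.5, 0.2],
    [0.8, 0.2, 0.4, 0.1],
    [0.1, 0.7, 0.2, 0.6],
    [0.2, 0.9, 0.1, 0.5],
]
for row in block_punched_prune(w, block_rows=2, block_cols=4, keep_ratio=0.5):
    print(row)
```

With these values, the first block keeps positions 0 and 2 and the second keeps positions 1 and 3, showing how each block ends up with its own pattern while the filters inside a block share one. Sharing positions within a block is what keeps the sparsity regular enough for a compiler to exploit, whereas unstructured pruning would scatter zeros arbitrarily.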
Paper URL
https://arxiv.org/abs/2009.05697
Submission Dates (yyyy/mm/dd)
2020/09/12
Authors and institutions
Yuxuan Cai, Hongjia Li, Geng Yuan, Wei Niu, Yanyu Li, Xulong Tang, Bin Ren, Yanzhi Wang
Methods
Results
Comments