TL;DR

To achieve both high accuracy and low latency on mobile devices, the authors propose a block-punched pruning method that divides each layer into blocks and learns a different pruning pattern for each block. This increases inference speed while maintaining accuracy. To further improve performance, the convolutional layers are computed on the GPU while the other layers are computed on the CPU.
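
To make the idea of per-block pruning patterns concrete, here is a minimal NumPy sketch. It rests on several assumptions that are not taken from the summary above: the weight tensor is laid out as (filters, channels, kernel height, kernel width), a block is a group of consecutive filters and channels, and the kept kernel positions are chosen by a simple magnitude score instead of being learned during training. The function name `block_punched_mask` and its parameters are hypothetical and chosen for illustration only.

```python
import numpy as np


def block_punched_mask(weights, block_filters, block_channels, keep_ratio):
    """Build a 0/1 mask in which every kernel inside a block shares one pattern.

    weights: array of shape (F, C, KH, KW) (assumed layout).
    keep_ratio: fraction of kernel positions kept in each block.
    NOTE: positions are picked by magnitude here; this is a stand-in for
    the learned patterns described in the text, not the authors' method.
    """
    F, C, KH, KW = weights.shape
    mask = np.zeros_like(weights)
    n_keep = max(1, int(round(keep_ratio * KH * KW)))

    for f0 in range(0, F, block_filters):
        for c0 in range(0, C, block_channels):
            block = weights[f0:f0 + block_filters, c0:c0 + block_channels]
            # Score each kernel position by its energy summed over the block.
            score = (block ** 2).sum(axis=(0, 1)).ravel()
            keep = np.argsort(score)[-n_keep:]
            pattern = np.zeros(KH * KW)
            pattern[keep] = 1.0
            # The same (KH, KW) pattern is broadcast to every kernel in the block.
            mask[f0:f0 + block_filters, c0:c0 + block_channels] = pattern.reshape(KH, KW)
    return mask


# Toy layer: 8 filters, 4 input channels, 3x3 kernels, split into 2x2 blocks.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 4, 3, 3))
mask = block_punched_mask(w, block_filters=4, block_channels=2, keep_ratio=1 / 3)
pruned = w * mask
```

Because all kernels in a block drop the same positions, the surviving weights keep a regular layout within each block, while different blocks are free to use different patterns. The block size is the knob in this sketch: larger blocks give more regularity, smaller blocks give more freedom in choosing what to prune.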

Figure: FPS vs. mAP on the MS COCO dataset.