When I use NNPACK with Caffe's convolution layer, I find that almost half of the time is spent transforming the output data:
profile = {total = 0.35187557300014305, input_transform = 0.032353789985791082, kernel_transform = 0.00099588599914568476, output_transform = 0.14845979897654615, block_multiplication = 0.1696860040538013}
It looks like you have a small input image with few input channels. Output transform time can't be reduced. You can get rid of the kernel transform time, though, by pre-computing kernel transforms as described in #82.
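To put the numbers from the posted profile in perspective, here is a quick sketch that computes each stage's share of the total. It only uses the values quoted in the question; `stage_shares` is a hypothetical helper name for illustration, not part of NNPACK or Caffe.

```python
# Profile values copied from the question above.
profile = {
    "total": 0.35187557300014305,
    "input_transform": 0.032353789985791082,
    "kernel_transform": 0.00099588599914568476,
    "output_transform": 0.14845979897654615,
    "block_multiplication": 0.1696860040538013,
}


def stage_shares(profile):
    """Return each stage's fraction of the reported total (hypothetical helper)."""
    total = profile["total"]
    return {name: t / total for name, t in profile.items() if name != "total"}


shares = stage_shares(profile)

# Print stages from most to least expensive.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:22s} {share:6.1%}")
```

For this run, the output transform comes out at roughly 42% of the total and block multiplication at roughly 48%, while the kernel transform is well under 1%. So pre-computing kernel transforms would likely save very little for this particular profile.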
Is there any way to reduce this time?