Hello, how large are these models? Are both models approximately the same size? The speed difference could be due to a difference in network architecture or network size.
Hi.
The numbers of weight parameters are almost the same, but the numbers of operations are different.
The Residual Network consists of:

- reshape: 2
- innerProduct: 3
- add: 20
- convolution: 43
- batchnorm: 43
- activation: 45

And the Squeeze-and-Excitation Network consists of:

- reshape: 2
- pooling: 19
- split: 19
- multiply: 19
- loadConstant: 38
- batchnorm: 40
- convolution: 41
- innerProduct: 41
- add: 77
- activation: 81
Could you tell me which operation seems to be the bottleneck on the A12(X)?
Thank you for your help.
Generally, convolution and inner product (dense) are the most time-consuming ops. The Squeeze-and-Excitation network has 41 dense layers compared to only 3 for the residual network, so one would expect it to take longer (obviously it also depends on the sizes of the dense and convolution layers).
Thank you!
I will check the speed of the dense layers on different hardware.
Hi.
I profiled both models with the Time Profiler and found that the SENet uses CPUFP16Engine while the ResNet uses MLNeuralNetworkEngine.
I think the SENet could be accelerated by the Neural Engine since it has many convolutions.
Could you tell me why the SENet uses CPUFP16Engine (I don't set usesCPUOnly), and how I can enable the Neural Engine for the SENet?
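For reference, this is roughly how I load the model (a minimal sketch; `SENet` is a placeholder for the Xcode auto-generated model class). As far as I understand, `.all` should already permit the Neural Engine:

```swift
import CoreML

// Request all compute units (CPU, GPU, and Neural Engine).
// `.all` is already the default; setting it explicitly just rules this out.
let config = MLModelConfiguration()
config.computeUnits = .all

// `SENet` is a placeholder for the auto-generated model class.
let model = try! SENet(configuration: config)
```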
I confirmed that the SENet also uses BNNEngine::convolution_kernel and inner_product_kernel_cpu.
Is there any HW acceleration for inner_product available on the A12?
And elementwise_kernel_cpu looks heavy, too. What is it? And can I do anything to reduce it?
Thanks.
@y-ich Hi, yes, you are right: Core ML uses the GPU to perform the broadcasted multiply instead of the NPU, which may be the real reason all models composed of SE modules run slower than on the CPU. I noticed this phenomenon while running MobileNetV3. I found MobileNetV3 runs 3x faster than MobileNetV2 on the CPU; however, when using the NPU, MobileNetV3 is 2x slower than MobileNetV2. Can anyone give advice on how to accelerate the SE module?
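In case anyone wants to reproduce the CPU-vs-NPU comparison, here is a minimal sketch of how the compute units can be pinned (`MobileNetV3` is a placeholder for the auto-generated model class):

```swift
import CoreML

// Force CPU-only execution for the baseline measurement.
let cpuConfig = MLModelConfiguration()
cpuConfig.computeUnits = .cpuOnly
let cpuModel = try! MobileNetV3(configuration: cpuConfig)

// Allow the GPU and Neural Engine for the comparison run.
let allConfig = MLModelConfiguration()
allConfig.computeUnits = .all
let accelModel = try! MobileNetV3(configuration: allConfig)
```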
I have determined that global pooling layers slow down the Conv2D layers around them. I guess the Neural Engine has no ability to compute global pooling, so heavy memory transfers between the CPU and the Neural Engine are needed. I hope that iOS 14 or the next-generation Neural Engine will solve this issue.
Hi.
I confirmed that the new iPad Air processes this type of model super fast. The A14's Neural Engine seems to have a unit for pooling. Thank you so much for your great job, Apple!
Hi.
This question is not about coremltools but about CoreML itself.
I am using both a Residual Network and a Squeeze-and-Excitation Network of the same size for the same purpose.
The Squeeze-and-Excitation Network is about 1/4 slower than the Residual Network on the A12X. I cannot figure out why it is so slow. Could you give me any advice?
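For context, this is roughly how I compare the two models' latency (a minimal sketch; the warm-up prediction is there so one-time compilation and engine selection are not timed):

```swift
import CoreML

// Average per-prediction latency over `runs` predictions.
// `input` must match the model's input description.
func averageLatency(of model: MLModel,
                    input: MLFeatureProvider,
                    runs: Int = 100) throws -> Double {
    _ = try model.prediction(from: input)  // warm-up run
    let start = CFAbsoluteTimeGetCurrent()
    for _ in 0..<runs {
        _ = try model.prediction(from: input)
    }
    return (CFAbsoluteTimeGetCurrent() - start) / Double(runs)
}
```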
Sorry for my vague question.