Hello, how large are these models? Are both models approximately the same size? The speed difference could be due to a difference in network architecture or network size.
Hi.
The numbers of weight parameters are almost the same, but the numbers of operations are different.
The Residual Network consists of:

- reshape: 2
- innerProduct: 3
- add: 20
- convolution: 43
- batchnorm: 43
- activation: 45

And the Squeeze-and-Excitation Network consists of:

- reshape: 2
- pooling: 19
- split: 19
- multiply: 19
- loadConstant: 38
- batchnorm: 40
- convolution: 41
- innerProduct: 41
- add: 77
- activation: 81
Could you tell me which operation seems to be the bottleneck on the A12(X)?
Thank you for your help.
Generally, convolution and inner product (dense) are the most time-consuming ops. The Squeeze-and-Excitation network has 41 dense layers compared to only 3 for the residual network, so one would expect it to take longer (obviously it also depends on the sizes of the dense and convolution layers).
Thank you!
I will check the speed of the dense layers on different hardware.
Hi.
I profiled both models with the Time Profiler and found that the SENet uses CPUFP16Engine while the ResNet uses MLNeuralNetworkEngine.
I think the SENet could be accelerated by the Neural Engine since it has many convolutions.
Could you tell me why the SENet uses CPUFP16Engine (I don't set usesCPUOnly), and how I can enable the Neural Engine for the SENet?
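For reference, this is roughly how I load the model (a minimal sketch; `SENet` is a placeholder for the Xcode auto-generated model class). As far as I understand, `.all` should already permit the Neural Engine:

```swift
import CoreML

// Request all compute units (CPU, GPU, and Neural Engine).
// `.all` is already the default; setting it explicitly just rules this out.
let config = MLModelConfiguration()
config.computeUnits = .all

// `SENet` is a placeholder for the auto-generated model class.
let model = try! SENet(configuration: config)
```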
I confirmed that the SENet also uses BNNEngine::convolution_kernel and inner_product_kernel_cpu.
Is there any HW acceleration for inner_product available on the A12?
And elementwise_kernel_cpu looks heavy, too. What is it? And can I do anything to reduce it?
Thanks.
@y-ich Hi, yes, you are right: Core ML uses the GPU to perform the broadcasted multiply instead of the NPU, which may be the real reason all models composed of SE modules run slower than on the CPU. I noticed this phenomenon while running MobileNetV3. I found MobileNetV3 runs 3x faster than MobileNetV2 on the CPU; however, when using the NPU, MobileNetV3 is 2x slower than MobileNetV2. Can anyone give advice on how to accelerate the SE module?
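In case anyone wants to reproduce the CPU-vs-NPU comparison, here is a minimal sketch of how the compute units can be pinned (`MobileNetV3` is a placeholder for the auto-generated model class):

```swift
import CoreML

// Force CPU-only execution for the baseline measurement.
let cpuConfig = MLModelConfiguration()
cpuConfig.computeUnits = .cpuOnly
let cpuModel = try! MobileNetV3(configuration: cpuConfig)

// Allow the GPU and Neural Engine for the comparison run.
let allConfig = MLModelConfiguration()
allConfig.computeUnits = .all
let accelModel = try! MobileNetV3(configuration: allConfig)
```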
I have determined that global pooling layers slow down the Conv2D layers around them. I guess the Neural Engine has no ability to compute global pooling, so heavy memory transfers between the CPU and the Neural Engine are needed. I hope that iOS 14 or the next-generation Neural Engine will solve this issue.
Hi.
I confirmed that the new iPad Air processes this type of model super fast. The A14's Neural Engine seems to have a unit for pooling. Thank you so much for your great job, Apple!
Hi.
This question is not about coremltools but about CoreML itself.
I am using both a Residual Network and a Squeeze-and-Excitation Network of the same size for the same purpose.
The Squeeze-and-Excitation Network is about 1/4 slower than the Residual Network on the A12X. I cannot figure out why it is so slow. Could you give me any advice?
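For context, this is roughly how I compare the two models' latency (a minimal sketch; the warm-up prediction is there so one-time compilation and engine selection are not timed):

```swift
import CoreML

// Average per-prediction latency over `runs` predictions.
// `input` must match the model's input description.
func averageLatency(of model: MLModel,
                    input: MLFeatureProvider,
                    runs: Int = 100) throws -> Double {
    _ = try model.prediction(from: input)  // warm-up run
    let start = CFAbsoluteTimeGetCurrent()
    for _ in 0..<runs {
        _ = try model.prediction(from: input)
    }
    return (CFAbsoluteTimeGetCurrent() - start) / Double(runs)
}
```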
Sorry for my vague question.