This appears to happen because of the broadcasted multiply that follows the squeeze-and-excitation layers.
@hollance oh.... any ways to accelerate this? Thank you.
The ANE seems to work fine with regular multiply layers, even if they do limited broadcasting, so you could try replacing the multiplyBroadcastable layer with a multiply layer (have not tried this yet myself).
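In case it helps, here is a rough sketch of that kind of model surgery with coremltools, editing the protobuf spec directly. This is untested and assumes a plain neuralNetwork spec; the file names are illustrative, and the rewritten model should be re-validated, since the plain multiply layer only supports limited broadcasting as noted above.

import coremltools as ct

model = ct.models.MLModel("MobileNetV3.mlmodel")  # illustrative file name
spec = model.get_spec()

for layer in spec.neuralNetwork.layers:
    # swap every multiplyBroadcastable layer for a plain multiply layer;
    # setting one field of the oneof automatically clears the other
    if layer.WhichOneof("layer") == "multiplyBroadcastable":
        layer.multiply.SetInParent()

ct.models.MLModel(spec).save("MobileNetV3_multiply.mlmodel")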
@hollance Wow, I saw your machinethink blog. Those are really detailed posts. Helpful! Great appreciation for that. May I ask you for some suggestions? Currently I'm using MobileNetV2 with Core ML. It's fast, at 3 ms on the NPU for binary classification. However, I still want to get some more acceleration. Any tips to speed it up? Or are there any deep models that run faster than MobileNetV2 on Core ML? Thank you.
@dragen1860 Well, I did write a blog post about this: https://machinethink.net/blog/mobile-architectures/ 😄
@hollance Yeah, I actually read your detailed blog already. From your blog, it seems MobileNetV2, MobileNetV3, and EfficientNet are the best choices. But currently only MobileNetV2 can be accelerated by the NPU; the models containing an SE module run slower on the NPU than on the CPU.
Indeed, I tried to replace the broadcasted multiply with a manual repeat followed by a multiply, as follows:
import torch
import torch.nn as nn
import torch.nn.functional as F

class h_sigmoid(nn.Module):
    # hard sigmoid as commonly defined for MobileNetV3: relu6(x + 3) / 6
    def forward(self, x):
        return F.relu6(x + 3) / 6

class SELayer(nn.Module):
    def __init__(self, channel, reduction=4, width=56):
        super(SELayer, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel),
            h_sigmoid()
        )
        self.width = width

    def forward(self, x):
        b, c, h, w = x.size()
        y = self.avg_pool(x).view(b, c)
        y = self.fc(y)
        y = y.view(b, c, 1, 1)
        # repeat to [b, c, h, w] so the multiply needs no broadcasting
        y = y.repeat(1, 1, self.width, self.width)
        # reshape both tensors to vectors, multiply elementwise, then restore the shape
        out = torch.mul(x.view(-1), y.view(-1))
        out = out.reshape(b, c, h, w)
        return out
But it still doesn't get accelerated! Any tips? Thank you.
The MobileNetV3 that I converted from TensorFlow (using tfcoreml) already uses plain "multiply" layers instead of "multiplyBroadcastable". The "multiply" layer already does a few limited kinds of broadcasting, so you don't need to manually duplicate the rows/columns of the tensor. That just makes everything slower. I'd need to dig deeper into exactly which layers are preventing the model from running on the ANE.
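For reference, one quick way to see which layer types a converted model actually ends up with is to trace the PyTorch module and print the layer types in the resulting spec. This is only a sketch: it assumes coremltools 5+ with the older neuralnetwork backend, and the input name and shape are illustrative.

import torch
import coremltools as ct

se = SELayer(channel=16, width=56).eval()  # the SELayer defined above
example = torch.rand(1, 16, 56, 56)
traced = torch.jit.trace(se, example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="x", shape=example.shape)],
    convert_to="neuralnetwork",  # assumption: the older NN backend, not mlprogram
)
# print each layer's name and which oneof layer type the converter emitted
for layer in mlmodel.get_spec().neuralNetwork.layers:
    print(layer.name, layer.WhichOneof("layer"))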
I trained MobileNetV3 on my dataset, then converted the trained model using coremltools. Everything worked well, except that when testing the model on an iPhone 12 Pro, it took 20+ milliseconds per inference, while MobileNetV2 took only 1 to 2 milliseconds. Any suggestions on how you solved this problem?
I used pretrained MobileNet V2 and V3 from PyTorch's Model Zoo.
@Sehaba95 For inference time debugging, a useful tool is Xcode with "Performance reports" and "Profile with instruments". See https://developer.apple.com/machine-learning/core-ml/ for details. Hope it helps!
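If you also want a quick comparison outside Xcode, a rough sketch like the following can time the same model under different compute units. It assumes coremltools 5+ running on an Apple Silicon Mac; the model file name, input name, and input shape are illustrative and need to match your own model.

import time
import numpy as np
import coremltools as ct

x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # illustrative input shape

for units in (ct.ComputeUnit.CPU_ONLY, ct.ComputeUnit.ALL):
    model = ct.models.MLModel("MobileNetV3.mlmodel", compute_units=units)
    model.predict({"x": x})  # warm-up run
    start = time.perf_counter()
    for _ in range(100):
        model.predict({"x": x})
    ms = (time.perf_counter() - start) / 100 * 1000
    print(units, f"{ms:.2f} ms per inference")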
Dear all, has anyone succeeded in running MobileNetV3 small/large on an iPhone with NPU/GPU/CPU acceleration? In my case, I found that MobileNetV3-small runs on the CPU with a 10 ms inference time, compared with 20 ms for MobileNetV2 on the CPU. However, MobileNetV3 on the NPU needs 11 ms, compared with 3 ms for MobileNetV2. I don't know why. Any tips? Thank you.