This appears to happen because of the broadcasted multiply that follows the squeeze-and-excitation layers.
@hollance oh.... any ways to accelerate this? Thank you.
The ANE seems to work fine with regular multiply layers, even if they do limited broadcasting, so you could try replacing the multiplyBroadcastable layer with a multiply layer (have not tried this yet myself).
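In case it helps, here is a rough sketch of that kind of model surgery with coremltools, editing the protobuf spec directly. This is untested and assumes a plain neuralNetwork spec; the file names are illustrative, and the rewritten model should be re-validated, since the plain multiply layer only supports limited broadcasting as noted above.

import coremltools as ct

model = ct.models.MLModel("MobileNetV3.mlmodel")  # illustrative file name
spec = model.get_spec()

for layer in spec.neuralNetwork.layers:
    # swap every multiplyBroadcastable layer for a plain multiply layer;
    # setting one field of the oneof automatically clears the other
    if layer.WhichOneof("layer") == "multiplyBroadcastable":
        layer.multiply.SetInParent()

ct.models.MLModel(spec).save("MobileNetV3_multiply.mlmodel")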
@hollance Wow, I saw your machinethink blog. Those are really detailed posts. Helpful! Great appreciation for that. May I ask you for some suggestions? Currently I'm using MobileNetV2 with Core ML. It's fast, at 3 ms on the NPU for binary classification. However, I still want to get some more acceleration. Any tips to speed it up? Or are there any deep models that run faster than MobileNetV2 on Core ML? Thank you.
@dragen1860 Well, I did write a blog post about this: https://machinethink.net/blog/mobile-architectures/ 😄
@hollance Yeah, I actually read your detailed blog already. From your blog, it seems MobileNetV2, MobileNetV3, and EfficientNet are the best choices. But currently only MobileNetV2 can be accelerated by the NPU; the models containing an SE module run slower on the NPU than on the CPU.
Indeed, I tried to replace the broadcasted multiply with a manual repeat followed by a multiply, as follows:
import torch
import torch.nn as nn
import torch.nn.functional as F

class h_sigmoid(nn.Module):
    # hard sigmoid as commonly defined for MobileNetV3: relu6(x + 3) / 6
    def forward(self, x):
        return F.relu6(x + 3) / 6

class SELayer(nn.Module):
    def __init__(self, channel, reduction=4, width=56):
        super(SELayer, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel),
            h_sigmoid()
        )
        self.width = width

    def forward(self, x):
        b, c, h, w = x.size()
        y = self.avg_pool(x).view(b, c)
        y = self.fc(y)
        y = y.view(b, c, 1, 1)
        # repeat to [b, c, h, w] so the multiply needs no broadcasting
        y = y.repeat(1, 1, self.width, self.width)
        # reshape both tensors to vectors, multiply elementwise, then restore the shape
        out = torch.mul(x.view(-1), y.view(-1))
        out = out.reshape(b, c, h, w)
        return out
But it still doesn't get accelerated! Any tips? Thank you.
The MobileNetV3 that I converted from TensorFlow (using tfcoreml) already uses plain "multiply" layers instead of "multiplyBroadcastable". The "multiply" layer already does a few limited kinds of broadcasting, so you don't need to manually duplicate the rows/columns of the tensor. That just makes everything slower. I'd need to dig deeper into exactly which layers are preventing the model from running on the ANE.
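For reference, one quick way to see which layer types a converted model actually ends up with is to trace the PyTorch module and print the layer types in the resulting spec. This is only a sketch: it assumes coremltools 5+ with the older neuralnetwork backend, and the input name and shape are illustrative.

import torch
import coremltools as ct

se = SELayer(channel=16, width=56).eval()  # the SELayer defined above
example = torch.rand(1, 16, 56, 56)
traced = torch.jit.trace(se, example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="x", shape=example.shape)],
    convert_to="neuralnetwork",  # assumption: the older NN backend, not mlprogram
)
# print each layer's name and which oneof layer type the converter emitted
for layer in mlmodel.get_spec().neuralNetwork.layers:
    print(layer.name, layer.WhichOneof("layer"))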
I trained MobileNetV3 on my dataset, then converted the trained model using coremltools. Everything worked well, except that when testing the model on an iPhone 12 Pro, it took 20+ milliseconds per inference, while MobileNetV2 took only 1 to 2 milliseconds. Any suggestions on how you solved this problem?
I used pretrained MobileNet V2 and V3 from PyTorch's Model Zoo.
@Sehaba95 For inference time debugging, a useful tool is Xcode with "Performance reports" and "Profile with instruments". See https://developer.apple.com/machine-learning/core-ml/ for details. Hope it helps!
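If you also want a quick comparison outside Xcode, a rough sketch like the following can time the same model under different compute units. It assumes coremltools 5+ running on an Apple Silicon Mac; the model file name, input name, and input shape are illustrative and need to match your own model.

import time
import numpy as np
import coremltools as ct

x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # illustrative input shape

for units in (ct.ComputeUnit.CPU_ONLY, ct.ComputeUnit.ALL):
    model = ct.models.MLModel("MobileNetV3.mlmodel", compute_units=units)
    model.predict({"x": x})  # warm-up run
    start = time.perf_counter()
    for _ in range(100):
        model.predict({"x": x})
    ms = (time.perf_counter() - start) / 100 * 1000
    print(units, f"{ms:.2f} ms per inference")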
Dear all, has anyone succeeded in running MobileNetV3 small/large on an iPhone with NPU/GPU/CPU acceleration? In my case, I found that MobileNetV3-small runs on the CPU with a 10 ms inference time, compared with 20 ms for MobileNetV2 on the CPU. However, MobileNetV3 on the NPU needs 11 ms, compared with 3 ms for MobileNetV2. I don't know why. Any tips? Thank you.