gathierry / FashionAI-KeyPointsDetectionOfApparel

FashionAI Key Points Detection using CPN model in Pytorch
Apache License 2.0
189 stars 61 forks

Is there any way to improve the detection speed or reduce the weight file size? #7

Closed Sibozhu closed 5 years ago

Sibozhu commented 5 years ago

Thank you guys so much for this amazing repo, it's very inspiring to me.

Currently, I'm working on a project doing keypoint detection on traffic cones (which are similar to clothes, but much simpler objects). With only one class and 7 named keypoints per traffic cone, my trained network's weight file is about 300MB and the inference speed is about 9 images per second on my 8GB GTX 1080. That suggests the architecture is too deep for my task and the detection speed is rather slow. The ideal detection speed would be 300 images per second on the same hardware.

Do you have any suggestions on how to modify the architecture to achieve that? Or is the Cascaded Pyramid Network just too heavyweight for the traffic cone task?

Thank you so much!

Best Regards, Sibo Zhu

gathierry commented 5 years ago

Hi, thanks for your feedback. If I understand correctly, you need a simpler network for your "real-time" task. I'm not sure CPN is the best choice in this scenario, but if you keep using it, here are some things to try:

- Use a lighter backbone such as ResNet18, Xception, or MobileNet.
- Use only the smallest feature map in the cascaded part (currently feature maps of three different scales are used).
- Reduce the number of layers in the cascaded part, which may help a little.

As far as I know, the most efficient approaches for keypoint detection are OpenPose and AlphaPose. You can take a look at those methods if CPN cannot achieve your goal.

Sibozhu commented 5 years ago

Thank you so much for the rapid reply! Will definitely try those approaches.

I hope I can talk with you more in the future, but right now I'll just close this issue.

Sibozhu commented 5 years ago

Thank you so much for the above advice on adjusting the network. I changed the backbone from the original ResNet152 to ResNet50, and the inference speed is roughly 1.5 times faster, so I decided to go further in tuning your network.

  1. In `cascade_pyramid_network.py`:

```python
def GlobalNet18(config, pretrained=False):
    return GlobalNet(config, Bottleneck, [2, 2, 2, 2],
                     torchvision.models.resnet18(pretrained=pretrained))
```

To change the backbone to ResNet18, it seems I couldn't just replace the torchvision module name from `resnet152` to `resnet18` as in the code above; otherwise, I get the error message below:

```
RuntimeError: Given groups=1, weight of size [256, 2048, 1, 1], expected input[4, 512, 2, 2] to have 2048 channels, but got 512 channels instead
```

I can see from your code that each backbone comes with a customized set of `num_blocks`. With all due respect, how are those generated or calculated? For using ResNet18 as the backbone, would calculating the right `num_blocks` solve the error above once and for all?

Update: by changing lines 53-56 of `cascade_pyramid_network.py` to:

```python
self.latlayer1 = nn.Conv2d(512, 256, kernel_size=1, stride=1, padding=0)
self.latlayer2 = nn.Conv2d(256, 256, kernel_size=1, stride=1, padding=0)
self.latlayer3 = nn.Conv2d(128, 256, kernel_size=1, stride=1, padding=0)
self.latlayer4 = nn.Conv2d(64, 256, kernel_size=1, stride=1, padding=0)
```

and keeping `num_blocks` for ResNet18 at [2, 2, 2, 2], the training process works correctly. I'm getting a prediction speed of 20 fps and accuracy of about 60%. I'm not sure if that is the correct adjustment, though.

  2. Another thing: since most of my data is about 80x80 pictures, in theory, changing `self.img_max_size` in `config.py` from 512 to 96 should make inference significantly faster. But in practice, I don't see any difference when I change the max image size. I read your code and found that all pictures fed into prediction are scaled to `self.img_max_size`, but I don't understand why scaling the image size down doesn't make prediction faster.

  3. Also, for chopping off all the feature maps but the first one, can you give me some suggestions on which part of your code I should modify? Is it the top layers in the forward function of RefineNet in `cascade_pyramid_network.py`? By changing `return self.output(torch.cat([p2, p3, p4, p5], dim=1))` to `return self.output(torch.cat([p2], dim=1))`, the prediction speed gets 1 fps faster. I'm not sure if this is the approach you mentioned above.

Thank you so much for your great work!

Best Regards, Sibo Zhu (I know that, for the benefit of future readers of this repo, we should keep the conversation on GitHub, but since my questions might be complicated to explain, I'm more than happy to contact you via WeChat in Chinese; my WeChat id is zhusibo3. Afterward, I'll post the solutions to my questions above here if they get solved.)

gathierry commented 5 years ago
  1. I think your modification is correct. The first parameter of Conv2d is the input channel count, so it should match the channels of the ResNet feature maps. I've never tested ResNet18, but it seems its channel numbers are different from ResNet50 or ResNet152. In fact, this global net is modified from the official ResNet implementation.
  2. I think you are right about inference speed vs. input scale. I don't understand this phenomenon either. Are you sure the images are actually scaled to 96?
  3. For example, in RefineNet:
    def forward(self, p2, p3, p4, p5):
        p4 = self.bottleneck4(p4)
        p5 = self.bottleneck5(p5)
        return self.output(torch.cat([p4, p5], dim=1))

    Then the forward will only pass the two smallest feature maps. You can even try to simplify the GlobalNet by removing the small-stride (high-resolution) features:

        # Bottom-up
        c1 = F.relu(self.bn1(self.conv1(x)))
        c1 = F.max_pool2d(c1, kernel_size=3, stride=2, padding=1)
        c2 = self.layer1(c1)
        c3 = self.layer2(c2)
        c4 = self.layer3(c3)
        c5 = self.layer4(c4)
        # Top-down
        p5 = self.latlayer1(c5)
        p4 = self._upsample_add(p5, self.latlayer2(c4))
        p4 = self.toplayer1(p4)
        return p4, p5

    Note that the output size is changed in this case and you might need to modify the decoder part as well.