Hi, thanks for your feedback. If I understood correctly, you need a simpler network for your "real-time" task. I'm not sure CPN is the best choice in this scenario, but if you keep using it, here are a few things you can try. First, you can use a lighter backbone such as ResNet18, Xception or MobileNet. Second, try to use only the smallest feature map in the cascaded part (currently feature maps of three different scales are used). You can also reduce the number of layers in the cascaded part, which may help a little bit. As far as I know, the most efficient approaches for keypoint detection are OpenPose and AlphaPose. You can take a look at those methods if CPN cannot achieve your goal.
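If you want a quick sense of how much the backbone alone costs, you can benchmark the torchvision models directly before touching CPN. This is just a rough sketch (the input size and iteration count are arbitrary, and the timings are only indicative):

import time
import torch
import torchvision

# Rough comparison of backbone forward-pass cost on a dummy input.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
dummy = torch.randn(1, 3, 256, 256, device=device)
for name in ['resnet152', 'resnet18']:
    backbone = getattr(torchvision.models, name)(pretrained=False).to(device).eval()
    with torch.no_grad():
        backbone(dummy)  # warm-up
        if device == 'cuda':
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(20):
            backbone(dummy)
        if device == 'cuda':
            torch.cuda.synchronize()
    print(name, '%.4f s per forward pass' % ((time.time() - start) / 20))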
Thank you so much for the rapid reply! Will definitely try those approaches.
I hope I can talk with you more in the future, but right now I'll just close this issue.
Thank you so much for the above advice on the network adjustment. I changed the backbone from the original ResNet152 to ResNet50, and the inference speed is roughly 1.5 times faster, so I decided to go further in tuning your network.
def GlobalNet18(config, pretrained=False):
    return GlobalNet(config, Bottleneck, [2, 2, 2, 2], torchvision.models.resnet18(pretrained=pretrained))
For changing the backbone to ResNet18, it seems I can't just replace the torchvision model name from resnet152 to resnet18 as in the code above; otherwise I get the error message below:
RuntimeError: Given groups=1, weight of size [256, 2048, 1, 1], expected input[4, 512, 2, 2] to have 2048 channels, but got 512 channels instead
I can see from your code that each backbone comes with its own customized "num_blocks". With all due respect, how is that generated or calculated? If I use ResNet18 as the backbone, would calculating the right num_blocks solve the error above once and for all?
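Digging into this a bit, the 2048-vs-512 mismatch seems to come from the block type rather than from num_blocks: ResNet50/101/152 are built from Bottleneck blocks (expansion 4), so the last stage outputs 2048 channels, while ResNet18/34 use BasicBlock (expansion 1) and end at 512 channels. A quick check with plain torchvision models (just a sketch, not your GlobalNet):

import torch
import torchvision

x = torch.randn(1, 3, 224, 224)
for name in ['resnet18', 'resnet50']:
    m = getattr(torchvision.models, name)(pretrained=False).eval()
    with torch.no_grad():
        # Run the stem, then record the output channels of the four residual stages.
        c = m.maxpool(m.relu(m.bn1(m.conv1(x))))
        channels = []
        for stage in [m.layer1, m.layer2, m.layer3, m.layer4]:
            c = stage(c)
            channels.append(c.shape[1])
    print(name, channels)  # resnet18 -> [64, 128, 256, 512], resnet50 -> [256, 512, 1024, 2048]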
Update: by changing lines 53-56 of "cascade_pyramid_network.py" to:
self.latlayer1 = nn.Conv2d(512, 256, kernel_size=1, stride=1, padding=0)
self.latlayer2 = nn.Conv2d(256, 256, kernel_size=1, stride=1, padding=0)
self.latlayer3 = nn.Conv2d(128, 256, kernel_size=1, stride=1, padding=0)
self.latlayer4 = nn.Conv2d(64, 256, kernel_size=1, stride=1, padding=0)
and keeping num_blocks of resnet18 at [2, 2, 2, 2], the training process works correctly. I get a prediction speed of 20 fps and an accuracy of about 60%. I'm not sure if that is the correct adjustment, though.
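If it helps, a way to avoid hardcoding those four numbers for every backbone could be to derive them from the block's expansion factor. make_lateral_layers below is only an illustrative helper I made up, assuming the block class is available as in your GlobalNet(config, block, num_blocks, ...) constructor:

import torch.nn as nn
from torchvision.models.resnet import BasicBlock, Bottleneck

def make_lateral_layers(block, out_channels=256):
    # Stage outputs of a torchvision-style ResNet are expansion * [64, 128, 256, 512]:
    # BasicBlock.expansion == 1 (ResNet18/34), Bottleneck.expansion == 4 (ResNet50/101/152).
    widths = [512, 256, 128, 64]  # c5, c4, c3, c2 in that order, matching latlayer1..4
    return nn.ModuleList(
        nn.Conv2d(w * block.expansion, out_channels, kernel_size=1, stride=1, padding=0)
        for w in widths
    )

# make_lateral_layers(BasicBlock) reproduces the 512/256/128/64 -> 256 convs above, and
# make_lateral_layers(Bottleneck) gives back the original 2048/1024/512/256 -> 256 convs.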
Another thing: since most of my data is about 80*80 pictures, theoretically, changing self.img_max_size in "config.py" from 512 to 96 should make inference significantly faster. But in practice I don't see any difference when I change the max image size. I read your code and found that all pictures fed into prediction are scaled to self.img_max_size, but I don't understand why scaling the image size down doesn't make prediction faster.
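One possibility I want to rule out is that my timing is dominated by CPU-side pre/post-processing rather than by the network itself; GPU calls are asynchronous, so without torch.cuda.synchronize() the measured time may not reflect the forward pass at all. Here is a rough sketch of how I plan to time only the forward pass (model stands in for the loaded network, and a CUDA GPU is assumed):

import time
import torch

def time_forward(model, size, runs=50, device='cuda'):
    # Times only the network forward pass at a given square input resolution.
    x = torch.randn(1, 3, size, size, device=device)
    model = model.to(device).eval()
    with torch.no_grad():
        model(x)  # warm-up
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(runs):
            model(x)
        torch.cuda.synchronize()
    return (time.time() - start) / runs

# Compare e.g. time_forward(model, 512) with time_forward(model, 96); if they are almost
# identical, the bottleneck is probably outside the forward pass.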
Also, for chopping off all the feature maps but the first one, can you give me some suggestions on which part of your code I should go to? Is it the three top layers in the forward function of RefineNet in "cascade_pyramid_network.py"? By changing:
return self.output(torch.cat([p2, p3, p4, p5], dim=1))
to
return self.output(torch.cat([p2], dim=1))
prediction gets about 1 fps faster. I'm not sure if this is the approach you mentioned above.
Thank you so much for your great work!
Best Regards, Sibo Zhu (I know that for the convenience of future readers of this repo, we should keep the whole conversation on GitHub, but since my questions might be complicated to explain, I'm more than happy to talk with you on WeChat in Chinese; my WeChat id is zhusibo3. Afterwards, I'll post the solutions to my questions above here once they are solved.)
def forward(self, p2, p3, p4, p5):
    p4 = self.bottleneck4(p4)
    p5 = self.bottleneck5(p5)
    return self.output(torch.cat([p4, p5], dim=1))
Then the forward will only pass the two smallest feature maps. You can even try to simplify the GlobalNet by removing small-strided features.
def forward(self, x):
    # Bottom-up
    c1 = F.relu(self.bn1(self.conv1(x)))
    c1 = F.max_pool2d(c1, kernel_size=3, stride=2, padding=1)
    c2 = self.layer1(c1)
    c3 = self.layer2(c2)
    c4 = self.layer3(c3)
    c5 = self.layer4(c4)
    # Top-down (only c4 and c5 are used)
    p5 = self.latlayer1(c5)
    p4 = self._upsample_add(p5, self.latlayer2(c4))
    p4 = self.toplayer1(p4)
    return p4, p5
Note that the output size is changed in this case and you might need to modify the decoder part as well.
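Roughly, the matching decoder change could look like the sketch below. This is only an illustration of the idea, not the exact code in this repo; it assumes 256-channel pyramid features and collapses the cascaded bottlenecks into single convs:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoScaleRefineNet(nn.Module):
    # Hypothetical slimmed-down decoder that only consumes the two smallest pyramid maps.
    def __init__(self, num_keypoints, feat_channels=256):
        super().__init__()
        self.bottleneck4 = nn.Conv2d(feat_channels, feat_channels, kernel_size=3, padding=1)
        self.bottleneck5 = nn.Conv2d(feat_channels, feat_channels, kernel_size=3, padding=1)
        # The concat is now 2 * feat_channels wide instead of 4 * feat_channels.
        self.output = nn.Conv2d(2 * feat_channels, num_keypoints, kernel_size=3, padding=1)

    def forward(self, p4, p5):
        p4 = self.bottleneck4(p4)
        # p5 is spatially smaller than p4, so upsample it before concatenating.
        p5 = F.interpolate(self.bottleneck5(p5), size=p4.shape[2:], mode='bilinear', align_corners=False)
        return self.output(torch.cat([p4, p5], dim=1))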
Thank you guys so much for this amazing repo, it's very inspiring to me.
Currently, I'm working on a project doing keypoint detection on traffic cones (which are similar to clothes, but the objects are much simpler). With only one class and 7 named keypoints per traffic cone, my trained network is about 300 MB and the inference speed is about 9 images per second on my GTX 1080 with 8 GB of memory. That suggests the architecture is too deep here, and the detection speed is kind of slow. The ideal detection speed would be 300 images per second on the same hardware.
Do you have any suggestions on how to modify the architecture to achieve that? Or maybe the Cascaded Pyramid Network is just too fancy for the traffic cone task?
Thank you so much!
Best Regards, Sibo Zhu