bmartacho / UniPose

We propose UniPose, a unified framework for human pose estimation, based on our “Waterfall” Atrous Spatial Pooling architecture, that achieves state-of-the-art results on several pose estimation metrics. Current pose estimation methods utilizing standard CNN architectures heavily rely on statistical postprocessing or predefined anchor poses for joint localization. UniPose incorporates contextual segmentation and joint localization to estimate the human pose in a single stage, with high accuracy, without relying on statistical postprocessing methods. The Waterfall module in UniPose leverages the efficiency of progressive filtering in the cascade architecture, while maintaining multi-scale fields-of-view comparable to spatial pyramid configurations. Additionally, our method is extended to UniPose-LSTM for multi-frame processing and achieves state-of-the-art results for temporal pose estimation in video. Our results on multiple datasets demonstrate that UniPose, with a ResNet backbone and Waterfall module, is a robust and efficient architecture for pose estimation, obtaining state-of-the-art results in single person pose detection for both single images and videos.

Number of classes in last_conv of decoder #12

Closed HartleyTW closed 3 years ago

HartleyTW commented 3 years ago

Hi,

Thanks for this code, it's really useful.

I have a quick question about the last convolution in the decoder (labelled below). Its output channel count is num_classes+1+5. What is the extra 5 for? Removing it allows the LSP and MPII models to be loaded correctly; leaving it in gives the same error that issue #6 highlighted.

self.last_conv = nn.Sequential(
    nn.Conv2d(304, 256, kernel_size=3, stride=1, padding=1, bias=False),
    BatchNorm(256),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1, bias=False),
    BatchNorm(256),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Conv2d(256, num_classes+1+5, kernel_size=1, stride=1),  # <-- THIS LINE
)

bmartacho commented 3 years ago

Thank you for the feedback. The addition of 5 refers to the extraction of extra keypoints for the bounding box surrounding the person. For the case of simple keypoint estimation, please remove the addition of 5.
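
For anyone hitting the same loading error, here is a minimal sketch of why the extra 5 matters. It uses plain Python with no PyTorch dependency and relies only on the fact that the weight of `nn.Conv2d(256, out_channels, kernel_size=1)` has shape `(out_channels, 256, 1, 1)`. The helper names (`head_out_channels`, `final_conv_weight_shape`, `matches_checkpoint`) and the `with_bbox` flag are hypothetical and not part of the UniPose repository. The `+1` is kept exactly as it appears in the original line.

```python
# Hypothetical helpers for illustration; none of these exist in the UniPose repo.

BBOX_EXTRA_CHANNELS = 5  # the "+5" from the issue: extra bounding-box keypoints


def head_out_channels(num_classes: int, with_bbox: bool = False) -> int:
    """Output channels of the decoder's final 1x1 convolution."""
    base = num_classes + 1  # the "+1" is kept as written in the original line
    return base + BBOX_EXTRA_CHANNELS if with_bbox else base


def final_conv_weight_shape(num_classes: int, with_bbox: bool = False) -> tuple:
    """Weight shape of nn.Conv2d(256, out_channels, kernel_size=1)."""
    return (head_out_channels(num_classes, with_bbox), 256, 1, 1)


def matches_checkpoint(ckpt_shape, num_classes: int, with_bbox: bool = False) -> bool:
    """True if a checkpoint's final-conv weight shape fits the configured head."""
    return tuple(ckpt_shape) == final_conv_weight_shape(num_classes, with_bbox)
```

As an illustration, with a hypothetical `num_classes = 14`, a checkpoint saved without the bounding-box channels stores a final weight of shape `(15, 256, 1, 1)`, while a model built with the `+5` expects `(20, 256, 1, 1)`. That size mismatch is consistent with the loading error described above, and dropping the `+5` makes the shapes agree.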