HRNet

"Deep High-Resolution Representation Learning for Human Pose Estimation"

Introduction
Method
Experiments
Thoughts
References

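Introduction

Paper: CVPR 2019, University of Science and Technology of China (USTC) and Microsoft Research Asia
Code: github
Pose estimation task: detect the human body keypoints in an image; also called keypoint detection.
Problem addressed: how can a high-resolution feature representation be obtained in order to produce accurate heatmaps? Instead of recovering high resolution from a low-resolution representation, HRNet maintains the high-resolution representation throughout the network while fusing in low-resolution features, so fine spatial detail is not lost.
Main contributions:
- high-resolution representation learning
- repeated multi-scale fusion/aggregation
Related work:
- Traditional methods: probabilistic graphical models; pictorial structure models.
- Deep learning: regress keypoint positions directly; or generate keypoint heatmaps and take the location with the highest response as the keypoint.
- Pipeline: in general, a backbone first extracts features, HRNet then produces the high-resolution representation, and finally the keypoint heatmaps are regressed (multi-scale fusion, without intermediate supervision).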
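
The heatmap approach in the related-work list picks the highest-scoring location as the keypoint. A minimal sketch in plain Python, assuming a heatmap stored as a list of rows; `decode_keypoint` is a hypothetical helper name, and real implementations usually add sub-pixel refinement on top of the plain argmax:

```python
def decode_keypoint(heatmap):
    """Return (x, y, score) of the highest-scoring cell in a 2D heatmap."""
    best = (0, 0, heatmap[0][0])
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if score > best[2]:
                best = (x, y, score)
    return best


# Toy 3x4 heatmap whose peak sits at column 2, row 1.
toy = [
    [0.0, 0.1, 0.2, 0.0],
    [0.1, 0.4, 0.9, 0.3],
    [0.0, 0.2, 0.3, 0.1],
]
print(decode_keypoint(toy))  # (2, 1, 0.9)
```

One heatmap is predicted per keypoint type, so the same decoding runs once per channel.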

Method

Network architecture:

It generates reliable high-resolution representations through repeatedly fusing the representations produced by the high-to-low subnetworks.

:arrow_up:: nearest-neighbor upsampling + 1x1 conv; :arrow_down:: 3x3 conv, stride=2; :arrow_right:: identity (the feature map is passed through unchanged);

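Fusion of feature maps at different scales: each map is resampled to the target resolution and the results are summed element-wise.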

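Low-resolution feature maps are used repeatedly to strengthen the high-resolution representation. The final feature map therefore keeps its high resolution and also carries multi-scale information.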
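
A rough sketch of one exchange between two branches, under stated assumptions: single-channel maps stored as lists of rows, fusion by element-wise sum after resampling (as in the paper's exchange unit), and a fixed uniform 3x3 kernel standing in for learned weights. The 1x1 conv that aligns channel counts after upsampling is omitted because there is only one channel. All function names are hypothetical.

```python
def upsample_nearest(fmap, factor=2):
    """Nearest-neighbor upsampling: repeat every cell `factor` times per axis."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def downsample_conv3x3_s2(fmap, weight=1.0 / 9.0):
    """3x3 convolution, stride 2, zero padding 1, with a uniform kernel.

    A trained network learns these weights; the uniform kernel is a stand-in.
    """
    h, w = len(fmap), len(fmap[0])
    out = []
    for cy in range(0, h, 2):          # kernel centers along rows
        out_row = []
        for cx in range(0, w, 2):      # kernel centers along columns
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    y, x = cy + dy, cx + dx
                    if 0 <= y < h and 0 <= x < w:   # zero padding outside
                        total += weight * fmap[y][x]
            out_row.append(total)
        out.append(out_row)
    return out


def add(a, b):
    """Element-wise sum of two maps of identical shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def fuse(high, low):
    """One exchange: each branch output sums its own map and the resampled other."""
    new_high = add(high, upsample_nearest(low))
    new_low = add(low, downsample_conv3x3_s2(high))
    return new_high, new_low


high = [[1.0] * 4 for _ in range(4)]   # 4x4 high-resolution branch
low = [[10.0] * 2 for _ in range(2)]   # 2x2 low-resolution branch
new_high, new_low = fuse(high, low)
print(len(new_high), len(new_high[0]))  # 4 4
print(len(new_low), len(new_low[0]))    # 2 2
```

Both branches keep their own resolution after the exchange, which is the point of the parallel design: information flows across scales without the high-resolution map ever being discarded.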

parallel high-to-low resolution subnetworks
multi-resolution subnetworks (multi-scale fusion)

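Loss function

As in crowd counting, the ground-truth heatmap is obtained by smoothing each keypoint location with a Gaussian kernel.

The difference between the ground truth and the regressed heatmap is measured with the MSE (mean squared error) loss.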
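
The two steps above can be sketched in plain Python. This is a minimal illustration, not the authors' code: `gaussian_heatmap` and `mse` are hypothetical names, the heatmap is a list of rows, and `sigma=1.0` is an assumed default.

```python
import math


def gaussian_heatmap(width, height, cx, cy, sigma=1.0):
    """Unnormalized 2D Gaussian centered on keypoint (cx, cy); the peak is 1."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def mse(pred, target):
    """Mean squared error over all heatmap cells."""
    n = len(pred) * len(pred[0])
    return sum((p - t) ** 2
               for rp, rt in zip(pred, target)
               for p, t in zip(rp, rt)) / n


target = gaussian_heatmap(width=8, height=6, cx=3, cy=2, sigma=1.0)
perfect = [row[:] for row in target]       # prediction identical to the target
blank = [[0.0] * 8 for _ in range(6)]      # prediction that misses the keypoint
print(mse(perfect, target))                # 0.0
print(mse(blank, target) > 0.0)            # True
```

In practice one such target is built per keypoint, and the loss is averaged over all keypoint channels.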

Experiments

Datasets: COCO Keypoint Detection Dataset, MPII Human Pose Estimation, Pose Tracking Dataset
Results:

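Effect of feature-map resolution on keypoint prediction: heatmaps estimated from the highest-resolution feature map give the best results, and accuracy drops as the resolution decreases.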

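Effect of input image size: HRNet's improvement is largest on small 128x96 inputs, which supports the claim that high-resolution representations help with small targets.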

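Thoughts :thinking:

:question: The paper's explanation of how the exchange unit corresponds to multi-scale fusion is hard to follow.

:dart: take-home-message

Use low-resolution feature maps to supplement the image information.
Pass multi-scale information between branches at the same depth.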

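:sparkles: The architecture is simple and efficient, and can be adapted to the training objective:

Pose estimation: output only the high-resolution feature map.
Semantic segmentation, face alignment: the last stage outputs the feature maps at all resolutions; the low-resolution maps are upsampled and concatenated with the high-resolution map, then a 1x1 convolution and a softmax layer produce the segmentation prediction.
Image classification: the feature maps at all resolutions pass through bottleneck layers that double their channel counts; starting from the high-resolution map, each map is downsampled by a strided conv and added element-wise to the next lower-resolution map; a 1x1 convolution then doubles the channels (1024->2048), and global average pooling feeds the classifier.
Object detection: the concatenated feature map from the segmentation head is average-pooled at several scales to produce representations at different levels, and a 1x1 convolution on each forms the feature pyramid.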
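
A shape-level sketch of the segmentation-style head described above: upsample every branch to the highest resolution and stack the channels. Assumptions are labeled: tensors are nested lists shaped `[channels][height][width]`, nearest-neighbor upsampling is used only to keep the code short, and the branch widths 32/64/128/256 are an example configuration. All function names are hypothetical.

```python
def make_branch(channels, size, value):
    """Constant-valued feature tensor of shape [channels][size][size]."""
    return [[[value] * size for _ in range(size)] for _ in range(channels)]


def upsample_nearest(channel, factor):
    """Repeat every cell of one 2D channel `factor` times along each axis."""
    out = []
    for row in channel:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def concat_multires(branches):
    """Upsample each branch to the first branch's resolution, then stack channels."""
    target = len(branches[0][0])            # height of the highest-resolution branch
    merged = []
    for branch in branches:
        factor = target // len(branch[0])   # 1 for the high-resolution branch
        merged.extend(upsample_nearest(ch, factor) for ch in branch)
    return merged


# Four branches, highest resolution first (example widths and sizes only).
specs = [(32, 16), (64, 8), (128, 4), (256, 2)]
branches = [make_branch(c, s, float(i)) for i, (c, s) in enumerate(specs)]
merged = concat_multires(branches)
print(len(merged), len(merged[0]), len(merged[0][0]))  # 480 16 16
```

The merged tensor has the sum of all branch widths as its channel count, which is why a 1x1 convolution follows it to mix the channels before prediction.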

:sparkles: The multi-scale fusion scheme could be made richer and more interpretable, e.g. semantic-guided or weight-based fusion.

References :books:

Paper reading: HRNetV1, HRNetV2, HRNetV2p

"Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs"

"Multi-scale structure-aware network for human pose estimation"

"Fast multi-person pose estimation using pose residual network"

"Pose partition networks for multi-person pose estimation"

"Pose proposal networks"

"Integral human pose regression"

"Deeply learned compositional models for human pose estimation"

HigherHRNet

"Bottom-Up Higher-Resolution Networks for Multi-Person Pose Estimation"

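Introduction

Paper: CoRR
Code: github
Problem addressed: scale variation in bottom-up pose estimation. Small persons need larger feature maps for accurate prediction. (In top-down methods, the scale problem is handled at the object-detection stage.)
Main contributions:
- HRNet + deconvolution :arrow_right: high-quality, scale-aware heatmaps
- multi-resolution supervision strategy: the loss is the sum of the losses over the heatmaps at each resolution.
- multi-resolution heatmap aggregation strategy: at inference, the heatmaps predicted at different resolutions are upsampled to a common resolution with bilinear interpolation and averaged.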
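
The last two contributions can be sketched as follows. This is an illustration under assumptions, not the released code: heatmaps are square lists of rows, resolutions differ by integer factors, and nearest-neighbor resampling is used only to keep the example short (any interpolation could be substituted). `multires_loss` and `aggregate` are hypothetical names.

```python
def mse(pred, target):
    """Mean squared error over all cells of one heatmap."""
    n = len(pred) * len(pred[0])
    return sum((p - t) ** 2
               for rp, rt in zip(pred, target)
               for p, t in zip(rp, rt)) / n


def multires_loss(preds, targets):
    """Training: sum the per-resolution losses; preds[i] matches targets[i] in shape."""
    return sum(mse(p, t) for p, t in zip(preds, targets))


def upsample_nearest(hm, factor):
    """Repeat every cell `factor` times along each axis."""
    out = []
    for row in hm:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def aggregate(preds):
    """Inference: upsample every heatmap to the largest size and average them."""
    size = max(len(p) for p in preds)
    ups = [upsample_nearest(p, size // len(p)) for p in preds]
    return [[sum(u[y][x] for u in ups) / len(ups) for x in range(size)]
            for y in range(size)]


small = [[0.0, 1.0],
         [0.0, 0.0]]                    # coarser prediction
large = [[0.0, 0.0, 1.0, 0.5],
         [0.0, 0.0, 0.5, 0.0],
         [0.0, 0.0, 0.0, 0.0],
         [0.0, 0.0, 0.0, 0.0]]          # finer prediction
zeros = [[[0.0] * 2 for _ in range(2)], [[0.0] * 4 for _ in range(4)]]

print(multires_loss([small, large], zeros))   # 0.34375
print(aggregate([small, large])[0])           # [0.0, 0.0, 1.0, 0.75]
```

Supervising every resolution lets each output specialize by person scale, while averaging at inference combines them into a single scale-aware heatmap.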
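Related work:
- top-down methods: given an image, first detect the persons, then detect the keypoints inside each detection box, e.g. HRNet.
- bottom-up methods: given an image, first detect all keypoints, then group them into individual persons, e.g. HigherHRNet, OpenPose, PersonLab.
- High-resolution feature maps: encoder-decoder, dilated convolution, deconvolution.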

Method

Network architecture:

the input to our deconvolution module is the concatenation of the feature maps and the predicted heatmaps from either HRNet or previous deconvolution modules.

The feature map produced by HRNet is upsampled 2x by a 4x4 deconvolution, then passed through 4 residual blocks to output the heatmaps.
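
A quick arithmetic check of the deconvolution module, as a sketch. The 4x4 kernel and 2x upsampling come from the notes above; `stride=2` and `padding=1` are assumed values that make a 4x4 kernel double the size exactly, and the channel and keypoint counts in the usage lines are example numbers. Function names are hypothetical.

```python
def deconv_output_size(size, kernel=4, stride=2, padding=1):
    """Output length of a transposed convolution (no output padding, dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel


def deconv_input_channels(feature_channels, num_heatmaps):
    """Channel count after concatenating feature maps with predicted heatmaps."""
    return feature_channels + num_heatmaps


# A 512-pixel input gives a 128-pixel map at 1/4 scale; the deconvolution
# brings it to 256 pixels, i.e. 1/2 scale.
print(deconv_output_size(128))        # 256
# Example: 32 feature channels concatenated with 17 keypoint heatmaps.
print(deconv_input_channels(32, 17))  # 49
```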

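Experiments

Dataset: COCO Keypoint Detection Dataset.
Results: state of the art, but with nearly 3x the parameters of the bottom-up HRNet.
- HRNet vs. HigherHRNet: keypoint detection on small persons improves, which indicates that larger feature maps are better for keypoint prediction; using the multi-resolution loss together with the aggregation strategy improves small-person performance without degrading large-person performance.
- Ablation of the network modules: deconvolution (no comparison with other upsampling methods is given); feature-map concatenation (the need for this concatenation is not clearly explained, it simply improves the score); aggregation of feature maps at different resolutions.
- Input image size: compared with HRNet, HigherHRNet is less sensitive to changes in input image size.
- Training image size: a size that is too large hurts large persons, so it is another hyperparameter to tune.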

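Thoughts :thinking:

:question: Novelty

:dart: take-home-message

The use of deconvolution

:sparkles: The work mostly assembles existing network components, but the paper is well written and the experiments are thorough; it is a good model for paper writing (posing the problem, solving it, designing the experiments, etc.).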

References :books:

"Searching for efficient multi-scale architectures for dense image prediction"
