Closed wahrheit-git closed 4 years ago
l_p performing worse than l_h is not surprising at all, as heatmaps are 2D feature maps (in the form of H*W matrices) while keypoint coordinates are just 2 scalars like (x, y). Training the model using target heatmaps exploits much richer spatial information than training the model only using target keypoint coordinates, which is why many keypoint regression models use heatmaps.
\beta is set to balance the two loss terms, whose magnitudes differ greatly. Heatmap intensities are normalized by a 2D softmax so that each channel sums to 1. For example, with H=480 and W=640 the average intensity is 1/(480*640), and squaring the errors makes them even smaller. On the other hand, l_p measures the squared distance between predicted and target keypoints. Note that x lies in [0, 640] and y in [0, 480], so squaring the distance makes it even larger. The magnitudes of this squared distance and the heatmap MSE are therefore very different. We found that 0.0002 achieves a good balance.
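To make the scale gap concrete, here is a rough back-of-the-envelope sketch in plain Python. The image size and \beta come from the comment above; the 10-pixel keypoint error is a hypothetical value chosen only for illustration, and the mean intensity is a crude stand-in for the size of a heatmap error rather than an actual l_h value.

```python
H, W = 480, 640   # image size used in the example above
beta = 0.0002     # weight reported in this thread

# A heatmap channel normalized to sum to 1 has this average intensity.
mean_intensity = 1.0 / (H * W)
# Squaring a value of that size (as an MSE does) shrinks it much further.
squared_intensity = mean_intensity ** 2

# Hypothetical keypoint error, in pixels (an assumption for illustration).
pixel_error = 10.0
# The kind of squared-distance term that l_p accumulates.
squared_distance = pixel_error ** 2

print(f"mean heatmap intensity:  {mean_intensity:.3e}")
print(f"squared intensity:       {squared_intensity:.3e}")
print(f"squared pixel distance:  {squared_distance:.1f}")
print(f"beta * squared distance: {beta * squared_distance:.4f}")
```

Under these assumptions the unweighted squared distance is many orders of magnitude larger than a squared heatmap intensity, which is the motivation given above for a small \beta.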
I don't think the (\beta l_p) part and the l_h part are doing the same thing. For example, l_h forces the heatmap to approach the target 2D Gaussian bell shape, while (\beta l_p) forces the heatmaps to have equal mass above and below the target location, and equal mass on the left and right sides of the target location, because of the mechanism in DSNT. The (\beta * l_p) part also adds additional gradient flows/paths to train the model parameters. The result suggests that this can be beneficial.
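The DSNT mechanism mentioned above can be sketched as a soft-argmax: the heatmap is normalized to sum to 1 and the keypoint is read out as the expected coordinate under it. The snippet below is a minimal pure-Python illustration, not the code used in the repository. The helper names (`softmax2d`, `soft_argmax`, `gaussian_logits`) and the toy 9x11 grid are made up for this example, and it uses pixel indices for readability, whereas DSNT as published works with coordinates normalized to (-1, 1).

```python
import math

def softmax2d(logits):
    """Normalize a 2D list of logits so that all entries sum to 1."""
    peak = max(max(row) for row in logits)  # subtract the max for stability
    exps = [[math.exp(v - peak) for v in row] for row in logits]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]

def soft_argmax(heatmap):
    """Expected (x, y) pixel coordinate under a normalized heatmap."""
    x = sum(p * j for row in heatmap for j, p in enumerate(row))
    y = sum(p * i for i, row in enumerate(heatmap) for p in row)
    return x, y

def gaussian_logits(h, w, cx, cy, sigma=1.0):
    """Logits whose softmax is a Gaussian bump centred at (cx, cy)."""
    return [[-((j - cx) ** 2 + (i - cy) ** 2) / (2 * sigma ** 2)
             for j in range(w)] for i in range(h)]

# Toy heatmap: 9 rows by 11 columns, bump centred at x=6, y=3.
heatmap = softmax2d(gaussian_logits(9, 11, cx=6, cy=3))
x, y = soft_argmax(heatmap)

# A squared-distance loss on (x, y) depends only on where the mass is
# balanced, not on the shape of the heatmap; the shape is what l_h constrains.
target_x, target_y = 6.0, 3.0
coord_loss = (x - target_x) ** 2 + (y - target_y) ** 2
print(x, y, coord_loss)
```

Because the read-out is a weighted sum over every pixel, a loss on (x, y) sends gradient to the whole heatmap, which is one way to picture the additional gradient paths described above.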
You have mentioned two major points here.
Your points 1 and 2 contradict each other. In the l_p case you are using heatmaps as the intermediate representation anyway (which is what you describe in 2) and then using DSNT to compute the keypoints, so l_p should not be considerably worse.
Also, if you really mean in 2 that (\beta * l_p) forces the heatmaps to have equal mass above and below the target location, and equal mass on its left and right sides, you can get the same effect by applying an MSE loss against the ground-truth keypoint coordinates (x_gt, y_gt) rather than using a pose loss after DSNT.
I don't understand your argument about why l_p should not be worse than l_h.
About training with ground-truth keypoint coordinates: of course you can do that. But that would still be a 2-stage approach: image-to-keypoints, then, after training, keypoints-to-pose. The point of BPnP is to enable image-to-pose end-to-end training. The purpose of the experiment is to demonstrate the capability of training end-to-end, rather than the necessity of doing so.
The main contribution, i.e. your "invention" of "BPnP": in Table 1, the loss l_p (back-propagatable PnP) seems to be considerably worse than the l_h loss (already published in the ECCV'18 paper "Making Deep Heatmaps Robust to Partial Occlusions for 3D Object Pose Estimation").
Can you please explain why it is worse than l_h? Using l_m = (\beta * l_p) + l_h does not make sense to me, as both terms do the same thing. Also, you have set \beta to a very small value, 0.0002, which means l_m is not affected much by the BPnP loss. The final results of l_m are not much better than those of l_h, and the difference must simply be due to randomness.