main contribution of paper seems to work worse

wahrheit-git commented 4 years ago

The main contribution, i.e., your "invention" of "BPnP", seems to work considerably worse: in Table 1, the loss l_p (backpropagatable PnP) performs considerably worse than the l_h loss (already published in the ECCV'18 paper "Making Deep Heatmaps Robust to Partial Occlusions for 3D Object Pose Estimation").

Can you please explain why it is worse than l_h? Using l_m = (\beta * l_p) + l_h does not make sense, as both terms do the same thing. Also, you have set \beta to a very small value, 0.0002, which means l_m is hardly affected by the BPnP loss. The final results of l_m are not much better than those of l_h, and the difference must be just due to randomness.

BoChenYS commented 4 years ago

l_p performing worse than l_h is not surprising at all, as heatmaps are 2D feature maps (in the form of H*W matrices) while keypoint coordinates are just 2 scalars like (x, y). Training the model using target heatmaps exploits much richer spatial information than training the model only using target keypoint coordinates, which is why many keypoint regression models use heatmaps.

The setting of \beta is meant to balance the magnitudes of the two losses, which differ greatly. Heatmap intensities are normalized by a 2D softmax so that each channel sums to 1. That is, say H=480 and W=640, then the average intensity would be 1/(480*640), and squaring the errors makes them even smaller. On the other hand, l_p measures the squared distance between predicted and target keypoints. Note that x lies in [0, 640] and y in [0, 480], and squaring the distance makes it even greater. So the magnitudes of this squared distance and the heatmap MSE are very different. We found that 0.0002 achieves a good balance.
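To make the magnitude gap concrete, here is a rough numeric sketch. It is not taken from the BPnP code: the Gaussian width `sigma`, the example keypoint locations, and the use of a mean (rather than a sum) over pixels for the heatmap MSE are all assumptions chosen for illustration. Following the description above, the coordinate term is treated simply as a squared pixel distance.

```python
import numpy as np

H, W = 480, 640  # image size from the comment above


def gaussian_heatmap(cx, cy, sigma=2.0):
    """Gaussian blob normalized to sum to 1, like a softmax-normalized channel.

    sigma is an assumed value, not one taken from the paper or the repo.
    """
    ys, xs = np.mgrid[0:H, 0:W]
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()


# Hypothetical target and prediction, 10 px apart in x and 5 px apart in y.
target_xy = np.array([320.0, 240.0])
pred_xy = np.array([330.0, 245.0])

target_hm = gaussian_heatmap(*target_xy)
pred_hm = gaussian_heatmap(*pred_xy)

# Average intensity of a channel that sums to 1 is 1 / (H * W).
mean_intensity = target_hm.mean()

# Heatmap-style loss: mean squared error over all pixels (assumed reduction).
heatmap_mse = np.mean((pred_hm - target_hm) ** 2)

# Coordinate-style loss: squared pixel distance, 10**2 + 5**2 = 125.
coord_sq_dist = np.sum((pred_xy - target_xy) ** 2)

print("mean intensity :", mean_intensity)
print("heatmap MSE    :", heatmap_mse)
print("squared dist   :", coord_sq_dist)
print("ratio          :", coord_sq_dist / heatmap_mse)
```

In this toy setting the two terms differ by many orders of magnitude, which is the kind of imbalance a small \beta is meant to offset. The exact ratio depends heavily on the assumed `sigma` and loss reduction, so it says nothing about whether 0.0002 is the right value.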

I don't think the (\beta * l_p) part and the l_h part are doing the same thing. For example, l_h forces the heatmap to approach the target 2D Gaussian bell shape, while (\beta * l_p) forces the heatmaps to have equal mass above and below the target location, and equal mass on the left and right sides of the target location, because of the mechanism in DSNT. The (\beta * l_p) part also adds additional gradient flows/paths to train the model parameters. The results suggest that this can be beneficial.
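The difference between the two constraints can be illustrated with a small sketch. It assumes a simplified DSNT that returns the heatmap's expected coordinate in pixel units (the published DSNT layer uses normalized coordinates), and the heatmaps, sizes, and function names below are made up for the example. A two-blob heatmap balanced around the target gives the same DSNT coordinate as the target Gaussian, so a coordinate loss is satisfied, while the heatmap MSE still penalizes its shape.

```python
import numpy as np


def dsnt(heatmap):
    """Expected (x, y) of a heatmap that sums to 1, in pixel units.

    Simplified stand-in for DSNT; not the BPnP implementation.
    """
    h, w = heatmap.shape
    x = (heatmap.sum(axis=0) * np.arange(w)).sum()  # marginal over columns
    y = (heatmap.sum(axis=1) * np.arange(h)).sum()  # marginal over rows
    return np.array([x, y])


def gaussian(h, w, cx, cy, sigma):
    """Gaussian blob normalized to sum to 1."""
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()


H, W = 48, 64  # small hypothetical heatmap
target_xy = np.array([32.0, 24.0])

# Target: a single bell shape centered on the keypoint.
target = gaussian(H, W, 32, 24, sigma=1.5)

# Two blobs, 10 px left and right of the target, each with half the mass.
bimodal = 0.5 * gaussian(H, W, 22, 24, 1.5) + 0.5 * gaussian(H, W, 42, 24, 1.5)

coord_loss_target = np.sum((dsnt(target) - target_xy) ** 2)
coord_loss_bimodal = np.sum((dsnt(bimodal) - target_xy) ** 2)
heatmap_mse_bimodal = np.mean((bimodal - target) ** 2)

print("coordinate loss, bell shape :", coord_loss_target)
print("coordinate loss, two blobs  :", coord_loss_bimodal)
print("heatmap MSE, two blobs      :", heatmap_mse_bimodal)
```

Both heatmaps drive the coordinate loss to essentially zero, but only the bell shape drives the heatmap MSE to zero. In that sense a coordinate-based term constrains where the mass is balanced, not what shape it takes.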

wahrheit-git commented 4 years ago

You have mentioned two major points here.

1. l_p is worse because it only learns two scalars; l_h works better because target heatmaps exploit much richer spatial information.
2. l_m learns better because (\beta * l_p) forces the heatmaps to have equal mass above and below the target location, and equal mass on the left and right sides of the target location.

Your points 1 and 2 contradict each other: since in the l_p case you are using heatmaps as the intermediate representation anyway (which is what you describe in 2) and then use DSNT to compute the keypoints, l_p should not be considerably worse.

Another thing: if you really mean in 2 that (\beta * l_p) forces the heatmaps to have equal mass above and below the target location, and equal mass on the left and right sides of the target location, you can achieve that simply by using an MSE on the ground-truth keypoint coordinates, i.e., x_gt and y_gt, rather than using the pose loss after the DSNT.

BoChenYS commented 4 years ago

I don't understand your argument about why l_p should not be worse than l_h.

When training with l_h, it creates target heatmaps with a 2D Gaussian "bell shape".
When training with l_p, it doesn't enforce this 2D Gaussian "bell shape" target; it simply guides the heatmap to be horizontally weighted at the target x and vertically weighted at the target y. Aren't these two constraints different? Why should they have the same performance?

About training with ground-truth keypoint coordinates: of course you can do that. But that would still be a 2-stage approach: image-to-keypoints, then, after training, keypoints-to-pose. The point of BPnP is to enable end-to-end image-to-pose training. The purpose of the experiment is to demonstrate the capability of end-to-end training, rather than its necessity.

BoChenYS / BPnP

main contribution of paper seems to work worse #2