Arthur151 / ROMP

Monocular, One-stage, Regression of Multiple 3D People and their 3D positions & trajectories in camera & global coordinates. ROMP[ICCV21], BEV[CVPR22], TRACE[CVPR2023]
https://www.yusun.work/
Apache License 2.0
1.33k stars 229 forks

Why sample smpl / camera directly from parameter map? #284

Open areiner222 opened 2 years ago

areiner222 commented 2 years ago

Hi,

Really impressive work with ROMP and BEV!

I was wondering, for both implementations: why do you create a dense parameter map for things like smpl rotations / body center translations and then sample from it at the estimated body center indices? The alternative would be to sample from the final feature map at your inferred centers and then predict the smpl rotations / centers from the sampled feature vector.

Have you done any experimentation around this with respect to model training? Can you think of any advantages / disadvantages of either approach?

Thank you and amazing work!

Arthur151 commented 2 years ago

Yes, BEV works in the sampling way you described: it samples feature vectors at the centers and then predicts the final results.

ROMP is designed to have the simplest possible architecture and implementation. To sample at the body center position, you first need to parse the 2D center position out of the 2D body center heatmap and then sample the feature vector there, which is not the simplest choice to implement. I have tried both implementations in ROMP and BEV, and the performance of the two design choices is nearly the same.
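The two designs discussed above can be compared in a small NumPy sketch. It assumes the prediction head is a single per-pixel linear map (equivalent to a 1x1 convolution), which is a simplification and not necessarily the head used in ROMP or BEV; the names `feature_map`, `weight`, `bias`, and `centers`, and all sizes except the 142-length parameter vector mentioned in this thread, are hypothetical. Under that assumption, sampling a dense parameter map and regressing from sampled feature vectors give identical outputs, which is one possible reason the two choices perform similarly.

```python
import numpy as np

rng = np.random.default_rng(0)

C, H, W = 32, 64, 64   # feature channels and map size (illustrative values)
P = 142                # parameter length mentioned in this thread

feature_map = rng.standard_normal((C, H, W))

# Hypothetical per-pixel linear head, i.e. a 1x1 convolution.
weight = rng.standard_normal((P, C))
bias = rng.standard_normal(P)

# (row, col) positions of two detected body centers (made-up values).
centers = np.array([[10, 20], [40, 5]])
rows, cols = centers[:, 0], centers[:, 1]

# Design A: predict a dense parameter map, then sample it at the centers.
param_map = np.einsum("pc,chw->phw", weight, feature_map) + bias[:, None, None]
params_a = param_map[:, rows, cols].T      # shape (N, P)

# Design B: sample feature vectors at the centers, then regress parameters.
feats = feature_map[:, rows, cols].T       # shape (N, C)
params_b = feats @ weight.T + bias         # shape (N, P)

# With a purely per-pixel head the two designs agree.
assert np.allclose(params_a, params_b)
```

With a deeper per-vector head (for example an MLP applied after sampling), the two designs are no longer strictly identical, so this equivalence should be read only as intuition.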

areiner222 commented 2 years ago

Thanks for your reply.

So to confirm,

  1. In BEV, the feature vector is extracted at the x,y pixel location of the final x,y,z offsets, and a transformation then maps the indexed feature vectors to the smpl-a parameters.

  2. In ROMP, the indices of the argmax-pooled heatmap centers are used to index into a parameter map of the 142-length 6d smpl rotational parameters, and the result is fed directly to a smpl head. Presumably, as I think you mentioned you tried at one point, you could instead have indexed the feature vector at the body centers and then regressed the 6d rotations from the indexed feature vector?

And one last question - I noticed you did some pre-training in ROMP based on 2d keypoints / center heatmap. Did you do the same for BEV?
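The ROMP-style lookup in point 2 can be sketched as follows: find local maxima of the center heatmap with a max-pooling comparison, then use their indices to read parameter vectors out of the dense map. This is a minimal NumPy sketch rather than ROMP's actual code; the helper `parse_centers`, the threshold of 0.25, the 3x3 window, and the toy 8x8 heatmap are all assumptions made for illustration.

```python
import numpy as np

def parse_centers(heatmap, threshold=0.25, kernel=3):
    """Return (row, col) indices of local maxima above `threshold`."""
    pad = kernel // 2
    h, w = heatmap.shape
    padded = np.pad(heatmap, pad, mode="constant", constant_values=-np.inf)
    # Max over every kernel x kernel neighbourhood, built from shifted views.
    windows = [padded[dy:dy + h, dx:dx + w]
               for dy in range(kernel) for dx in range(kernel)]
    local_max = np.max(windows, axis=0)
    keep = (heatmap == local_max) & (heatmap > threshold)
    return np.argwhere(keep)

# Toy heatmap with two peaks and one weaker neighbour that should be suppressed.
heatmap = np.zeros((8, 8))
heatmap[2, 3] = 0.9
heatmap[2, 4] = 0.5   # next to the stronger peak, so not a local maximum
heatmap[6, 5] = 0.8

centers = parse_centers(heatmap)

# Dense parameter map with one 142-length vector per pixel (random stand-in).
param_map = np.random.default_rng(0).standard_normal((142, 8, 8))
params = param_map[:, centers[:, 0], centers[:, 1]].T   # shape (N, 142)
```

Note that exact ties on a plateau would all be kept by this simple comparison; a real implementation may handle that case differently.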

Arthur151 commented 2 years ago

  1. Yes, BEV samples in that way. BEV needs to integrate the depth encoding vector to be more discriminative in depth, so sampling the 2D feature vector out is more convenient for that.

  2. Yes. For ROMP, I think the current solution is more elegant, and the experimental results show a slight (not very obvious) improvement in performance.

  3. Yes, BEV starts from the ROMP pretraining, because they share the same backbone and inference logic. It would take a very long time to converge without 2D keypoint estimation pretraining. You can also directly use the pretrained weights of existing 2D keypoint estimation methods, like Higher HRNet-32 for HRNet and Simple-baseline for ResNet-50. They work in exactly the same way. I have also tried training from their pretrained checkpoints and it works smoothly.
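To illustrate why sampling the feature vector first makes it convenient to integrate a depth encoding, here is a minimal NumPy sketch. It assumes the depth encoding is a learned embedding per discrete depth bin that is concatenated with the sampled 2D feature before regression; this fusion scheme, the names `depth_embedding` and `weight`, and all sizes other than 142 are hypothetical and not taken from the BEV code.

```python
import numpy as np

rng = np.random.default_rng(0)

C, H, W = 32, 64, 64   # feature channels and map size (illustrative values)
D, E = 8, 16           # number of depth bins and embedding size (hypothetical)
P = 142                # parameter length mentioned in this thread

feature_map = rng.standard_normal((C, H, W))
depth_embedding = rng.standard_normal((D, E))   # hypothetical learned table
weight = rng.standard_normal((P, C + E))        # hypothetical linear head
bias = rng.standard_normal(P)

# Two people at the same 2D position but in different depth bins:
# each row is (row, col, depth_bin).
centers = np.array([[10, 20, 2], [10, 20, 6]])
rows, cols, bins = centers.T

feats = feature_map[:, rows, cols].T     # (N, C), identical for both people
depth_feats = depth_embedding[bins]      # (N, E), differs per depth bin
fused = np.concatenate([feats, depth_feats], axis=1)
params = fused @ weight.T + bias         # (N, P)
```

In this sketch the two people share the same 2D feature vector, yet the fused vectors differ, so the head can produce different parameters for each of them. A single 2D parameter map indexed only by (row, col) could return just one vector for that pixel.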