Arthur151 / ROMP

Monocular, One-stage, Regression of Multiple 3D People and their 3D positions & trajectories in camera & global coordinates. ROMP[ICCV21], BEV[CVPR22], TRACE[CVPR2023]
https://www.yusun.work/
Apache License 2.0

World space translation #402

Open zhewei-mt opened 1 year ago

zhewei-mt commented 1 year ago

Hello there, thanks for your amazing work; it really helps me a lot. I have some questions about the output of ROMP.

  1. I notice there are two camera-related outputs, "cam" and "cam_trans", both of dimension 3. What is the difference between the two?
  2. My goal is to get a person's translation in world coordinates. How can I use the output of ROMP for that? Any help would be appreciated!

Arthur151 commented 1 year ago

Thanks for your kind words! Good questions.

  1. "cam" is a normalized form of "cam_trans"; it is easier for the model to predict. In BEV, "cam_trans" is the 3D human translation in a predefined camera space (FOV = 60 degrees).
  2. To convert the translation from our predefined camera space to a real one, you need the camera intrinsics. Then you can solve PnP on our predicted 2D pose and 3D pose pairs to obtain the 3D human translation in the real camera space. This function should help you achieve this: https://github.com/Arthur151/ROMP/blob/a349c6bf4d6229b8fba2e900d38ef210888937d0/simple_romp/romp/utils.py#L331
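To illustrate the idea in point 2, here is a minimal sketch of recovering a translation from 2D/3D joint pairs once the real intrinsics are known. It is not the function linked above and not a full PnP solver: it assumes the 3D joints are already expressed in camera-aligned axes (so only the translation is unknown), a pinhole camera without lens distortion, and it solves the resulting linear least-squares problem directly. The name `fit_translation_linear` and all numeric values are hypothetical.

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection of a 3D point (camera coordinates) to pixels."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)

def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            factor = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= factor * M[col][c]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        x[i] = (M[i][3] - sum(M[i][j] * x[j] for j in range(i + 1, 3))) / M[i][i]
    return x

def fit_translation_linear(joints_3d, joints_2d, fx, fy, cx, cy):
    """Least-squares translation t such that project(X + t) matches joints_2d.

    From u - cx = fx * (X + tx) / (Z + tz) it follows that
    fx * tx - (u - cx) * tz = (u - cx) * Z - fx * X (and likewise for v),
    which is linear in t = (tx, ty, tz).
    """
    rows, rhs = [], []
    for (X, Y, Z), (u, v) in zip(joints_3d, joints_2d):
        rows.append([fx, 0.0, -(u - cx)])
        rhs.append((u - cx) * Z - fx * X)
        rows.append([0.0, fy, -(v - cy)])
        rhs.append((v - cy) * Z - fy * Y)
    # Normal equations: (A^T A) t = A^T b
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Atb = [sum(r[i] * b for r, b in zip(rows, rhs)) for i in range(3)]
    return solve3(AtA, Atb)

# Synthetic check with hypothetical intrinsics and root-relative joints.
fx, fy, cx, cy = 1000.0, 1000.0, 640.0, 360.0
joints_3d = [(0.0, 0.0, 0.0), (0.2, -0.5, 0.1), (-0.2, -0.5, -0.1), (0.1, 0.4, 0.05)]
true_t = (0.3, -0.1, 4.0)
joints_2d = [project((X + true_t[0], Y + true_t[1], Z + true_t[2]), fx, fy, cx, cy)
             for X, Y, Z in joints_3d]
est_t = fit_translation_linear(joints_3d, joints_2d, fx, fy, cx, cy)
```

With noise-free synthetic data the estimate should match `true_t` up to floating-point error. On real predictions a general PnP solver that also estimates rotation and handles outliers is likely the more robust choice.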
zhewei-mt commented 1 year ago

Thanks for the quick reply. I found the code that converts "cam" to "cam_trans" using the convert_cam_to_3d_trans function. I also noticed another conversion here, with the code: trans = [cam_tx, cam_ty, 2*FOCAL_LENGTH/(CROP_SIZE*cam_s + 1e-9)]. What is the difference between these two? Also, I am able to get a reasonable world translation in UE5 using "cam_trans", but things go wrong when I try the conversion code above. Can you explain a little bit?
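For concreteness, the expression quoted above can be evaluated as in the sketch below, reading its products as 2*FOCAL_LENGTH and CROP_SIZE*cam_s. The crop size of 512 and the sample "cam" values are hypothetical, and deriving FOCAL_LENGTH from the 60-degree FOV with a pinhole model is an assumption, not something confirmed in this thread. Under that assumption the depth term reduces to 1 / (cam_s * tan(FOV / 2)), so the result depends only on the scale and the FOV.

```python
import math

def focal_from_fov(fov_deg, image_size):
    """Pinhole focal length in pixels for a square image and a given FOV."""
    return (image_size / 2.0) / math.tan(math.radians(fov_deg) / 2.0)

def weak_perspective_to_trans(cam_s, cam_tx, cam_ty, focal_length, crop_size):
    """Evaluate the quoted formula: depth = 2 * f / (crop_size * s)."""
    depth = 2.0 * focal_length / (crop_size * cam_s + 1e-9)
    return [cam_tx, cam_ty, depth]

CROP_SIZE = 512                                 # hypothetical crop size
FOCAL_LENGTH = focal_from_fov(60.0, CROP_SIZE)  # assumes FOV = 60 degrees

# Hypothetical weak-perspective parameters (scale, x offset, y offset).
trans = weak_perspective_to_trans(0.5, 0.1, -0.2, FOCAL_LENGTH, CROP_SIZE)

# Same depth written without focal length or crop size.
depth_from_fov = 1.0 / (0.5 * math.tan(math.radians(60.0) / 2.0))
```

If two conversions use different focal lengths, crop sizes, or scale conventions, their depths will differ by a constant factor, which is one thing to check when the results disagree.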