Arthur151 / ROMP

Monocular, One-stage, Regression of Multiple 3D People and their 3D positions & trajectories in camera & global coordinates. ROMP[ICCV21], BEV[CVPR22], TRACE[CVPR2023]
https://www.yusun.work/
Apache License 2.0
1.31k stars, 228 forks

unstable 3D translation (especially along depth) for monocular pose estimation #285

Open ZhengdiYu opened 2 years ago

ZhengdiYu commented 2 years ago

Hi, long time no see. Glad to see your BEV contribution. I have been playing with ROMP + Blender and developing something interesting. But the translation along the depth direction always shakes, so we can only place the camera directly in front of the character; from any other angle the result looks a bit ugly.

I have two questions now:

  1. In ROMP, I remember you said that you determine the depth order by the scale. But as far as I recall, you just add cam_trans to the verts and then go directly into the rendering process. I didn't see any code that "determines the depth order". Do you perhaps regard the translation recovery step, verts += cam_trans, as the "determine depth order" process?

  2. In BEV, do we have a more stable 3D translation now? Will it still shake along the depth direction? I haven't tried BEV with Blender yet. I'll try this later.
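
To make question 1 concrete, here is a minimal sketch of the two steps being contrasted: placing each mesh with `verts += cam_trans`, and separately sorting people by the z component of their translation to get a back-to-front order. The function names and data layout are hypothetical, chosen for illustration, and are not taken from the ROMP code base.

```python
# Hypothetical sketch: names and data layout are illustrative,
# not taken from the ROMP code base.

def translate_verts(verts, cam_trans):
    """Add one person's camera-space translation (x, y, z) to every vertex."""
    tx, ty, tz = cam_trans
    return [(x + tx, y + ty, z + tz) for x, y, z in verts]


def depth_order(cam_trans_list):
    """Return person indices sorted from farthest to nearest (largest z first)."""
    return sorted(
        range(len(cam_trans_list)),
        key=lambda i: cam_trans_list[i][2],
        reverse=True,
    )


# Two people, each reduced to a two-vertex "mesh" centred on the origin.
people_verts = [
    [(0.0, 0.0, 0.0), (0.1, 0.2, 0.0)],
    [(0.0, 0.0, 0.0), (0.1, 0.2, 0.0)],
]
people_trans = [(0.5, 0.0, 3.0), (-0.5, 0.0, 6.0)]

# Step 1: translation recovery, the equivalent of verts += cam_trans.
placed = [translate_verts(v, t) for v, t in zip(people_verts, people_trans)]

# Step 2: explicit ordering by depth, independent of step 1.
order = depth_order(people_trans)
```

With a z-buffered renderer, step 1 alone is enough to resolve occlusion; an explicit ordering like step 2 only matters when the renderer draws meshes one after another.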

Arthur151 commented 2 years ago

Hi, Zhengdi Yu, long time no see.

  1. About the depth ordering during rendering, maybe this is more clear: https://github.com/Arthur151/ROMP/blob/e4613fd564cd632ac531228b94f949eaed76345e/simple_romp/vis_human/main.py#L40
  2. Yes, I think BEV is more stable in depth. But I am still working on the depth shaking problem. I hope I can solve this in the next version.

ZhengdiYu commented 2 years ago

Sorry for my late reply.

So it seems that only the 'sim3dr' renderer uses this depth order? This repo has changed a lot since I last visited it, so sorry if I have misunderstood. I think ROMPv1 didn't have the depth_order logic, since there were only pyrender and pytorch3d back then. It seems that in the latest version, when converting cam to trans3d, the depth is always computed from the scale 's'. But I think ROMPv1 used PnP instead of simulating depth from 's'. Am I understanding this correctly?
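
For readers unfamiliar with the scale-based approach, the sketch below shows one common way to turn weak-perspective camera parameters (s, tx, ty) into a 3D translation, with depth taken as proportional to 1/s. The exact formula, the function names, and the sample values are assumptions for illustration; they are not confirmed to match what ROMP does. The sketch also shows why such a depth can shake: the same small error in s shifts the depth of a far person (small s) much more than that of a near one.

```python
# Hypothetical sketch of a weak-perspective (s, tx, ty) -> 3D translation
# conversion. The 1/s depth model and all names are assumptions.

def cam_to_trans3d(s, tx, ty):
    """Map weak-perspective parameters to (x, y, z), with depth = 1 / s."""
    if s <= 0:
        raise ValueError("scale must be positive")
    depth = 1.0 / s
    return (tx / s, ty / s, depth)


def depth_change(s, ds):
    """Depth change caused by perturbing the scale from s to s + ds."""
    return cam_to_trans3d(s + ds, 0.0, 0.0)[2] - cam_to_trans3d(s, 0.0, 0.0)[2]


# The same scale error of 0.01 applied to a near and a far person.
near_shift = abs(depth_change(0.50, 0.01))  # large s: close to the camera
far_shift = abs(depth_change(0.10, 0.01))   # small s: far from the camera
```

Because depth is 1/s, its sensitivity to the scale grows as 1/s², so frame-to-frame noise in the predicted scale shows up mostly along the depth axis.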

Q2. There are two ways to calculate the translation in simple-romp. I believe the second one is the old method used in ROMPv1, the version I'm familiar with. I also remember that the first method was marked as 'wrong' in base_predictor.py. However, I think the first method is now used in ROMP (not simple-romp). Is there a reason you picked this method again for the latest ROMP but kept the old one for simple-romp?

Arthur151 commented 2 years ago

Yes. Since sim3dr is used for rendering, I think directly converting cam to trans3d is accurate enough for that purpose. Therefore, I use it directly.

But BEV has made some progress in depth reasoning. So I just use BEV for depth estimation.