Arthur151 / ROMP

Monocular, One-stage, Regression of Multiple 3D People and their 3D positions & trajectories in camera & global coordinates. ROMP [ICCV'21], BEV [CVPR'22], TRACE [CVPR'23]
https://www.yusun.work/
Apache License 2.0

A simple question about camera and coordinate system. #210

Closed · ZhengdiYu closed this 2 years ago

ZhengdiYu commented 2 years ago

Hi, I have a simple question about ROMP. I have been struggling to put people into their correct relative positions. Is it really possible using only the root-aligned SMPL meshes, without predicting their transl? (And if we have the camera intrinsics K, would it be possible?)

  1. What is the coordinate system of the vertices used for rendering? I think the predictions are points in the camera coordinate system but root-aligned, correct?

  2. Following Q1, before rendering the verts onto the image, a translation is added to the verts (cam_trans in projection.py). What is it, and what is estimate_translation actually doing? Is it estimating the root's position? https://github.com/Arthur151/ROMP/blob/e30b7d17f13089fa9fa114df494192e31b0f43ed/romp/lib/visualization/visualization.py#L61

  3. I tried to replace the verts + trans in Q2 with the GT mesh (verts = GT_verts), without any other changes to your code, but the results are not correct. I expected the mesh to fully match the person in the image, but there are always shifts, and I also can't use the same FOV, otherwise the mesh rendered on the image would be very small.

Sorry if I have misunderstood anything. I think rendering is the final part of your code I don't understand. Looking forward to your answer!

Zhengdi

nikkorejz commented 2 years ago

This is an interesting question. I also want to ask about the coordinates of the camera and its settings. @Arthur151

The bottom line is that I converted the model to ONNX and got two tensors, center_maps and params_maps. I want to parse the result and display the 2D points from the ONNX model (see image):

[screenshot]

Another example:

[screenshot]

It is worth noting that all the outputs from the ONNX model were flipped along Y, so I flipped them back manually in the pictures.

It seems that the recognition result is correct, but the camera view is from a different angle.

The backbone is the same for the ONNX and the .pkl file: ResNet-50 (from the repo).

The image (512×512) is the same for the ONNX test and the .pkl test (via romp.predict.image).

But pj2d is different (and cam_trans at least). Furthermore, pj2d is not within [-1, 1]. What am I doing wrong?

Code to parse ONNX result:

    import torch
    # ResultParser and SMPLWrapper come from the ROMP lib (import paths depend on the setup);
    # tensor_from_str is my own helper that parses a whitespace-separated string into a tensor

    # Prepare CenterMaps
    with open('/usr/src/app/dummies/ResNet50ExistThreePersons/ResNet50_CenterMaps.txt') as f:
        center_maps_str = f.readline()  # a single line of space-separated values
    center_maps = tensor_from_str(center_maps_str, delimiter=' ').cuda()
    center_maps = center_maps.reshape([1, 1, 64, 64])

    # Prepare ParamsMaps
    with open('/usr/src/app/dummies/ResNet50ExistThreePersons/ResNet50_ParamsMaps.txt') as f:
        params_maps_str = f.readline()  # a single line of space-separated values
    params_maps = tensor_from_str(params_maps_str, delimiter=' ').cuda()
    params_maps = params_maps.reshape([1, 145, 64, 64])

    # Start parsing
    outputs = {'center_map': center_maps.float(), 'params_maps': params_maps.float()}
    demo_cfg = {'mode': 'parsing', 'calc_loss': False}
    meta_data = {
        'offsets': torch.Tensor([[512., 512., 0., 0., 0., 0., 0., 0., 0., 0.]])
    }
    result_parser = ResultParser()

    outputs, meta_data = result_parser.parse_maps(outputs, meta_data, demo_cfg)
    smpl = SMPLWrapper()
    outputs = smpl(outputs, meta_data)

    # Map pj2d from the normalized [-1, 1] range to 512x512 pixel coordinates
    points = outputs['pj2d'].cpu().detach().numpy()
    result = []
    for subpoints in points:
        result.append((subpoints + [1, 1]) / 2 * [512, 512])
    return result

Thanks.

Arthur151 commented 2 years ago

Hi, Zhengdi, @ZhengdiYu

I'm glad to finally know your "name". Ha~

About your questions 1 & 2: ROMP only estimates the scale of each person and their x-y translation in the image plane. I use a PnP algorithm to estimate the corresponding 3D translation via estimate_translation. The PnP algorithm solves for the 3D translation of a perspective camera by exploiting the mapping between the root-aligned 3D pose and its corresponding 2D pose.
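
For intuition, here is a minimal sketch of the idea (not ROMP's exact code; the 60° FOV and the normalization convention are assumptions): a weak-perspective cam (s, tx, ty) can be lifted to a 3D translation under a pre-defined perspective camera like this:

    import numpy as np

    def cam_to_3d_trans(s, tx, ty, fov_deg=60.0):
        # Lift weak-perspective cam params (scale s, normalized offsets tx, ty in [-1, 1])
        # to a 3D translation under an assumed perspective camera with the given FOV.
        tan_half_fov = np.tan(np.radians(fov_deg / 2.0))
        depth = 1.0 / (s * tan_half_fov)  # larger scale -> person is closer to the camera
        return np.array([tx * depth * tan_half_fov,  # back-project the normalized offsets
                         ty * depth * tan_half_fov,
                         depth])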

  1. Please check whether your GT mesh is root-aligned like the predicted mesh: https://github.com/Arthur151/ROMP/blob/e30b7d17f13089fa9fa114df494192e31b0f43ed/romp/lib/models/smpl.py#L335
Arthur151 commented 2 years ago

@ArtiX-GP , Hi, Nikita G., I guess maybe you overlooked this: https://github.com/Arthur151/ROMP/blob/e30b7d17f13089fa9fa114df494192e31b0f43ed/romp/lib/models/modelv1.py#L48

BTW, we provide the ONNX model in simple-romp; please refer to https://github.com/Arthur151/ROMP/tree/master/simple_romp (--onnx).

yuchen-ji commented 2 years ago

Hi, can ROMP obtain a person's coordinates in camera space? I found that in the exported FBX files the root is always aligned to (0, 0, 0). I have just started working in this area; I hope you can briefly explain~

Arthur151 commented 2 years ago

Yes, it can. Please see here:
https://github.com/Arthur151/ROMP/blob/e30b7d17f13089fa9fa114df494192e31b0f43ed/romp/exports/convert_fbx.py#L247
https://github.com/Arthur151/ROMP/blob/e30b7d17f13089fa9fa114df494192e31b0f43ed/romp/exports/convert_fbx.py#L172
If you don't subtract pelvis_position, you get the absolute position in space.

ZhengdiYu commented 2 years ago

Hi,

I mean that I replaced the predicted mesh + cam_trans with GT_Mesh + its own GT transl, but I can't get equivalent results. Do you mean that I should use GT_Mesh - root_position + GT transl instead of GT_Mesh + its own GT transl?

What is the difference between the estimated cam_trans and the GT transl? I'm just wondering whether there is a way to put the people into the camera coordinate system.

nikkorejz commented 2 years ago

Hi, @Arthur151

In the ROMP model https://github.com/Arthur151/ROMP/blob/master/simple_romp/romp/model.py this line is commented out with the note "not supported by tensorRT":

#cam_maps[:, 0] = torch.pow(1.1,cam_maps[:, 0]) # not supported by tensorRT

Does it really turn out that the model cannot be converted to ONNX? :(

Arthur151 commented 2 years ago

@ZhengdiYu, Zhengdi, please check this function: https://github.com/Arthur151/ROMP/blob/704a5ea7f0e8e5041782622b5fc305dbed9733c3/romp/lib/utils/projection.py#L39 The camera coordinate system is defined by the proj_mat in this function. Therefore, if you want the predicted translation in the GT camera coordinate system, you just need to provide the right proj_mat, which is commonly called the extrinsic & intrinsic camera matrix / camera projection matrix. Once you understand estimate_translation, you will see it can transform the 3D translation from our pre-defined camera space to the target one, like the GT camera coordinate system you want here.
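
The core idea behind such a function, sketched under assumptions (this is not ROMP's exact implementation; joints_3d are root-aligned 3D joints, joints_2d their pixel locations, K the known intrinsics), is a linear least-squares solve for the translation:

    import numpy as np

    def estimate_translation_lsq(joints_3d, joints_2d, K):
        # Solve for t = (tx, ty, tz) so that projecting joints_3d + t with
        # intrinsics K matches joints_2d in the least-squares sense.
        # From u = fx*(X+tx)/(Z+tz) + cx:  fx*tx - (u-cx)*tz = (u-cx)*Z - fx*X
        fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
        n = joints_3d.shape[0]
        A = np.zeros((2 * n, 3))
        b = np.zeros(2 * n)
        for i, ((X, Y, Z), (u, v)) in enumerate(zip(joints_3d, joints_2d)):
            A[2 * i]     = [fx, 0., -(u - cx)]
            A[2 * i + 1] = [0., fy, -(v - cy)]
            b[2 * i]     = (u - cx) * Z - fx * X
            b[2 * i + 1] = (v - cy) * Z - fy * Y
        t, *_ = np.linalg.lstsq(A, b, rcond=None)
        return t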

ZhengdiYu commented 2 years ago


> @ZhengdiYu, Zhengdi, please check this function: https://github.com/Arthur151/ROMP/blob/704a5ea7f0e8e5041782622b5fc305dbed9733c3/romp/lib/utils/projection.py#L39
>
> The camera coordinate system is defined by the proj_mat in this function. Therefore, if you want the predicted translation in the GT camera coordinate system, you just need to provide the right proj_mat, which is commonly called the extrinsic & intrinsic camera matrix / camera projection matrix. Once you understand estimate_translation, you will see it can transform the 3D translation from our pre-defined camera space to the target one, like the GT camera coordinate system you want here.

Thanks! I will look into this; I do have the camera intrinsics.

Arthur151 commented 2 years ago

@ArtiX-GP Come on! You just need to apply it in post-processing; we don't have to put it in the model. BTW, I have made it work: yes, I got the TensorRT model. Please open another issue to discuss other topics.
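
As a minimal sketch of that post-processing step (assuming the raw maps come out of ONNX/TensorRT with the (1, 145, 64, 64) layout shown earlier, and that channel 0 is the cam scale channel, as the commented line suggests):

    import numpy as np

    # params_maps: raw (1, 145, 64, 64) output of the exported model
    # (a random stand-in here; in practice this comes from the ONNX/TensorRT run)
    params_maps = np.random.randn(1, 145, 64, 64).astype(np.float32)

    # Re-apply the exponentiation that was stripped from the exported graph
    params_maps[:, 0] = np.power(1.1, params_maps[:, 0])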

nikkorejz commented 2 years ago

Thanks a lot! I will try :)

ZhengdiYu commented 2 years ago

@Arthur151

Finally: so verts+cam_trans (without proj_mat) is actually not in the true camera coordinate system, right?

If I still want to project the GT mesh onto the image while keeping the rendering code the same as yours, what should I use to replace verts = verts+cam_trans, instead of GT_Mesh + its own GT transl? Is there a way to do so, or should I change the FOV of the camera?

Arthur151 commented 2 years ago

verts+cam_trans is in our predefined camera space.

You can use estimate_translation to convert it back from the GT space to our camera space. I suggest using our new renderer in simple-romp, which is much better.

https://github.com/Arthur151/ROMP/blob/704a5ea7f0e8e5041782622b5fc305dbed9733c3/simple_romp/romp/main.py#L167

Defined in https://github.com/Arthur151/ROMP/blob/master/simple_romp/vis_human/main.py
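
For a quick check, simple-romp can be run roughly like this (a sketch based on the simple-romp README; treat the exact setting names as assumptions):

    import cv2
    import romp  # pip install simple-romp

    settings = romp.main.default_settings
    settings.render_mesh = True  # also render the mesh onto the input image

    model = romp.ROMP(settings)
    outputs = model(cv2.imread('demo.jpg'))  # dict with 'verts', 'cam_trans', 'pj2d', ...
    print(outputs['cam_trans'])  # 3D translation in the pre-defined camera space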

yuchen-ji commented 2 years ago

> Yes, it can. Please see here:
>
> https://github.com/Arthur151/ROMP/blob/e30b7d17f13089fa9fa114df494192e31b0f43ed/romp/exports/convert_fbx.py#L247
>
> https://github.com/Arthur151/ROMP/blob/e30b7d17f13089fa9fa114df494192e31b0f43ed/romp/exports/convert_fbx.py#L172
>
> If you don't subtract pelvis_position, you get the absolute position in space.

Thank you for your guidance! I tried it your way, and the root node's position is now the absolute position in space. But the root node stays locked in place and does not move along. Like this: in the original video he walks to the right along the arrow, but in Blender he is locked at the root node. [screenshot]

ZhengdiYu commented 2 years ago

> verts+cam_trans is in our predefined camera space.
>
> You can use estimate_translation to convert it back from the GT space to our camera space. I suggest using our new renderer in simple-romp, which is much better.
>
> https://github.com/Arthur151/ROMP/blob/704a5ea7f0e8e5041782622b5fc305dbed9733c3/simple_romp/romp/main.py#L167
>
> Defined in https://github.com/Arthur151/ROMP/blob/master/simple_romp/vis_human/main.py

Got it! Thanks so much for the clarification~

Arthur151 commented 2 years ago

Er... if you want the root to move along, just uncomment this line: https://github.com/Arthur151/ROMP/blob/e30b7d17f13089fa9fa114df494192e31b0f43ed/romp/exports/convert_fbx.py#L190

yuchen-ji commented 2 years ago

Hi, thank you very much for your answer! I'd like to ask a few more questions; I may have misunderstood some things:

  1. I noticed that your paper uses a weak-perspective camera. In that case, can the depth of each person in the camera coordinate system still be estimated? (It seems a weak-perspective camera uses an average depth, so wouldn't every person's depth be the same?)
  2. If I want to use known camera intrinsics, how should I modify the code to obtain a person's absolute position in the camera coordinate system? Is retraining needed? Thanks again for your answer~
Arthur151 commented 2 years ago

1. If you need depth, please use our newly open-sourced BEV, which adopts a pre-defined perspective-projection scheme and outputs better depth information. 2. No retraining is needed. Please see my discussion with Zhengdi above, which already explains how to obtain the 3D translation in the corresponding camera space with known camera intrinsics. If anything is unclear, feel free to ask me again.
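
On point 2, as a hedged illustration of why no retraining is needed (my own simplification, assuming the principal point sits at the image center and ignoring distortion): re-explaining the same 2D observation with the real focal length mostly rescales the estimated depth:

    import numpy as np

    def retarget_trans(trans, fx_real, fov_deg_assumed=60.0, img_size=512):
        # trans: (x, y, z) estimated under the assumed-FOV camera.
        # Swapping in focal length fx_real (pixels) for the same 2D evidence
        # keeps x, y and rescales depth by the focal-length ratio.
        f_assumed = (img_size / 2.0) / np.tan(np.radians(fov_deg_assumed / 2.0))
        return np.array([trans[0], trans[1], trans[2] * fx_real / f_assumed])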

yuchen-ji commented 2 years ago

Thank you for your answer, I'll give it a try! After using your ROMP, I found it also outputs depth information; is it computed from the scale?~

sylyt62 commented 2 years ago

@Arthur151 Hey! I'm confused. Why don't I have the 'cam_trans' that you are talking about? Is it equivalent to 'cam'?

>>> data[0][0].keys()
dict_keys(['params', 'centers_pred', 'centers_conf', 'verts', 'joints', 'smpl_face'])
>>> data[0][0]['params'].keys()
dict_keys(['cam', 'global_orient', 'body_pose', 'betas', 'poses'])
Arthur151 commented 2 years ago

Hi, @sylyt62

Could you please provide the code/command that you ran to get the results?

sylyt62 commented 2 years ago

I followed the instructions:

romp --mode=video --calc_smpl --render_mesh --input=.\demo\videos\camela1.mp4 --save_path=.\demo\videos\camela1_virtual2\results.mp4

Arthur151 commented 2 years ago

pip install --upgrade simple-romp

This could be missing from an old version before 0.1.0.

sylyt62 commented 2 years ago

Indeed! I got it, thx~