This is an interesting question. I also want to ask about the coordinates of the camera and its settings. @Arthur151
The bottom line is that I converted the model to ONNX and got 2 tensors: center_maps and params_maps. I want to parse the result and display 2D points from the ONNX model (see image).
Another example:
It is worth noting that all the pictures from ONNX came out flipped in Y, so I flipped them back manually in the pictures.
It seems that the recognition result is correct, but the camera view is from a different angle.
The backbone is the same for ONNX and .pkl file - ResNet50 (from repo)
The image (512*512) is the same for the ONNX test and the .pkl test (via romp.predict.image), but pj2d is different (and cam_trans, at least). Furthermore, pj2d is not in [-1, 1]. What am I doing wrong?
Code to parse ONNX result:
import torch

# ResultParser and SMPLWrapper come from the ROMP repo; tensor_from_str is a small
# helper of mine that parses a whitespace-separated string into a tensor
# (imports omitted here for brevity).

# Prepare CenterMaps
with open('/usr/src/app/dummies/ResNet50ExistThreePersons/ResNet50_CenterMaps.txt') as f:
    center_maps_str = f.readline()  # just a string with space-separated values
center_maps = tensor_from_str(center_maps_str, delimiter=' ').cuda()
center_maps = center_maps.reshape([1, 1, 64, 64])

# Prepare ParamsMaps
with open('/usr/src/app/dummies/ResNet50ExistThreePersons/ResNet50_ParamsMaps.txt') as f:
    params_maps_str = f.readline()  # just a string with space-separated values
params_maps = tensor_from_str(params_maps_str, delimiter=' ').cuda()
params_maps = params_maps.reshape([1, 145, 64, 64])

# Start parsing
outputs = {'center_map': center_maps.float(), 'params_maps': params_maps.float()}
demo_cfg = {'mode': 'parsing', 'calc_loss': False}
meta_data = {
    'offsets': torch.Tensor([[512., 512., 0., 0., 0., 0., 0., 0., 0., 0.]])
}
result_parser = ResultParser()
outputs, meta_data = result_parser.parse_maps(outputs, meta_data, demo_cfg)

smpl = SMPLWrapper()
outputs = smpl(outputs, meta_data)

# pj2d is expected in [-1, 1]; map it to 512x512 pixel coordinates
points = outputs['pj2d'].cpu().detach().numpy()
result = []
for i, subpoints in enumerate(points):
    result.append((subpoints + [1, 1]) / 2 * [512, 512])
return result
Thanks.
Hi, Zhengdi, @ZhengdiYu
I'm glad to finally know your "name". Ha~
About your question:
1 & 2. In ROMP, it only estimate the scale of people and their x-y translation in image plane. I use PnP algorithm to estimate the corresponding 3D translation via estimate_translation
. PnP algorithm solves the 3D translation of a perspective camera via exploring the mapping function between root-aligned 3D pose and its corresponding 2D pose.
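To make the PnP step concrete, here is a rough sketch (not the repository's exact implementation; the function name, default focal length and image size below are illustrative placeholders) of solving for a translation t such that the root-aligned 3D joints, shifted by t and projected through a simple pinhole camera, land on the given 2D joints. The perspective-projection equations are linear in t, so a least-squares solve is enough:

import numpy as np

def estimate_translation_sketch(joints_3d, joints_2d, focal_length=443.4, img_size=512.0):
    # Illustrative least-squares solve for t = (tx, ty, tz) such that projecting
    # (joints_3d + t) with a pinhole camera (focal_length, principal point at the
    # image centre) matches joints_2d in pixels. Not the repository's exact code.
    num_joints = joints_3d.shape[0]
    cx = cy = img_size / 2.0  # principal point at the image centre

    # For a joint (X, Y, Z) observed at (u, v):
    #   f*(X + tx)/(Z + tz) + cx = u   and   f*(Y + ty)/(Z + tz) + cy = v
    # Multiplying out gives two equations per joint that are linear in (tx, ty, tz):
    #   f*tx - (u - cx)*tz = (u - cx)*Z - f*X
    #   f*ty - (v - cy)*tz = (v - cy)*Z - f*Y
    A = np.zeros((2 * num_joints, 3))
    b = np.zeros(2 * num_joints)
    X, Y, Z = joints_3d[:, 0], joints_3d[:, 1], joints_3d[:, 2]
    u, v = joints_2d[:, 0], joints_2d[:, 1]
    A[0::2, 0] = focal_length
    A[0::2, 2] = -(u - cx)
    A[1::2, 1] = focal_length
    A[1::2, 2] = -(v - cy)
    b[0::2] = (u - cx) * Z - focal_length * X
    b[1::2] = (v - cy) * Z - focal_length * Y

    trans, *_ = np.linalg.lstsq(A, b, rcond=None)
    return trans  # (tx, ty, tz)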
@ArtiX-GP , Hi, Nikita G., I guess maybe you overlooked this: https://github.com/Arthur151/ROMP/blob/e30b7d17f13089fa9fa114df494192e31b0f43ed/romp/lib/models/modelv1.py#L48
B.T.W., we provide the ONNX model in simple-romp, please refer to https://github.com/Arthur151/ROMP/tree/master/simple_romp --onnx
Hello, may I ask whether ROMP can give a person's coordinates in camera space? I found that the root of the exported FBX is always aligned to (0, 0, 0). I have just started working in this area, I hope you can briefly explain~
Hi,
I mean that I replaced the predicted mesh + cam_trans with GT_Mesh + its own GT transl, but I can't get equivalent results. Do you mean that I should use GT_Mesh - root_position + GT transl instead of GT_Mesh + its own GT transl?
What is the difference between the estimated cam_trans and the GT transl? I'm just wondering whether there is a way to put the people into the camera coordinate system.
Hi, @Arthur151
In the ROMP model https://github.com/Arthur151/ROMP/blob/master/simple_romp/romp/model.py this line is commented out with the note "not supported by tensorRT":
#cam_maps[:, 0] = torch.pow(1.1,cam_maps[:, 0]) # not supported by tensorRT
Does it really turn out that the model cannot be converted to ONNX? :(
@ZhengdiYu , Zhengdi,
Please check this function:
https://github.com/Arthur151/ROMP/blob/704a5ea7f0e8e5041782622b5fc305dbed9733c3/romp/lib/utils/projection.py#L39
Camera coordinate system is defined by the proj_mat in this function. Therefore, if you want to get the predicted translation in the GT camera coordinate system, you just need to provide the right proj_mat, which is commonly called the extrinsic & intrinsic camera matrix / camera projection matrix.
If you understand estimate_translation, you will see that it can transform the 3D translation from our pre-defined camera space to a target one, such as the GT camera coordinate system you want here.
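For what "the right proj_mat" can look like in practice, here is a hedged sketch (the numbers and variable names are placeholders, not values taken from ROMP) of assembling the intrinsic matrix from known fx, fy, cx, cy and combining it with a world-to-camera extrinsic into a 3x4 camera projection matrix:

import numpy as np

# Placeholder intrinsics; replace with the GT camera's calibration.
fx, fy, cx, cy = 1498.0, 1498.0, 512.0, 512.0
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

R = np.eye(3)              # world-to-camera rotation (extrinsic)
t = np.zeros((3, 1))       # world-to-camera translation (extrinsic)
proj_mat = K @ np.hstack([R, t])   # 3x4 camera projection matrix

# Projecting a 3D point p (world coordinates, homogeneous) onto the image plane:
p = np.array([0.0, 0.0, 3.0, 1.0])
uvw = proj_mat @ p
u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]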
Thanks! I will look into this, I do have the camera intrinsic.
@ArtiX-GP Come on! You just need to put it in post-processing. We don't have to put it in the model. B.T.W., I have made it work. Yes, I got the TensorRT model. Please open another issue to discuss other issues.
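To spell out what "put it in post-processing" can look like: the skipped torch.pow can simply be applied to the raw camera-map output after the exported model has run. A minimal sketch, assuming cam_maps is the raw array returned by the ONNX/TensorRT runtime with the scale in channel 0 (names and shapes as described above, not verified against every export; if the camera channels are packed into the parameter maps, apply it to channel 0 of that tensor instead):

import numpy as np

def postprocess_cam_maps(cam_maps: np.ndarray) -> np.ndarray:
    # Apply the operation that was removed from the graph: scale channel -> 1.1 ** value
    cam_maps = cam_maps.copy()
    cam_maps[:, 0] = np.power(1.1, cam_maps[:, 0])
    return cam_maps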
Thanks a lot! I will try :)
@Arthur151
Finally, so verts+cam_trans (without proj_mat) is actually not in the true camera coordinate system, right?
If I still want to project the GT mesh onto the image while keeping the rendering code the same as yours, instead of GT_Mesh + its own GT transl, what should I use to replace verts = verts + cam_trans? Is there a way to do so, or should I change the FOV camera?
verts+cam_trans is in our predefined camera space.
You can use estimate_translation to convert it back from GT to our camera space. I suggest using our new renderer in simple-romp, which is much better.
It is defined in https://github.com/Arthur151/ROMP/blob/master/simple_romp/vis_human/main.py
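One way to turn that into a recipe for the GT-mesh question above (a sketch of the idea, not code from the repo; it reuses the illustrative estimate_translation_sketch from earlier in this thread): root-align the GT mesh, project it with the GT camera to get 2D reference points, then solve for a translation under the predefined rendering camera and hand gt_verts_root_aligned + trans to the unchanged rendering path.

import numpy as np

def gt_mesh_to_predefined_camera(gt_verts, gt_root, K_gt, focal_pre, img_size=512.0):
    # gt_verts: (N, 3) GT mesh in the GT camera coordinate system
    # gt_root:  (3,)   GT pelvis/root position; K_gt: (3, 3) GT intrinsics
    # focal_pre: focal length of the predefined rendering camera (illustrative parameter)

    # 1) root-align the GT mesh, mirroring how the predicted mesh is defined
    verts_aligned = gt_verts - gt_root

    # 2) project the GT mesh with its own camera to get 2D reference points (pixels)
    proj = gt_verts @ K_gt.T
    pts_2d = proj[:, :2] / proj[:, 2:3]

    # 3) solve for the translation under the predefined camera that reproduces pts_2d
    #    (estimate_translation_sketch is the illustrative least-squares solver above)
    trans = estimate_translation_sketch(verts_aligned, pts_2d,
                                        focal_length=focal_pre, img_size=img_size)

    # 4) these vertices can now go through the unchanged rendering path
    return verts_aligned + trans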
Yes, it is possible, please see here.
If you don't subtract pelvis_position, you get the absolute position in space.
Thank you for your guidance! I tried it the way you described, and the root node's position is now the absolute spatial position, but the root node stays locked in place and doesn't move along. Like this: in the original video he walks to the right along the arrow, but in Blender he is locked at the root node.
Got it ! Thanks so much for clarification~
Hello, thank you very much for your answer! I would like to ask a few more questions; my understanding may be wrong in places:
1. If you need depth, please use our newly open-sourced BEV, which adopts a predefined perspective-projection scheme and outputs better depth information. 2. No retraining is needed. Please see my discussion with Zhengdi above, which already explains how to obtain the 3D translation in the corresponding camera space from known camera intrinsics. If anything is unclear, feel free to ask me.
Thank you for the answer, I'll give it a try! After using your ROMP I found that it also gives depth information; is that computed from the scale?~
@Arthur151 Hey! I'm confused. Why don't I have the 'cam_trans' that you are talking about? Is it equivalent to 'cam'?
>>> data[0][0].keys()
dict_keys(['params', 'centers_pred', 'centers_conf', 'verts', 'joints', 'smpl_face'])
>>> data[0][0]['params'].keys()
dict_keys(['cam', 'global_orient', 'body_pose', 'betas', 'poses'])
Hi, @sylyt62
Could you please provide the code/command that you run to get the results?
I followed the instruction:
romp --mode=video --calc_smpl --render_mesh --input=.\demo\videos\camela1.mp4 --save_path=.\demo\videos\camela1_virtual2\results.mp4
pip install --upgrade simple-romp
This could be missing from an old version before 0.1.0.
Indeed! I got it, thx~
Hi, I have a simple question about ROMP. I have been struggling with putting people into their correct relative positions, but is it really possible using the root-aligned SMPL meshes without predicting their transl? (And if we have the camera parameters K, would it be possible?)
What is the coordinate system of the vertices that are used for rendering? I think we are predicting points in the camera coordinate system but root-aligned, correct?
Following Q1, before rendering verts onto the image, there is a trans added to verts ('cam_trans' in projection.py). What is it? And what is estimate_translation actually doing? Is it estimating the root's position? https://github.com/Arthur151/ROMP/blob/e30b7d17f13089fa9fa114df494192e31b0f43ed/romp/lib/visualization/visualization.py#L61
I tried to replace the verts + trans in Q2 with the GT mesh, so verts = GT_verts, without any other changes to your code, but the results are not correct. I expected it to fully match the person in the image, but there are always shifts, and I also can't use the same FOV, otherwise the mesh ends up very small on the image.
Sorry if I have misunderstood anything. I think rendering is the final part of your code I don't understand. Looking forward to your answer!
Zhengdi