chungyiweng / humannerf

HumanNeRF turns a monocular video of moving people into a 360 free-viewpoint video.

From ROMP output to metadata.json #91

Closed Feez2403 closed 5 months ago

Feez2403 commented 5 months ago

@Dipankar1997161 I saw you successfully trained HumanNeRF using the ROMP output format and helped some people with it.

I'm trying to train HumanNeRF on an in-the-wild monocular video. [image: frame 000114]

There are multiple people in the picture, but I will focus on just one person, so I segment the frames to get the corresponding masks. [image: frame 000114]

I already have the camera intrinsic matrix of the calibrated camera:

"cam_intrinsics": [
            [
                1552.2911465495156,
                0.0,
                931.7092001511921
            ],
            [
                0.0,
                1689.001362380511,
                467.30436475747547
            ],
            [
                0.0,
                0.0,
                1.0
            ]
        ],

I used ROMP on the masked images to get the pose estimation:

romp --mode=video --calc_smpl --render_mesh -i=<input> -o=<output>

[image: frame 000141]

It outputs an npz file for each picture with these keys: ['cam', 'global_orient', 'body_pose', 'smpl_betas', 'smpl_thetas', 'center_preds', 'center_confs', 'cam_trans', 'verts', 'joints', 'pj2d_org']
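
For reference, loading one of those per-frame files can be done like this (paths are illustrative, and the 'results' fallback is an assumption, since depending on the simple-romp version the dict may be stored under a single pickled 'results' key):

import numpy as np

# Load one per-frame ROMP output; allow_pickle covers the case where the
# result dict is stored as a pickled object under 'results'.
raw = np.load("romp_output/000001.npz", allow_pickle=True)
romp_out = raw["results"][()] if "results" in raw else raw

# Every value is batched over detected people; index 0 picks the first person.
print(romp_out["smpl_thetas"].shape)  # e.g. (1, 72) axis-angle pose (incl. global orient)
print(romp_out["smpl_betas"].shape)   # e.g. (1, 10) SMPL shape coefficients
print(romp_out["cam_trans"].shape)    # e.g. (1, 3)  camera-space translation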

I then create the metadata.json file like this, taking only the "smpl_thetas" and "smpl_betas" fields:

"000001": {
        "poses":  romp_out["smpl_thetas"][0],
        "betas": romp_out["smpl_betas"][0],
        "cam_intrinsics": [
            [
                1552.2911465495156,
                0.0,
                931.7092001511921
            ],
            [
                0.0,
                1689.001362380511,
                467.30436475747547
            ],
            [
                0.0,
                0.0,
                1.0
            ]
        ],
        "cam_extrinsics":  np.eye(4)  (If I understood corectly , ROMP world origin is the camera origin)
    },
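
For anyone reproducing this step, here is a minimal sketch of building such an entry for every frame; the directory names are placeholders and the 'results' fallback is an assumption about how the .npz files were saved:

import glob
import json
import os

import numpy as np

# The calibrated intrinsics quoted above; extrinsics are identity because the
# ROMP/camera frame is treated as the world frame in this first attempt.
K = [[1552.2911465495156, 0.0, 931.7092001511921],
     [0.0, 1689.001362380511, 467.30436475747547],
     [0.0, 0.0, 1.0]]

metadata = {}
for npz_path in sorted(glob.glob("romp_output/*.npz")):
    frame_name = os.path.splitext(os.path.basename(npz_path))[0]  # e.g. "000001"
    raw = np.load(npz_path, allow_pickle=True)
    romp_out = raw["results"][()] if "results" in raw else raw

    metadata[frame_name] = {
        "poses": romp_out["smpl_thetas"][0].tolist(),  # 72-dim axis-angle pose
        "betas": romp_out["smpl_betas"][0].tolist(),   # 10-dim shape coefficients
        "cam_intrinsics": K,
        "cam_extrinsics": np.eye(4).tolist(),
    }

with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=4)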

I have some questions that I can't figure out after looking at the code:

Feez2403 commented 5 months ago

I managed to make the preprocessing work with square images.

For those who got stuck on metadata.json, here is what I did:

If you have the original extrinsics and intrinsics, they do not matter here. ROMP uses its own camera parameters and predicts shapes under its own camera model. "cam_trans" is the translation from the SMPL body to the camera, and HumanNeRF uses it to recover the body translation relative to the camera. The rotation from camera to SMPL is already contained in the first rotation of the "poses" parameter (the global orientation). Therefore, if you also use 512x512 images, you should use:

"poses":  romp_out["smpl_thetas"][0],
"betas": romp_out["smpl_betas"][0],
"cam_intrinsics": [
    [
        443.4,
        0.0,
        256.0
    ],
    [
        0.0,
        443.4,
        256.0
    ],
    [
        0.0,
        0.0,
        1.0
    ]
],
"cam_extrinsics": [
    [
        1.0,
        0.0,
        0.0,
        romp_out["cam_trans"][0][0]
    ],
    [
        0.0,
        1.0,
        0.0,
        romp_out["cam_trans"][0][1]
    ],
    [
        0.0,
        0.0,
        1.0,
        romp_out["cam_trans"][0][2]
    ],
    [
        0.0,
        0.0,
        0.0,
        1.0
    ]
]
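
For completeness, here is a sketch of assembling these two matrices in Python (the helper name is illustrative). The 443.4 px focal length matches ROMP's default field of view of roughly 60 degrees on a 512 px crop, since 256 / tan(30°) ≈ 443.4.

import numpy as np

# ROMP's default camera for 512x512 crops: ~60 degree FOV, principal point at the center.
K_512 = np.array([[443.4,   0.0, 256.0],
                  [  0.0, 443.4, 256.0],
                  [  0.0,   0.0,   1.0]])

def extrinsics_from_cam_trans(cam_trans):
    # Identity rotation (the global body rotation already lives in "poses"),
    # with ROMP's camera-space translation in the last column.
    E = np.eye(4)
    E[:3, 3] = cam_trans  # e.g. romp_out["cam_trans"][0], shape (3,)
    return E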