chungyiweng / humannerf

HumanNeRF turns a monocular video of moving people into a 360 free-viewpoint video.

From ROMP output to metadata.json #91

Closed Feez2403 closed 5 months ago

Feez2403 commented 5 months ago

@Dipankar1997161 I saw you successfully trained HumanNeRF using the ROMP output format and helped some people with it.

I'm trying to train HumanNeRF on an in-the-wild monocular video. [image: frame 000114]

There are multiple people in the picture, but I will focus on just one person, so I segment the frames to get the corresponding masks. [image: frame 000114]

I already have the camera intrinsic matrix of the calibrated camera:

"cam_intrinsics": [
            [
                1552.2911465495156,
                0.0,
                931.7092001511921
            ],
            [
                0.0,
                1689.001362380511,
                467.30436475747547
            ],
            [
                0.0,
                0.0,
                1.0
            ]
        ],

I used ROMP on the masked images to get the pose estimation:

romp --mode=video --calc_smpl --render_mesh -i=<input> -o=<output>

[image: frame 000141]

It outputs an npz file for each picture with these keys: ['cam', 'global_orient', 'body_pose', 'smpl_betas', 'smpl_thetas', 'center_preds', 'center_confs', 'cam_trans', 'verts', 'joints', 'pj2d_org']
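
For reference, loading one of those per-frame files can be done like this (paths are illustrative, and the 'results' fallback is an assumption, since depending on the simple-romp version the dict may be stored under a single pickled 'results' key):

import numpy as np

# Load one per-frame ROMP output; allow_pickle covers the case where the
# result dict is stored as a pickled object under 'results'.
raw = np.load("romp_output/000001.npz", allow_pickle=True)
romp_out = raw["results"][()] if "results" in raw else raw

# Every value is batched over detected people; index 0 picks the first person.
print(romp_out["smpl_thetas"].shape)  # e.g. (1, 72) axis-angle pose (incl. global orient)
print(romp_out["smpl_betas"].shape)   # e.g. (1, 10) SMPL shape coefficients
print(romp_out["cam_trans"].shape)    # e.g. (1, 3)  camera-space translation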

I then create the metadata.json file like this, taking only the "smpl_thetas" and "smpl_betas" fields:

"000001": {
        "poses":  romp_out["smpl_thetas"][0],
        "betas": romp_out["smpl_betas"][0],
        "cam_intrinsics": [
            [
                1552.2911465495156,
                0.0,
                931.7092001511921
            ],
            [
                0.0,
                1689.001362380511,
                467.30436475747547
            ],
            [
                0.0,
                0.0,
                1.0
            ]
        ],
        "cam_extrinsics":  np.eye(4)  (If I understood corectly , ROMP world origin is the camera origin)
    },
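
For anyone reproducing this step, here is a minimal sketch of building such an entry for every frame; the directory names are placeholders and the 'results' fallback is an assumption about how the .npz files were saved:

import glob
import json
import os

import numpy as np

# The calibrated intrinsics quoted above; extrinsics are identity because the
# ROMP/camera frame is treated as the world frame in this first attempt.
K = [[1552.2911465495156, 0.0, 931.7092001511921],
     [0.0, 1689.001362380511, 467.30436475747547],
     [0.0, 0.0, 1.0]]

metadata = {}
for npz_path in sorted(glob.glob("romp_output/*.npz")):
    frame_name = os.path.splitext(os.path.basename(npz_path))[0]  # e.g. "000001"
    raw = np.load(npz_path, allow_pickle=True)
    romp_out = raw["results"][()] if "results" in raw else raw

    metadata[frame_name] = {
        "poses": romp_out["smpl_thetas"][0].tolist(),  # 72-dim axis-angle pose
        "betas": romp_out["smpl_betas"][0].tolist(),   # 10-dim shape coefficients
        "cam_intrinsics": K,
        "cam_extrinsics": np.eye(4).tolist(),
    }

with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=4)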

I have some questions that I can't figure out after looking at the code:

Feez2403 commented 5 months ago

I managed to make the preprocessing work with square images.

For those who got stuck on metadata.json, here is what I did:

If you have the original extrinsics and intrinsics, they do not matter here. ROMP uses its own camera parameters and predicts shapes under its own camera model. "cam_trans" is the translation from the SMPL body to the camera, and HumanNeRF uses it to recover the body translation relative to the camera. The rotation from camera to SMPL is already contained in the first rotation of the "poses" parameter (the global orientation). Therefore, if you also use 512x512 images, you should use:

"poses":  romp_out["smpl_thetas"][0],
"betas": romp_out["smpl_betas"][0],
"cam_intrinsics": [
    [
        443.4,
        0.0,
        256.0
    ],
    [
        0.0,
        443.4,
        256.0
    ],
    [
        0.0,
        0.0,
        1.0
    ]
],
"cam_extrinsics": [
    [
        1.0,
        0.0,
        0.0,
        romp_out["cam_trans"][0][0]
    ],
    [
        0.0,
        1.0,
        0.0,
        romp_out["cam_trans"][0][1]
    ],
    [
        0.0,
        0.0,
        1.0,
        romp_out["cam_trans"][0][2]
    ],
    [
        0.0,
        0.0,
        0.0,
        1.0
    ]
]
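
For completeness, here is a sketch of assembling these two matrices in Python (the helper name is illustrative). The 443.4 px focal length matches ROMP's default field of view of roughly 60 degrees on a 512 px crop, since 256 / tan(30°) ≈ 443.4.

import numpy as np

# ROMP's default camera for 512x512 crops: ~60 degree FOV, principal point at the center.
K_512 = np.array([[443.4,   0.0, 256.0],
                  [  0.0, 443.4, 256.0],
                  [  0.0,   0.0,   1.0]])

def extrinsics_from_cam_trans(cam_trans):
    # Identity rotation (the global body rotation already lives in "poses"),
    # with ROMP's camera-space translation in the last column.
    E = np.eye(4)
    E[:3, 3] = cam_trans  # e.g. romp_out["cam_trans"][0], shape (3,)
    return E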