chungyiweng / humannerf

HumanNeRF turns a monocular video of moving people into a 360 free-viewpoint video.
MIT License

Problems in the coordinate system conversion process #52

Closed Miles629 closed 1 year ago

Miles629 commented 1 year ago

Hi @chungyiweng, the results of this work are really amazing! I'm very interested in this work, so I prepared a custom video. I extracted the frames and masks successfully and estimated the SMPL parameters, intrinsics, and extrinsics with ROMP. Finally, I ran the wild training process successfully after many failures (haha).

I noticed that the main cause of my failures was incorrect intrinsics and extrinsics, so I compared the two preparation scripts (prepare_dataset.py for wild and for zjumocap). I found some differences and have a few questions.

In zjumocap, K, R, D, and T can be read directly from the dataset. The cameras are calibrated with a chessboard, so the camera parameters are stable. Rh and Th can also be read directly from the dataset; they represent the global orientation (global_orient) and the position of the human in world coordinates. (This is my personal understanding, please correct me if anything is wrong.)

In wild, the intrinsics and extrinsics are estimated together with the SMPL parameters (I use ROMP), so the camera parameters are NOT stable. I chatted with the author of ROMP and learned that the output camtrans is the position of the person in camera space. I use camtrans as T, and R is the identity [[1 0 0], [0 1 0], [0 0 1]]. Rh = poses[:3].copy(), which is the global_orient, the same as Rh in zjumocap. But Th is:

pelvis_pos = tpose_joints[0].copy()  # root (pelvis) joint position in the canonical T-pose
Th = pelvis_pos

So Th here seems to be the coordinates of the root joint in the canonical pose, which is different from Th in zjumocap (my understanding might be wrong, please point it out if so).
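Putting this together, the per-frame assembly I use looks roughly like the sketch below. The input names (cam_trans, poses, tpose_joints) are just what my own ROMP/SMPL pipeline produces, not anything defined by humannerf:

import numpy as np

def build_wild_camera_and_pose(cam_trans, poses, tpose_joints):
    # cam_trans    : (3,) person translation in camera space (ROMP output)
    # poses        : (72,) SMPL pose parameters; poses[:3] is the global_orient
    # tpose_joints : (24, 3) SMPL joints in the canonical T-pose
    E = np.eye(4)
    E[:3, 3] = cam_trans          # R stays identity; T is the camera-space translation
    Rh = poses[:3].copy()         # global orientation (axis-angle), same role as Rh in zjumocap
    Th = tpose_joints[0].copy()   # pelvis (root) joint position in the T-pose
    return E, Rh, Th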

I'm curious how inputs with two such different meanings end up being handled by the same code.

I checked the code: apply_global_tfm_to_camera converts E from world-to-camera into SMPL-to-camera. That is easy to understand for zjumocap, but I can hardly see how it works when Th and the extrinsic T mean something different from zjumocap. By the way, could you explain rays_intersect_3d_bbox in more detail? I am not sure about the meaning of nominator, d_intersect, p_intersect, and p_intervals.
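For reference, my reading of apply_global_tfm_to_camera is roughly the following (paraphrased, so it may not match the repo exactly): it builds the world-to-canonical transform from (Rh, Th) and right-multiplies E by its inverse, so the result maps canonical SMPL coordinates to camera coordinates.

import cv2
import numpy as np

def apply_global_tfm_to_camera(E, Rh, Th):
    # E : (4, 4) world-to-camera extrinsic
    # Rh: (3,) axis-angle global orientation of the body
    # Th: (3,) global translation of the body
    global_tfms = np.eye(4)
    global_rot = cv2.Rodrigues(Rh)[0].T            # world -> canonical rotation
    global_tfms[:3, :3] = global_rot
    global_tfms[:3, 3] = -global_rot.dot(Th)       # world -> canonical translation
    return E.dot(np.linalg.inv(global_tfms))       # canonical -> world -> camera

And for rays_intersect_3d_bbox, here is my guess at what those variables mean, written as comments on a simplified version of the ray/box (slab) intersection; please correct me if I misread it:

import numpy as np

def rays_intersect_3d_bbox(bounds, ray_o, ray_d):
    # bounds: (2, 3) min/max corners of the box; ray_o, ray_d: (N, 3)
    ray_d = ray_d.copy()
    ray_d[np.abs(ray_d) < 1e-5] = 1e-5                  # avoid division by zero

    # nominator: numerator of t = (plane_coordinate - ray_origin) / ray_direction
    nominator = bounds[None] - ray_o[:, None]           # (N, 2, 3)
    # d_intersect: parametric distances t at which each ray crosses the 6 box planes
    d_intersect = (nominator / ray_d[:, None]).reshape(-1, 6)
    # p_intersect: the 6 candidate intersection points per ray, o + t * d
    p_intersect = d_intersect[..., None] * ray_d[:, None] + ray_o[:, None]

    # keep only crossings that actually lie on the box surface
    min_x, min_y, min_z, max_x, max_y, max_z = bounds.ravel()
    eps = 1e-6
    on_box = ((p_intersect[..., 0] >= min_x - eps) & (p_intersect[..., 0] <= max_x + eps) &
              (p_intersect[..., 1] >= min_y - eps) & (p_intersect[..., 1] <= max_y + eps) &
              (p_intersect[..., 2] >= min_z - eps) & (p_intersect[..., 2] <= max_z + eps))
    # a ray that passes through the box crosses its surface exactly twice (entry + exit)
    mask_at_box = on_box.sum(-1) == 2
    # p_intervals: the entry/exit point pairs for the rays that hit the box
    p_intervals = p_intersect[mask_at_box][on_box[mask_at_box]].reshape(-1, 2, 3)

    # near/far = distances from the origin to the entry/exit points along the ray
    norm_d = np.linalg.norm(ray_d[mask_at_box], axis=-1)
    d0 = np.linalg.norm(p_intervals[:, 0] - ray_o[mask_at_box], axis=-1) / norm_d
    d1 = np.linalg.norm(p_intervals[:, 1] - ray_o[mask_at_box], axis=-1) / norm_d
    near, far = np.minimum(d0, d1), np.maximum(d0, d1)
    return near, far, mask_at_box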

It seems that my question is a bit long. Thank you for reading it. Please correct me if there is any mistake in my understanding. I am looking forward to your reply.

xiexh20 commented 1 year ago

Hi, I am also facing the problem of converting ROMP output. How did you find the camera intrinsics? As far as I know, ROMP only outputs 3 values for the camera, which is a weak perspective model. But here a perspective model is required.

Dipankar1997161 commented 1 year ago

Hi, I am also facing the problem of converting ROMP output. How did you find the camera intrinsics? As far as I know, ROMP only outputs 3 values for the camera, which is a weak perspective model. But here a perspective model is required.

Hey, I asked this question to Arthur in the ROMP repo and he gave me the following suggestion: https://github.com/Arthur151/ROMP/issues/421#issue-1600146589

Try this. Let me know if you need anything else
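If it helps, the usual idea is to assume a fixed field of view for the perspective camera and derive the focal length from the image size. A minimal sketch is below; the 60-degree default is my assumption (I believe it is close to ROMP's default), so check the linked issue for Arthur's exact advice:

import numpy as np

def intrinsics_from_fov(H, W, fov_deg=60.0):
    # Pinhole intrinsics for an H x W video, assuming the given field of view
    focal = max(H, W) / (2.0 * np.tan(np.radians(fov_deg) / 2.0))
    K = np.array([[focal, 0.0,   W / 2.0],
                  [0.0,   focal, H / 2.0],
                  [0.0,   0.0,   1.0]])
    return K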

Miles629 commented 1 year ago

Apologies for the delayed response. I have resolved the issues and successfully run humannerf, albeit with unsatisfactory results. ROMP struggles to produce accurate frame-by-frame SMPL estimates on my videos, where the person is partly occluded by bullet-screen (danmaku) comments, so the outputs are jittery.

Dipankar1997161 commented 1 year ago

Apologies for the delayed response. I have resolved the issues and successfully run humannerf, albeit with unsatisfactory results. ROMP struggles to produce accurate frame-by-frame SMPL estimates on my videos, where the person is partly occluded by bullet-screen (danmaku) comments, so the outputs are jittery.

Could you tell me the camera parameters you used? HumanNeRF requires proper camera intrinsics and extrinsics, but ROMP only provides weak-perspective camera values.