ShenhanQian / VHAP

A complete head tracking pipeline from videos to NeRF/3DGS-ready datasets.

Issues aligning RealityCapture camera parameters with FLAME head model for VHAP calibration #14

Closed Jp-17 closed 11 minutes ago

Jp-17 commented 16 hours ago

Background: Continuing the discussion from the previous issue (https://github.com/ShenhanQian/VHAP/issues/7#issuecomment-2402498210), we have collected our own multi-view human data from 16 camera angles. We are currently using RealityCapture to estimate camera parameters and intend to use these parameters for VHAP calibration.

Problem encountered: The camera parameters obtained from RealityCapture cannot be directly used in the VHAP calibration process.

Exploration of causes:

  1. First, using RealityCapture for camera parameter estimation, the overall camera array layout appears relatively accurate (as shown in Figures 1 and 2, which show the reconstructed human figures and camera arrangements from single-frame images of the 16 viewpoints of nersemble_306 and of our own captured subject). Therefore, I believe the camera parameters obtained from RealityCapture should, in principle, be usable.

  2. However, when directly using the camera parameters obtained from RealityCapture for VHAP calibration, the FLAME landmarks converge together (Figure 3, red points). Scaling the camera extrinsic translations to 1/100 or 1/200 of their original values partially alleviates this (Figure 4), but still doesn't match the ground truth: the resulting scale is either too small or too large, respectively.

  3. I further examined the point cloud generated by RealityCapture when reconstructing nersemble_306 (Figure 5) and compared it with the nersemble_306 point cloud previously obtained from gaussian_avatars training (Figure 6). The xyz values of the two are clearly not on the same scale: the former's values are larger by more than a factor of one hundred (the point cloud from gaussian_avatars training is on a similar scale to the FLAME vertices, both roughly between -1 and 1).

  4. Apart from the scale issue, plotting the point clouds from step 3 shows that the orientations of the two nersemble_306 point clouds in the world coordinate system also differ (Figure 7).

  5. To examine the phenomenon in step 4 further, I plotted the FLAME head points together with both camera arrays (Figure 8, where the translations of the RealityCapture camera parameters on the right have been divided by 200). Indeed, although the FLAME head is the same, the world coordinates of the camera array estimated by RealityCapture are not aligned with the FLAME head (see Figure 9, where the FLAME face looks along the positive z-axis and the top of the head points along the positive y-axis), whereas the facial orientation estimated by RealityCapture is not aligned with the z-axis.

Current speculations and areas of confusion:

a. Summarizing the above observations: the scale and world coordinate system (origin and xyz axis directions) of the camera parameters obtained from RealityCapture are inconsistent with the FLAME head. However, the camera parameters provided by the NeRSemble dataset fit FLAME directly. How is this achieved in the NeRSemble dataset?

b. In practice, I could manually align the z-axis by rotating the human figure reconstructed by RealityCapture so that it faces the front view directly, and divide the camera translations by a scale factor. However, this method is inefficient. I'd like to hear your opinion on how the NeRSemble dataset construction and the VHAP pipeline achieve good alignment with the FLAME head.

Thank you!

Figures 1-9: attached images.

ShenhanQian commented 8 hours ago

Hi, thanks for sharing the detailed analysis. The problem lies in the scaling factor and the global orientation of the world space.

  1. Since you do not calibrate with a checkerboard but directly run SfM on multi-view images, the camera translations and point cloud are only determined up to scale, meaning the program can rescale these values together by an arbitrary factor without breaking geometric consistency. So I was wrong to suggest a 100x or 1000x scaling, because the unit of SfM does not correspond to any physical unit such as mm or cm. What you can do is optimize for this scaling factor during the rigid alignment stage with landmarks: initialize the scaling factor to 1 (i.e., exp(0)), then use it to scale the camera translations before rasterizing landmarks. The landmark loss will then help you recover the right scale (see the first sketch after this list).

  2. In a similar spirit to the up-to-scale problem, the placement of the cameras is also arbitrary, meaning the program can rotate and translate the cameras and point cloud as a whole into an arbitrary space without breaking multi-view consistency. This leads to the difference in your Fig. 8. The solution is simple. In the coordinate space of FLAME, assuming we have an OpenGL camera always looking at FLAME (the origin), what orientation should it have if we want this camera to view FLAME from the front? It should be "x-right, y-up, z-backward". If not, we should apply a global rotation to the camera to fix this. For multiple cameras, we can do the same thing with the averaged orientation, which is exactly what we do in this function (see the second sketch below). Note that if you have not converted your cameras to the OpenGL convention, the target orientation should change according to your current convention. That's why this function takes in 'target-convention' as an argument.
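A minimal sketch of the scale-factor idea from point 1, not VHAP's actual implementation. All names and shapes below (`R`, `t`, `K`, `flame_lmk3d`, `gt_lmk2d`, the landmark count) are placeholder assumptions standing in for your real SfM output and detected landmarks, and the projection is a plain pinhole model:

```python
import torch

# Placeholder data -- replace with your RealityCapture/SfM output and FLAME landmarks.
V, N = 16, 68                                   # 16 views, 68 facial landmarks
R = torch.eye(3).repeat(V, 1, 1)                # [V, 3, 3] world-to-camera rotations
t = torch.randn(V, 3)                           # [V, 3] SfM translations (arbitrary scale)
K = torch.eye(3).repeat(V, 1, 1)                # [V, 3, 3] pinhole intrinsics
flame_lmk3d = torch.randn(N, 3)                 # FLAME 3D landmarks in world space
gt_lmk2d = torch.randn(V, N, 2)                 # detected 2D landmarks per view

log_scale = torch.zeros(1, requires_grad=True)  # exp(0) = 1 at initialization
opt = torch.optim.Adam([log_scale], lr=1e-2)

for _ in range(500):
    t_scaled = t * log_scale.exp()                                         # rescale translations only
    cam = torch.einsum('vij,nj->vni', R, flame_lmk3d) + t_scaled[:, None]  # world -> camera space
    uvw = torch.einsum('vij,vnj->vni', K, cam)                             # apply intrinsics
    lmk2d = uvw[..., :2] / uvw[..., 2:3]                                   # perspective divide
    loss = (lmk2d - gt_lmk2d).abs().mean()                                 # 2D landmark loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In practice the log-scale would be optimized jointly with the FLAME pose parameters during the rigid alignment stage, so the landmark loss can trade off scale against head pose.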
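And a minimal sketch of the global-rotation idea from point 2 (again not the actual VHAP function): compute the average camera-to-world orientation, then rotate all cameras as a whole so that this average matches the target OpenGL orientation in FLAME space. `c2w` is an assumed array of camera-to-world extrinsics.

```python
import numpy as np

def closest_rotation(M):
    # Project an arbitrary 3x3 matrix onto SO(3) (orthogonal Procrustes via SVD).
    U, _, Vt = np.linalg.svd(M)
    R = U @ Vt
    if np.linalg.det(R) < 0:                    # avoid reflections
        R = U @ np.diag([1.0, 1.0, -1.0]) @ Vt
    return R

c2w = np.stack([np.eye(4)] * 16)                # [16, 4, 4] placeholder camera-to-world matrices

# Average camera-to-world orientation, projected back onto a valid rotation.
R_avg = closest_rotation(c2w[:, :3, :3].mean(axis=0))

# Target orientation for the averaged camera in FLAME space: with FLAME facing +z
# and an OpenGL camera (x-right, y-up, z-backward) in front of the face, the camera
# axes coincide with the world axes, i.e. the identity.
R_target = np.eye(3)

R_global = R_target @ R_avg.T                   # rotation that maps R_avg onto R_target

c2w_aligned = c2w.copy()
c2w_aligned[:, :3, :3] = R_global @ c2w[:, :3, :3]   # rotate camera orientations
c2w_aligned[:, :3, 3] = c2w[:, :3, 3] @ R_global.T   # rotate camera centers too
```

The same `R_global` should also be applied to the SfM point cloud if you keep it, so cameras and points stay consistent.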

Jp-17 commented 4 hours ago

Very clear, thank you, but I have a few more small questions.

  1. I understand now. By the way, are the camera extrinsic translation parameters in the NeRSemble dataset and the vertex coordinates in the FLAME model both in meters?

2-1. "For multiple cameras, we can do the same thing to the averaged orientations" - this assumes that all cameras are roughly symmetrically distributed, right? That way, the averaged camera would be close to directly facing the face.

2-2. In an actual capture (and among the 16 cameras of the NeRSemble dataset), there is one camera angle directly facing the subject's face. So, would it be sufficient to rotate this camera's view to meet the FLAME model's requirement (an OpenGL camera looking at the origin, "x-right, y-up, z-backward"), and then apply the same rotation to the other cameras?

ShenhanQian commented 4 hours ago

1. Yes.

2-1. Yes. Otherwise, you can manually select the middle camera you like.

2-2. Yes. That's exactly what I meant above by "manually select the middle camera".
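If you anchor on that front-facing camera instead of the average (as in question 2-2), the earlier sketch applies almost unchanged; `front_idx` below is a hypothetical index for that camera:

```python
front_idx = 8                                   # index of the camera facing the subject head-on
R_global = R_target @ c2w[front_idx, :3, :3].T  # send that camera's orientation to the target
# ...then apply R_global to every camera (and the point cloud) exactly as above.
```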

Jp-17 commented 3 hours ago

Thank you very much! I learned a lot from you!

ShenhanQian commented 3 hours ago

Glad to share what I have learnt from others :D