YuanxunLu / LiveSpeechPortraits

Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation (SIGGRAPH Asia 2021)
MIT License
1.16k stars 200 forks

About data preprocessing #9

Closed X-niper closed 2 years ago

X-niper commented 2 years ago

Impressive job!

I wonder how to preprocess the images. Specifically, could you please share the scripts for choosing the four candidate images from the sequences, and explain how to draw the shoulder edges, since the landmark detectors I have found only handle facial landmarks.

Thanks!

YuanxunLu commented 2 years ago

For the data preprocessing, please check section 4.1 in the paper. For the candidate image selection, please check section 3.4. For drawing the shoulder edges, please also check section 4.1. Specifically, shoulder points are not automatically detected: we manually selected them once on the first frame and tracked them using optical flow. Learning-based methods, e.g., RAFT, may work better.
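
A minimal sketch of this kind of tracking with OpenCV's pyramidal Lucas-Kanade flow; the frame paths and the initial point coordinates below are placeholders, not the actual script used for the dataset:

```python
import glob
import cv2
import numpy as np

# Manually picked shoulder points on the first frame (placeholder coordinates),
# float32, shape (N, 1, 2) as expected by calcOpticalFlowPyrLK.
prev_pts = np.array([[[120.0, 400.0]], [[200.0, 420.0]], [[300.0, 425.0]]], dtype=np.float32)

frame_paths = sorted(glob.glob('frames/*.jpg'))  # placeholder path to the extracted frames
prev_gray = cv2.cvtColor(cv2.imread(frame_paths[0]), cv2.COLOR_BGR2GRAY)
tracked = [prev_pts.reshape(-1, 2)]

for path in frame_paths[1:]:
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    # Pyramidal Lucas-Kanade tracking from the previous frame to the current one.
    next_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, prev_pts, None,
                                                   winSize=(21, 21), maxLevel=3)
    tracked.append(next_pts.reshape(-1, 2))
    prev_gray, prev_pts = gray, next_pts

shoulder_points2D = np.stack(tracked)  # (num_frames, N, 2)
```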

Feel free to comment if you have any questions.

DWCTOD commented 2 years ago

Hi, thanks for sharing this work. My understanding of the dataset processing steps: 1. crop the face region; 2. detect and track the facial landmarks; 3. for the shoulders, manually select some points in the first frame and track them with optical flow afterwards; 4. camera calibration. One question about the data processing: when picking the shoulder edges, those should be 2D coordinates, right? Then how is the dataset's "shoulder_points3D" (3D points, I assume) obtained?

About the image synthesis part:
  1. The input images are 13 (1 + 3 × 4) × 512 × 512. What exactly does the 13 mean?
  2. I don't quite understand how the 4 candidate images are selected. Two of them are the frames with the 100th largest/smallest mouth region in the dataset, and the other two are obtained by sampling the remaining frames at uniform intervals of rotation along the x- and y-axes and taking the two closest samples? (I didn't follow how the latter two are actually chosen.) [image]

YuanxunLu commented 2 years ago
  1. Your understanding of the dataset processing is roughly right. The shoulder is modeled as a billboard, whose depth is set to the average depth of the facial landmarks in the training sequence; please check section 3.3 in the paper. Yes, the tracked shoulder points are 2D at first, but we can reconstruct them into 3D space using the billboard assumption and the camera calibration (see the first sketch after this list).
  2. 13 is the channel count of the input feature maps to the renderer: 13 = 1 (edge feature map) + 3 (RGB) × 4 (four candidate images).
  3. The candidate images mainly act as an indicator that tells the renderer about the environment, i.e., the background, as illustrated in the paper. For the first two, we compute the mouth area for all training frames and rank them to choose the desired two. For the latter two, we rank the head rotation angles (from the 3D face reconstruction) about the x- and y-axes and choose the samples closest to uniformly spaced interval points; since two images are needed here, two interval points are chosen and we take the frame closest to each (see the second sketch after this list). By the way, the number of candidate images (4) is just a design choice. You can even train a network without them and find that it still works, as long as your training frames don't involve changing camera parameters.
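
For point 1, a minimal sketch of lifting tracked 2D shoulder points onto the billboard with a pinhole camera model; the intrinsics, the depth value, and the point coordinates below are placeholders, not values from the released dataset.

```python
import numpy as np

def backproject_to_billboard(pts2d, K, depth):
    """Lift 2D pixel points onto a fronto-parallel billboard at a fixed depth.
    pts2d: (N, 2) pixel coordinates, K: (3, 3) intrinsics, depth: scalar
    (e.g., the mean depth of the fitted 3D facial landmarks over the sequence)."""
    homo = np.concatenate([pts2d, np.ones((pts2d.shape[0], 1))], axis=1)  # (N, 3) homogeneous pixels
    rays = (np.linalg.inv(K) @ homo.T).T                                  # normalized camera rays, z = 1
    return rays * depth                                                   # (N, 3) points on the billboard

# Placeholder intrinsics and tracked shoulder points; in practice K comes from the
# camera calibration and the points from the optical-flow tracking step.
K = np.array([[1200., 0., 256.],
              [0., 1200., 256.],
              [0., 0., 1.]])
shoulder_points2D = np.array([[120., 400.], [200., 420.], [300., 425.]])
shoulder_points3D = backproject_to_billboard(shoulder_points2D, K, depth=0.6)
```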
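
For points 2 and 3, a rough, hypothetical sketch of the selection logic and of stacking the 13-channel renderer input; the variable names, the rank of 100, and the placeholder tensors are illustrative only, not the actual training code.

```python
import numpy as np
import torch

T = 5000                                           # number of training frames (placeholder)
mouth_areas = np.random.rand(T)                    # per-frame mouth-region area (placeholder values)
head_yaw = np.random.uniform(-15, 15, T)           # per-frame head yaw from 3D face fitting (placeholder)

# Two candidates with near-extreme mouth openness (the exact rank, e.g. 100th, is a design choice).
order = np.argsort(mouth_areas)
cand_small, cand_large = order[100], order[-100]

def closest_to_interval(angles, k, n_intervals=3):
    """Index of the frame whose rotation angle is closest to the k-th uniform interval point."""
    lo, hi = angles.min(), angles.max()
    target = lo + (hi - lo) * k / n_intervals
    return int(np.argmin(np.abs(angles - target)))

# Two more candidates sampled near uniformly spaced head rotations.
cand_rot1 = closest_to_interval(head_yaw, 1)
cand_rot2 = closest_to_interval(head_yaw, 2)

# Assembling the 13-channel renderer input: 1 edge map + 4 RGB candidate images.
edge_map = torch.zeros(1, 512, 512)                        # drawn edge feature map (placeholder)
candidates = [torch.zeros(3, 512, 512) for _ in range(4)]  # the four selected candidate images (placeholder)
renderer_input = torch.cat([edge_map] + candidates, dim=0) # shape: (13, 512, 512)
```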
DWCTOD commented 2 years ago
Thanks for the reply. When I saw 13 (1 + 3 × 4) × 512 × 512, I thought the whole thing was multiplied by 13 again; it's clear now. One more question: I have tried other methods before, e.g. https://github.com/xinwen-cs/AudioDVP, which uses image-to-image translation. They all show obvious frame-to-frame jitter (because the detected facial landmarks jitter noticeably between frames), and the generated results are unstable (e.g., regions such as teeth and hair keep changing). In particular, when editing something like the conditional feature map here, once the edited feature map falls outside the training dataset, the results become very poor.
YuanxunLu commented 2 years ago

Temporal inconsistency between generated frames is a common issue in image-to-image translation methods, especially when they are applied to generate sequences of frames. The key point is that semantic ambiguity exists in the training dataset. To alleviate this, you should keep the training set semantically consistent. For example, you may need to carefully crop the face region so it is consistent across all frames, smooth the detected facial landmarks to reduce temporal jitter, and add more conditions on frequently changing regions. You can also apply a time-conditional scheme, e.g., feed the previously generated frames to the renderer as an additional condition. In any case, you should carefully check your dataset and design your network scheme accordingly, because it is the ambiguity that causes the issue.
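As one concrete example of the landmark smoothing mentioned above, a simple temporal filter already removes much of the frame-to-frame jitter. This sketch uses SciPy's Savitzky-Golay filter on placeholder landmark data; the window length and polynomial order are illustrative and should be tuned per sequence.

```python
import numpy as np
from scipy.signal import savgol_filter

# landmarks: (T, 68, 2) detected 2D facial landmarks over T frames (placeholder data)
T = 300
landmarks = np.random.rand(T, 68, 2) * 512

# Smooth each coordinate along the time axis; a longer window gives more temporal
# stability at the cost of responsiveness to fast head or mouth motion.
smoothed = savgol_filter(landmarks, window_length=9, polyorder=2, axis=0)
```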

Editing the feature maps far away from the training span will lead to artifacts; that is a common issue for any learning method. You must edit the feature maps using samples within the training span, or the network will generate artifacts. Actually, calling it an issue is not quite appropriate, because it is reasonable that the network only works well when the input lies within the training span, right? After all, networks can't imagine. If you want to edit the feature maps on a large scale, the thorough solution is to train a generalized model on a large enough dataset, and that is a much harder problem. In any case, you must keep some constraints on the inputs; for example, the feature maps should at least look like a human, and the model will fail if you edit the landmarks to look like a dog. Every model has its limits; as the saying goes, garbage in, garbage out.