YuanxunLu / LiveSpeechPortraits

Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation (SIGGRAPH Asia 2021)
MIT License
1.2k stars 209 forks

About the detailed training procedure and code? #3

Closed DWCTOD closed 3 years ago

DWCTOD commented 3 years ago

Hi, sorry to bother you again. I'd like to ask whether you are considering releasing the training-related code later on, as well as a tutorial on TensorRT acceleration. I also have a few questions:

1. Why is the fps set to 60 here? Most videos are 25 fps, so I wonder how much this matters (or, put differently, how should a 25-fps video be adjusted)?
2. Since the 73 pre-defined facial landmarks are used as an intermediate representation, can the output be controlled by editing these landmarks? (Some models generalize poorly; if the edited landmarks are not covered by the training data, the output is not ideal.)

YuanxunLu commented 3 years ago
  1. Releasing the training code is not currently in the plan due to company policy. However, many of the parts needed for training are already included in the repo, e.g., the dataset/loss/model/options/utils files, which should make it easier for you to build the training pipeline yourself. For TensorRT, in brief, you first convert the PyTorch models (.pkl) to ONNX files (.onnx) and then to TensorRT engines (*.trt); a rough conversion sketch is given after this list.
  2. The FPS setting is just a choice. Previous work such as ATVG/NVP uses 25 FPS and MakeItTalk uses 62.5 FPS. Theoretically, a higher FPS captures more speaking detail but also makes training harder, since it requires more precise modeling of short audio windows as well as long-term consistency. It is therefore a trade-off between prediction precision and learning difficulty. If you want to train the model at a different FPS, many settings may need to be changed for the best results.
  3. The landmarks are intermediate representations for the final rendering results, so of course you can edit them to control the final renderings, e.g., head pose/mouth editing (a small editing sketch also follows below). If the edited landmarks fall far outside the span of the training corpus, the model degrades and performance becomes worse -- that is a common issue with learning-based methods. Further (perhaps not directly related to this issue), there is also a trade-off between generalization (one-shot methods, e.g., ATVG/MakeItTalk) and specialization (personalized methods, e.g., NVP/SynthesizingObama). The choice of method depends on your requirements; after all, there is currently no method that does both best.
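
As a rough illustration of the PyTorch → ONNX → TensorRT route in point 1, here is a minimal sketch. The checkpoint path, model loading, and input shape are placeholders (the actual LiveSpeechPortraits checkpoints and network inputs differ), so treat this as a starting point rather than the repo's own export script:

```python
import torch

# Hypothetical export sketch: load a saved PyTorch model and export it to ONNX.
# 'checkpoints/model.pkl' and the dummy input shape are placeholders,
# not the repo's actual files or dimensions.
model = torch.load('checkpoints/model.pkl', map_location='cpu')
model.eval()

dummy_input = torch.randn(1, 3, 512, 512)  # replace with the model's real input shape

torch.onnx.export(
    model, dummy_input, 'model.onnx',
    input_names=['input'], output_names=['output'],
    opset_version=11,
)

# The ONNX file can then be converted to a TensorRT engine, e.g. with trtexec:
#   trtexec --onnx=model.onnx --saveEngine=model.trt --fp16
```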
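And as a toy illustration of point 3 (editing the intermediate landmarks before rendering), the sketch below nudges the mouth points of one predicted 73-landmark frame. The file name and the mouth index range are assumptions for illustration only, not the repo's actual layout:

```python
import numpy as np

# Hypothetical editing sketch: offset the mouth landmarks of one predicted frame.
# 'pred_landmarks.npy' and MOUTH_IDX are placeholders; the real 73-point layout
# and index ranges used in the repo may differ.
pred_landmarks = np.load('pred_landmarks.npy')   # assumed shape: (73, 2)

MOUTH_IDX = slice(46, 64)                        # assumed mouth indices, not verified

edited = pred_landmarks.copy()
edited[MOUTH_IDX, 1] += 2.0  # push the mouth points down slightly, opening the mouth

# 'edited' would then be rasterized into the landmark sketch image and fed to the
# image-to-image renderer, exactly as the unedited predictions are.
```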
DWCTOD commented 3 years ago

Got it, thanks for the reply. And thanks again for open-sourcing this excellent work.