fudan-generative-vision / hallo

Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation
https://fudan-generative-vision.github.io/hallo/
MIT License

Artifacts and Lip Sync Issues in Model Training - Seeking Advice from Authors #170

Open skywalker00001 opened 3 months ago

skywalker00001 commented 3 months ago

https://github.com/user-attachments/assets/8e33dcd4-de7a-4ce9-8fbc-ac74c2494fa7

https://github.com/user-attachments/assets/194cf8f6-6766-42ae-aa32-dd1b3389f02d

Hi everyone, I have been training a talking face model with the Hallo code and have run into several issues I could use advice on. We used a dataset comprising 32 hours of VFHQ and 12 hours of HDTF videos, without performing any data cleaning.

Issue Description:

  1. Background Artifacts: Large "blotchy" artifacts appear in the background, many of them resembling "hands." We suspect this is caused by hand gestures present in the dataset. Did you perform any data cleaning to remove frames containing hands or other unwanted elements? Or, if you trained a version without data cleaning, did you encounter similar issues?
  2. Lip Sync Mismatch: The lip movements do not accurately match the audio. Although our training parameters, steps, and resources match the original code, the audio-lip synchronization is significantly worse than the results from the authors' pre-trained model. Did you use any specific tricks or techniques to improve lip synchronization?
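For issue 1, one hypothetical cleaning pass would drop clips whose frames contain detected hands. A minimal sketch, assuming per-frame hand-detection confidences have already been computed offline (e.g. with an off-the-shelf detector such as MediaPipe Hands); the function name and thresholds are illustrative, not from the Hallo codebase:

```python
from typing import Dict, List

def filter_clips_without_hands(
    clip_scores: Dict[str, List[float]],
    max_hand_conf: float = 0.3,
    max_bad_frame_ratio: float = 0.01,
) -> List[str]:
    """Keep clip ids whose frames are essentially free of detected hands.

    clip_scores maps clip id -> per-frame hand-detection confidence
    (assumed precomputed by an external detector). A frame counts as
    "bad" if its confidence exceeds max_hand_conf; a clip survives only
    if at most max_bad_frame_ratio of its frames are bad.
    """
    kept = []
    for clip_id, scores in clip_scores.items():
        if not scores:  # skip clips with no scored frames
            continue
        bad = sum(1 for s in scores if s > max_hand_conf)
        if bad / len(scores) <= max_bad_frame_ratio:
            kept.append(clip_id)
    return kept
```

Allowing a small ratio of bad frames (rather than requiring zero) keeps clips where the detector fires spuriously on one or two frames.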

Training Details:

  - Model Architecture: Hallo code for talking face generation
  - Dataset: 32 hours of VFHQ + 12 hours of HDTF videos (uncleaned)
  - Training Parameters: Aligned with the parameters provided in the original code

Request for Advice:

  - Has anyone encountered similar background artifacts and lip sync mismatch in talking face models?
  - Are there any recommended data cleaning steps or techniques to mitigate these artifacts and improve lip synchronization?

Any insights or suggestions would be greatly appreciated! Thank you in advance for your help!


progrobe commented 3 months ago

Hi, I also encountered similar issues. Did you find any solution to mitigate these artifacts or improve lip synchronization?

skywalker00001 commented 2 months ago

> hi, i also encountered similar issues like yours. Did you find any solution to mitigate these artifacts or improve lip synchronization?

It seems that the only thing that can be done is to clean the dataset. As long as a high-quality dataset is used, the aforementioned problems will not occur.
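As a concrete example of the cleaning suggested above, one common lip-sync filter is to score each clip with a SyncNet-style audio-visual sync confidence and keep only well-synchronized clips. A minimal sketch, assuming the confidence and audio-video frame offset per clip have already been computed offline; the type, function name, and thresholds are illustrative assumptions:

```python
from typing import List, NamedTuple

class ClipSync(NamedTuple):
    clip_id: str
    sync_conf: float  # SyncNet-style confidence; higher means better sync
    av_offset: int    # audio-video offset in frames; 0 means aligned

def filter_by_sync(
    clips: List[ClipSync],
    min_conf: float = 3.0,
    max_abs_offset: int = 1,
) -> List[str]:
    """Keep clips whose audio-visual sync passes both thresholds."""
    return [
        c.clip_id
        for c in clips
        if c.sync_conf >= min_conf and abs(c.av_offset) <= max_abs_offset
    ]
```

Filtering on both confidence and offset discards clips that are confidently synchronized but shifted by several frames, which would otherwise teach the model a systematic audio-lip delay.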