Chinese audio gives poor results while English works well. Is this related to the training data?

fudan-generative-vision / hallo

Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation

https://fudan-generative-vision.github.io/hallo/

MIT License

Chinese audio gives poor results while English works well. Is this related to the training data? #21

Open: henjicc opened this issue 5 months ago

henjicc commented 5 months ago

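My guess is that the training data was mostly English and that nothing was optimized for Chinese, so Chinese audio currently does not perform very well.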

Chinese test ↓

https://github.com/fudan-generative-vision/hallo/assets/84775360/239d4778-e1f5-43f6-8377-496f8b646e77

English test ↓

https://github.com/fudan-generative-vision/hallo/assets/84775360/23641926-64de-45bb-a0bb-39059c77bac4

AricGamma commented 5 months ago

Yes, the audio needs to be in English.

There are currently a few simple requirements for the input data; see https://github.com/fudan-generative-vision/hallo?tab=readme-ov-file#prepare-inference-data. Some sample data is available at https://github.com/fudan-generative-vision/hallo/tree/main/examples
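
For anyone who wants to sanity-check a driving-audio file before running inference, here is a minimal sketch. It assumes the linked README section asks for WAV audio, and it uses only Python's standard `wave` module. `check_driving_audio` is a hypothetical helper name, not part of hallo, and it cannot tell which language is spoken; that still has to be checked by listening.

```python
import wave
from pathlib import Path


def check_driving_audio(path):
    """Return a list of problems found with a candidate driving-audio file.

    Hypothetical helper, not part of hallo. An empty list means the file
    opened as a PCM WAV with at least one frame. The spoken language is
    NOT checked here.
    """
    problems = []
    p = Path(path)

    # Assumption: the pipeline expects a .wav file.
    if p.suffix.lower() != ".wav":
        problems.append(f"expected a .wav file, got '{p.suffix}'")
        return problems

    try:
        with wave.open(str(p), "rb") as wav:
            if wav.getnframes() == 0:
                problems.append("file contains no audio frames")
    except (wave.Error, EOFError, OSError) as exc:
        # wave.Error: not a valid RIFF/WAVE file; EOFError: truncated file;
        # OSError: missing or unreadable file.
        problems.append(f"could not read as WAV: {exc}")

    return problems
```

An empty result only means the container is readable; clarity of the vocals and the language requirement discussed above are outside what this check covers.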

Galleons2029 commented 4 months ago

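Hello, is there an estimated completion date for the work on improving Mandarin results? And if I want to optimize it myself, is training on video data alone feasible? Thanks!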

DBDXSS commented 2 months ago

@Galleons2029 Hello, building on this project, we have open-sourced weights for a Chinese model. You are welcome to visit https://jdh-algo.github.io/JoyHallo/