BingliangLi opened 11 months ago
Hi, to my knowledge, such a failure mainly stems from a poor correlation between the audio features and the visual appearance. If training goes smoothly, obvious lip movements can be observed in very early epochs.
Chinese speech videos are much more difficult due to the lack of high-quality pretrained audio extractors. For Chinese, we do not suggest DeepSpeech or the HuBERT weight "jonatasgrosman/exp_w2v2t_zh-cn_hubert_s449" in hubert_cn.py. Instead, we found that "facebook/hubert-large-ls960-ft" in hubert.py performs better, even though it was trained on English speech rather than Chinese.
The Obama weights were trained with our provided script, and the Chinese demo shown on our project page was generated with the same hyperparameters but with HuBERT audio features. So you can try the provided HuBERT or Wav2Vec features and see whether the lip movements can be learned successfully.
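For reference, switching to the English checkpoint is mostly a matter of changing the model name when extracting features. Below is a minimal sketch, assuming the Hugging Face `transformers` and `torch` packages; the function name is hypothetical, and the repo's own hubert.py does the equivalent:

```python
# Minimal sketch of HuBERT feature extraction with the suggested checkpoint.
# Not the repo's exact hubert.py; the function name is illustrative.
def extract_hubert_features(wav, sr=16000,
                            model_name="facebook/hubert-large-ls960-ft"):
    """Return frame-level HuBERT hidden states for a 16 kHz mono waveform."""
    # Imports kept local so the sketch can be inspected without
    # torch/transformers installed.
    import torch
    from transformers import HubertModel, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(model_name)
    model = HubertModel.from_pretrained(model_name).eval()
    inputs = processor(wav, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = model(inputs.input_values)
    return out.last_hidden_state  # shape: (1, num_frames, 1024)
```

Swapping in a different checkpoint (e.g. TencentGameMate/chinese-hubert-large, discussed below) only changes `model_name`; the downstream feature shape stays the same for the large models.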
Thank you for the suggestions, I will train my models again with the new audio features and post my updates here when I'm done :)
I watched your videos again and found that the training audio in the second one sounds a little quiet and unclear, which is also bad for audio feature extraction. You may have to enhance it if the result is still undesirable.
After tweaking the audio (boosting the volume, removing the background noise), I tried a few HuBERT models (TencentGameMate/chinese-hubert-large, facebook/hubert-large-ls960-ft, etc.) but still couldn't get a good lip-sync result; however, the human figure is better than with GeneFace.
Closing the issue for now; I will look into this more deeply later.
The problem was caused by the --emb mode; do not use it when using HuBERT features. Also, facebook/hubert-large-ls960-ft works better than TencentGameMate/chinese-hubert-large even though the training data is in Chinese.
Using the Facebook denoiser to preprocess the training audio can also improve the result.
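That denoising step can be scripted; a sketch assuming the facebookresearch `denoiser` package (pip install denoiser), which exposes its CLI as a runnable module. The helper names and directory paths here are placeholders:

```python
# Sketch of the denoising preprocessing step, assuming the
# facebookresearch `denoiser` package is installed.
import subprocess

def build_denoise_cmd(noisy_dir, out_dir):
    # `denoiser.enhance` is the package's CLI entry point;
    # --dns64 selects a pretrained model trained on the DNS dataset.
    return ["python", "-m", "denoiser.enhance",
            "--dns64",
            "--noisy_dir", noisy_dir,
            "--out_dir", out_dir]

def denoise_dir(noisy_dir, out_dir):
    subprocess.run(build_denoise_cmd(noisy_dir, out_dir), check=True)
```

The enhanced WAVs written to `out_dir` can then be fed to the normal data-preprocessing pipeline in place of the originals.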
The results are quite impressive: the lips are even more accurate than with Wav2Lip, and no post-processing is needed for ER-NeRF.
https://github.com/Fictionarry/ER-NeRF/assets/49446651/6638fe6a-48aa-4146-8a13-1f08c9afeede
I have a few follow-up questions:
Thanks again for this great framework!
Hello, we are testing the model but ran into problems similar to the OP's, including an unstable head, fake-looking hair, incorrect occlusion of the hair on the left side, blurry lip shapes, and flickering white artifacts in parts of the clothing texture.
https://github.com/Fictionarry/ER-NeRF/assets/141602355/659610d4-5310-4d25-94a7-ab36be33fbd1
We are using facebook/hubert-large-ls960-ft.
The background has no obvious noise and the audio is loud enough; the training video is 5 minutes long, and the subject is a Chinese woman.
We are not sure what the exact problem is; please help take a look.
Hello, I used HuBERT's facebook/hubert-large-ls960-ft model and used the --emb parameter during training and inference. Why is the mouth shape still completely misaligned and unable to open during inference?
The emb parameter defaults to False, and what BingliangLi said was: "The problem is caused by --emb mode, do not use it when using HuBERT feature."
My understanding is that you should not use this parameter?
Hi everyone: after training with HuBERT, the mouth shapes in the generated video do not match the audio at all, and the video does not stop when the audio ends. How should I adjust for these two problems? I used a clip from a news broadcast, so the subject, voice, and lip shapes should all be very standard. Thanks for any replies~ @Fictionarry @BingliangLi
Thanks very much for sharing this great work. I trained the model on two custom videos, one with the DeepSpeech feature (about 27 PSNR and 0.05 LPIPS, torso) and one with the HuBERT-CN feature (about 24 PSNR and 0.1 LPIPS, torso). Neither result is ideal: the lips are barely moving at all, and the head keeps shaking. I did the training exactly as in your description; what might be the problem here?
All my training videos are about 5 minutes long, and I set the iterations in all three training stages exactly as described in the doc.
https://github.com/Fictionarry/ER-NeRF/assets/49446651/62b1ed83-c3bd-41c1-b61d-c806bb8f5150
https://github.com/Fictionarry/ER-NeRF/assets/49446651/3ce49fe4-1e73-4a89-be40-80d75adc1486
https://github.com/Fictionarry/ER-NeRF/assets/49446651/911bc41f-972e-49a6-97e3-809db81ef2ba
https://github.com/Fictionarry/ER-NeRF/assets/49446651/bcf74d12-09ef-447e-a994-a4ebbdf459cc
If it isn't too much bother, could you please share more pretrained models along with their training data and the exact commands used to train them? And what might be the reason for my results?