Fictionarry / ER-NeRF

[ICCV'23] Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis
https://fictionarry.github.io/ER-NeRF/
MIT License

Poor results from custom data: lips barely moving, head not stable #16

BingliangLi opened this issue 11 months ago

BingliangLi commented 11 months ago

Thanks very much for sharing this great work. I trained the model on two custom videos, one with DeepSpeech features (about 27 PSNR and 0.05 LPIPS, torso) and one with HuBERT cn features (about 24 PSNR and 0.1 LPIPS, torso). Neither result is ideal: the lips barely move at all, and the head keeps shaking. I followed the training procedure exactly as described; what might be the problem here?

Both training videos are about 5 minutes long, and I set the iterations in all three training stages exactly as described in the docs.

  1. Inference with TTS audio (DeepSpeech model):

https://github.com/Fictionarry/ER-NeRF/assets/49446651/62b1ed83-c3bd-41c1-b61d-c806bb8f5150

  2. Inference with the training audio, using the same .npy file as during training (DeepSpeech model):

https://github.com/Fictionarry/ER-NeRF/assets/49446651/3ce49fe4-1e73-4a89-be40-80d75adc1486

  3. Inference with TTS audio (HuBERT cn model):

https://github.com/Fictionarry/ER-NeRF/assets/49446651/911bc41f-972e-49a6-97e3-809db81ef2ba

  4. The pretrained model works fine with TTS audio (DeepSpeech features), so I believe this has something to do with the training setup? Maybe training for more epochs would help?

https://github.com/Fictionarry/ER-NeRF/assets/49446651/bcf74d12-09ef-447e-a994-a4ebbdf459cc

If it isn't too much trouble, could you please share more pretrained models along with their training data and the exact commands used to train them? And what might be the reason for my results?

Fictionarry commented 11 months ago

Hi, to my knowledge, such a failure mainly comes from a poor correlation between the audio features and the visual appearance. Obvious lip movements can be observed in very early epochs if the training is going smoothly.

Chinese speech videos are much more difficult due to the lack of high-quality pretrained audio extractors. We do not suggest DeepSpeech or the HuBERT weights "jonatasgrosman/exp_w2v2t_zh-cn_hubert_s449" in hubert_cn.py for Chinese. Instead, we found that "facebook/hubert-large-ls960-ft" in hubert.py performs better, even though it is trained on English speech rather than Chinese.
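The swap is essentially just the checkpoint name. A minimal sketch of the feature-extraction step, assuming `transformers` and `soundfile` are installed and the audio is already 16 kHz mono (the exact script layout and output file name in the repo's hubert.py may differ; paths here are illustrative):

```python
# Sketch: extract frame-level HuBERT features with the English checkpoint.
import numpy as np
import soundfile as sf
import torch
from transformers import Wav2Vec2Processor, HubertModel

ckpt = "facebook/hubert-large-ls960-ft"   # checkpoint recommended above
processor = Wav2Vec2Processor.from_pretrained(ckpt)
model = HubertModel.from_pretrained(ckpt).eval()

wav, sr = sf.read("data/obama/aud.wav")   # illustrative path
assert sr == 16000, "resample to 16 kHz mono before feature extraction"

inputs = processor(wav, sampling_rate=sr, return_tensors="pt").input_values
with torch.no_grad():
    feats = model(inputs).last_hidden_state   # (1, T, 1024) frame features

np.save("data/obama/aud_hu.npy", feats.squeeze(0).numpy())  # illustrative name
```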

The Obama weights were trained with our provided script, and the Chinese demo shown on our project page was generated with the same hyperparameters but with HuBERT audio features. So you can try the provided HuBERT or Wav2Vec extractors and see whether the lip movements can be learned successfully.
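Roughly, the three training stages look like this (a sketch only: the flags reflect the README's commands for the Obama data, and the head checkpoint path is illustrative; verify against the current docs):

```python
# Sketch: the three training stages, launched from Python for convenience.
# Flags and paths are my reading of the README, not guaranteed current.
import subprocess

commands = [
    # stage 1: head training
    "python main.py data/obama/ --workspace trial_obama/ -O --iters 100000",
    # stage 2: lips fine-tuning
    "python main.py data/obama/ --workspace trial_obama/ -O --iters 125000 "
    "--finetune_lips --patch_size 32",
    # stage 3: torso training, initialized from the trained head checkpoint
    "python main.py data/obama/ --workspace trial_obama_torso/ -O --torso "
    "--head_ckpt trial_obama/checkpoints/ngp.pth --iters 200000",
]
for cmd in commands:
    subprocess.run(cmd.split(), check=True)
```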

BingliangLi commented 11 months ago

Thank you for the suggestions, I will train my models again with the new audio features and post my updates here when I'm done :)

Fictionarry commented 11 months ago

> Thank you for the suggestions, I will train my models again with the new audio features and post my updates here when I'm done :)

I have watched your videos again and found that the training audio in the second one sounds a little low in volume and unclear, which is also bad for the audio feature extraction. You may have to enhance it if the result is still undesirable.
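Something as simple as peak normalization may already help. A sketch with pydub (pydub is just one option, not something the repo prescribes; file names are illustrative):

```python
# Sketch: boost a quiet training track and force the 16 kHz mono layout
# that the audio feature extractors expect.
from pydub import AudioSegment
from pydub.effects import normalize

audio = AudioSegment.from_wav("aud_raw.wav")          # illustrative input
audio = normalize(audio)                              # peak-normalize volume
audio = audio.set_frame_rate(16000).set_channels(1)   # 16 kHz mono
audio.export("aud.wav", format="wav")
```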

BingliangLi commented 10 months ago

After tweaking the audio (boosting the volume and removing the background noise), I tried a few HuBERT models (TencentGameMate/chinese-hubert-large, facebook/hubert-large-ls960-ft, etc.) but still can't get a good lip-sync result. However, the human figure is better than GeneFace.

Closing the issue for now, I will look deeper into this later.

BingliangLi commented 10 months ago

The problem is caused by the --emb mode: do not use it with HuBERT features. Also, facebook/hubert-large-ls960-ft is better than TencentGameMate/chinese-hubert-large even when the training data is in Chinese. Preprocessing the training audio with the facebook denoiser can also improve the result (see the sketch below).
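The denoising step can be run through the facebookresearch/denoiser CLI. A sketch, assuming `pip install denoiser` and with directory names that are illustrative:

```python
# Sketch: enhance raw training audio with facebookresearch/denoiser.
# Module and flags follow that project's README; directories must exist.
import subprocess

subprocess.run(
    [
        "python", "-m", "denoiser.enhance",
        "--dns64",                    # pretrained model from the DNS challenge
        "--noisy_dir", "audio_raw/",  # folder holding the raw wav(s)
        "--out_dir", "audio_clean/",  # enhanced wavs are written here
    ],
    check=True,
)
```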

The results are quite impressive: the lips are even more accurate than with Wav2Lip, and no post-processing is needed for ER-NeRF.

https://github.com/Fictionarry/ER-NeRF/assets/49446651/6638fe6a-48aa-4146-8a13-1f08c9afeede

I have a few follow-up questions:

  1. Does more training data lead to better results? How much (how long) data can the model actually learn from? (I'm planning to use 10~15 min of data to improve the results.)
  2. Can we explicitly control the torso and head direction in real time? Basically, I want to create an idle mode similar to the one in SadTalker: https://github.com/OpenTalker/SadTalker/discussions/386#discussion-5289020
  3. For real-time streaming inference, are there any recommended frameworks?

Thanks again for this great framework!

Michaelho2019 commented 10 months ago

Hello, we are testing the model and ran into problems similar to the OP's, including an unstable head, fake-looking hair, incorrect occlusion of the hair on the left side, a blurry mouth shape, and some textures on the clothing flickering with white flashes.

https://github.com/Fictionarry/ER-NeRF/assets/141602355/659610d4-5310-4d25-94a7-ab36be33fbd1

We used facebook/hubert-large-ls960-ft. The background has no obvious noise and the voice is loud enough; the training video is 5 minutes long and the subject is a Chinese woman.

We are not sure what the exact problem is; please help take a look.

ligyvip commented 9 months ago

Hello, I have used HuBERT's facebook/hubert-large-ls960-ft model and used the 'emb' parameter during both training and inference. Why is the mouth shape still completely misaligned and unable to open during inference?

husthzy commented 8 months ago

> Hello, I have used HuBERT's facebook/hubert-large-ls960-ft model and used the 'emb' parameter during both training and inference. Why is the mouth shape still completely misaligned and unable to open during inference?

The emb parameter defaults to False, and what BingliangLi said was: "The problem is caused by --emb mode, do not use it when using HuBERT feature."

My understanding is that you should not use this parameter, right?
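In other words, the flag is opt-in: a store_true argument stays False unless you pass it on the command line. A minimal illustration (the help text is an assumption; check the actual declaration in the repo's main.py):

```python
# Sketch: how a store_true flag like --emb behaves by default.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--emb", action="store_true",
                    help="(assumed wording) use audio embedding mode")

print(parser.parse_args([]).emb)         # no flag passed -> False
print(parser.parse_args(["--emb"]).emb)  # flag passed    -> True
```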

ajiansoft commented 8 months ago

Hello everyone. After training with HuBERT, the mouth shape in the generated video does not match the audio at all, and the video is not truncated to follow the audio. How should I adjust for these two problems? I used a clip from a news broadcast, so in theory the subject, voice, and lip shapes should all be very standard. Thanks for any replies~ @Fictionarry @BingliangLi