Fictionarry / TalkingGaussian

[ECCV'24] TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting
https://fictionarry.github.io/TalkingGaussian/

Chinese lip-sync doesn't match #20

Closed zhouzhenneng closed 2 months ago

zhouzhenneng commented 2 months ago

Hi, I previously trained and ran inference with ER-NeRF, and the Chinese lip-sync accuracy was acceptable. Now I'm trying TalkingGaussian and the Chinese mouth shapes don't match the audio. Below are my commands; the iteration count is the default, and the source material is a 5-minute green-screen video at 25 fps:

Preprocessing

python data_utils/process.py /root/share/talkingGaussian/train/data/ao_head/ao_head.mp4

Teeth mask

export PYTHONPATH=./data_utils/easyportrait
python ./data_utils/easyportrait/create_teeth_mask.py /root/share/talkingGaussian/train/data/ao_head/

HuBERT audio feature extraction

python data_utils/hubert.py --wav /root/audio/cosyVoice_fish_faster.wav
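
For context, data_utils/hubert.py converts the .wav into a .npy of per-frame HuBERT features. A minimal sketch of the usual extraction step, assuming an ER-NeRF-style recipe (the checkpoint name and the 16 kHz mono input are assumptions, not verified repo internals):

```python
import soundfile as sf
import torch
from transformers import HubertModel, Wav2Vec2Processor

# Hypothetical checkpoint choice; ER-NeRF-style pipelines commonly use this one.
processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
model = HubertModel.from_pretrained("facebook/hubert-large-ls960-ft").eval()

wav, sr = sf.read("/root/audio/cosyVoice_fish_faster.wav")  # 16 kHz mono assumed
inputs = processor(wav, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    # Hidden states of shape [1, T, 1024]: one 1024-dim vector per audio frame.
    hidden = model(inputs.input_values).last_hidden_state
```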

Training

bash scripts/train_xx.sh /root/share/talkingGaussian/train/data/ao_head/ /root/share/talkingGaussian/train/trial/ao_head/ 2 --audio_extractor hubert

Inference

python synthesize_fuse.py -S /root/share/talkingGaussian/train/data/ao_head/ -M /root/share/talkingGaussian/train/trial/ao_head/ --use_train --audio /root/audio/cosyVoice_fish_faster_hu.npy --dilate --audio_extractor hubert

A clip of the synthesized video is below:

https://github.com/user-attachments/assets/92370606-9650-4ef3-af29-8518f6b44367

Remaining issues:

  1. Even with the --dilate flag, a gap between the teeth and the mouth remains
  2. Could an insufficient iteration count be the reason the Chinese lip shapes fail to align?
  3. I modified train_face.py, but the mouth region is still noisy (a restored version of this term is sketched after this list): loss += 0.01 * lpips_criterion(image_t.clone()[:, xmin:xmax, ymin:ymax] * 2 - 1, gt_image_t.clone()[:, xmin:xmax, ymin:ymax] * 2 - 1).mean()
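
A self-contained sketch of the mouth-crop LPIPS term from point 3, with the dropped * operators restored (the lpips package, the CHW layout in [0, 1], and the crop coordinates are assumptions):

```python
import lpips
import torch

lpips_criterion = lpips.LPIPS(net="vgg")  # backbone choice is an assumption

def mouth_lpips(image_t: torch.Tensor, gt_image_t: torch.Tensor,
                xmin: int, xmax: int, ymin: int, ymax: int,
                weight: float = 0.01) -> torch.Tensor:
    # Crop the mouth region from CHW images in [0, 1], then rescale to the
    # [-1, 1] range LPIPS expects; that is what the "* 2 - 1" does.
    pred = (image_t[:, xmin:xmax, ymin:ymax] * 2 - 1).unsqueeze(0)
    gt = (gt_image_t[:, xmin:xmax, ymin:ymax] * 2 - 1).unsqueeze(0)
    return weight * lpips_criterion(pred, gt).mean()
```
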
Fictionarry commented 2 months ago

I've never seen sync this bad; it looks like the audio and video aren't even from the same take. Parts of your training procedure also look odd. For example, in bash scripts/train_xx.sh /root/share/talkingGaussian/train/data/ao_head/ /root/share/talkingGaussian/train/trial/ao_head/ 2 --audio_extractor hubert: if you are using the script shipped in the repo, passing --audio_extractor hubert on the bash command line has no effect, so training actually used deepspeech, yet inference used hubert, which should normally raise an error. Are you sure the whole pipeline ran cleanly?
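
One quick way to check for the deepspeech/hubert mismatch described above is to inspect the shape of the feature file fed to inference (assuming, as in ER-NeRF, DeepSpeech features are 29-dimensional and HuBERT features 1024-dimensional in the last axis):

```python
import numpy as np

feats = np.load("/root/audio/cosyVoice_fish_faster_hu.npy")
# A last dimension of 1024 indicates HuBERT features; 29 indicates DeepSpeech.
print(feats.shape)
```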

zhouzhenneng commented 2 months ago

Thanks for the reply. The whole training and inference run finished without errors before; I'll retrain and check again. Is HuBERT currently supported, and how do I train and run inference on Chinese audio with it?

zhouzhenneng commented 2 months ago

The README says: "Similar to ER-NeRF, HuBERT is also available. Recommended for situations if the audio is not in English. Specify --audio_extractor hubert when training and testing." How exactly do I use HuBERT during training and inference?

Fictionarry commented 2 months ago

> The README says: "Similar to ER-NeRF, HuBERT is also available. Recommended for situations if the audio is not in English. Specify --audio_extractor hubert when training and testing." How exactly do I use HuBERT during training and inference?

It is set inside train_xx.sh; take a look there.

zhouzhenneng commented 2 months ago

> The README says: "Similar to ER-NeRF, HuBERT is also available. Recommended for situations if the audio is not in English. Specify --audio_extractor hubert when training and testing." How exactly do I use HuBERT during training and inference?

> It is set inside train_xx.sh; take a look there.

Thanks for the reply. I checked train_xx.sh and found it had already been modified:

dataset=$1
workspace=$2
gpu_id=$3
audio_extractor='hubert' # deepspeech, esperanto, hubert
export CUDA_VISIBLE_DEVICES=$gpu_id

So even though the bash argument was passed incorrectly during training, HuBERT was in fact used, and the inference command should be correct: python synthesize_fuse.py -S /root/share/talkingGaussian/train/data/ao_head/ -M /root/share/talkingGaussian/train/trial/ao_head/ --use_train --audio /root/audio/cosyVoice_fish_faster_hu.npy --dilate --audio_extractor hubert

As a next step, I'll increase the iterations to 100k and watch for error messages. Could you share an email address? I'd like to send you the video and audio material so you can help verify the lip-sync problem.

Fictionarry commented 2 months ago

> So even though the bash argument was passed incorrectly during training, HuBERT was in fact used, and the inference command should be correct: python synthesize_fuse.py -S /root/share/talkingGaussian/train/data/ao_head/ -M /root/share/talkingGaussian/train/trial/ao_head/ --use_train --audio /root/audio/cosyVoice_fish_faster_hu.npy --dilate --audio_extractor hubert. As a next step, I'll increase the iterations to 100k and watch for error messages. Could you share an email address? I'd like to send you the video and audio material so you can help verify the lip-sync problem.

Send them over and I'll take a look. My email is on my GitHub profile page.

Fictionarry commented 2 months ago

Follow-up: the problem turned out to be that lip motion was being fitted onto unrelated facial-expression parameters. After increasing the penalty coefficient at train_face.py line 194 to 1e-3, the output is normal.

https://github.com/Fictionarry/TalkingGaussian/blob/0792f7ebc8ce6419f2dc4394a785597aa998b86f/train_face.py#L193-L198
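
For readers hitting the same symptom, a hypothetical illustration of the shape of the fix; the authoritative code is in the permalinked lines above, and the variable name below is an assumption:

```python
import torch

def expression_leak_penalty(motion_residual: torch.Tensor,
                            weight: float = 1e-3) -> torch.Tensor:
    # L1 penalty on audio-driven deformation outside the mouth region.
    # Raising the weight to 1e-3, per the fix above, stops lip motion from
    # being absorbed by unrelated expression parameters.
    return weight * motion_residual.abs().mean()
```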