Fictionarry / TalkingGaussian

[ECCV'24] TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting
https://fictionarry.github.io/TalkingGaussian/

Chinese lip-sync doesn't match #20

Closed zhouzhenneng closed 2 months ago

zhouzhenneng commented 2 months ago

Hi, I previously trained and ran inference with ER-NeRF, and the Chinese lip-sync accuracy was acceptable. Now I'm trying TalkingGaussian and the Chinese mouth shapes don't match the audio. Below are my commands; the iteration count is the default, and the source material is a 5-minute green-screen video at 25 fps:

Preprocessing

python data_utils/process.py /root/share/talkingGaussian/train/data/ao_head/ao_head.mp4

Teeth mask

export PYTHONPATH=./data_utils/easyportrait
python ./data_utils/easyportrait/create_teeth_mask.py /root/share/talkingGaussian/train/data/ao_head/

HuBERT audio feature extraction

python data_utils/hubert.py --wav /root/audio/cosyVoice_fish_faster.wav
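
For context, data_utils/hubert.py converts the .wav into a .npy of per-frame HuBERT features. A minimal sketch of the usual extraction step, assuming an ER-NeRF-style recipe (the checkpoint name and the 16 kHz mono input are assumptions, not verified repo internals):

```python
import soundfile as sf
import torch
from transformers import HubertModel, Wav2Vec2Processor

# Hypothetical checkpoint choice; ER-NeRF-style pipelines commonly use this one.
processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
model = HubertModel.from_pretrained("facebook/hubert-large-ls960-ft").eval()

wav, sr = sf.read("/root/audio/cosyVoice_fish_faster.wav")  # 16 kHz mono assumed
inputs = processor(wav, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    # Hidden states of shape [1, T, 1024]: one 1024-dim vector per audio frame.
    hidden = model(inputs.input_values).last_hidden_state
```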

Training

bash scripts/train_xx.sh /root/share/talkingGaussian/train/data/ao_head/ /root/share/talkingGaussian/train/trial/ao_head/ 2 --audio_extractor hubert

Inference

python synthesize_fuse.py -S /root/share/talkingGaussian/train/data/ao_head/ -M /root/share/talkingGaussian/train/trial/ao_head/ --use_train --audio /root/audio/cosyVoice_fish_faster_hu.npy --dilate --audio_extractor hubert

A clip of the synthesized video is below:

https://github.com/user-attachments/assets/92370606-9650-4ef3-af29-8518f6b44367

Remaining issues:

  1. Even with the --dilate flag, a gap between the teeth and the mouth remains
  2. Could an insufficient iteration count be the reason the Chinese lip shapes fail to align?
  3. I modified train_face.py, but the mouth region is still noisy (a restored version of this term is sketched after this list): loss += 0.01 * lpips_criterion(image_t.clone()[:, xmin:xmax, ymin:ymax] * 2 - 1, gt_image_t.clone()[:, xmin:xmax, ymin:ymax] * 2 - 1).mean()
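
A self-contained sketch of the mouth-crop LPIPS term from point 3, with the dropped * operators restored (the lpips package, the CHW layout in [0, 1], and the crop coordinates are assumptions):

```python
import lpips
import torch

lpips_criterion = lpips.LPIPS(net="vgg")  # backbone choice is an assumption

def mouth_lpips(image_t: torch.Tensor, gt_image_t: torch.Tensor,
                xmin: int, xmax: int, ymin: int, ymax: int,
                weight: float = 0.01) -> torch.Tensor:
    # Crop the mouth region from CHW images in [0, 1], then rescale to the
    # [-1, 1] range LPIPS expects; that is what the "* 2 - 1" does.
    pred = (image_t[:, xmin:xmax, ymin:ymax] * 2 - 1).unsqueeze(0)
    gt = (gt_image_t[:, xmin:xmax, ymin:ymax] * 2 - 1).unsqueeze(0)
    return weight * lpips_criterion(pred, gt).mean()
```
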
Fictionarry commented 2 months ago

I've never seen sync this bad; it looks like the audio and video aren't even from the same take. Parts of your training procedure also look odd. For example, in bash scripts/train_xx.sh /root/share/talkingGaussian/train/data/ao_head/ /root/share/talkingGaussian/train/trial/ao_head/ 2 --audio_extractor hubert: if you are using the script shipped in the repo, passing --audio_extractor hubert on the bash command line has no effect, so training actually used deepspeech, yet inference used hubert, which should normally raise an error. Are you sure the whole pipeline ran cleanly?
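
One quick way to check for the deepspeech/hubert mismatch described above is to inspect the shape of the feature file fed to inference (assuming, as in ER-NeRF, DeepSpeech features are 29-dimensional and HuBERT features 1024-dimensional in the last axis):

```python
import numpy as np

feats = np.load("/root/audio/cosyVoice_fish_faster_hu.npy")
# A last dimension of 1024 indicates HuBERT features; 29 indicates DeepSpeech.
print(feats.shape)
```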

zhouzhenneng commented 2 months ago

Thanks for the reply. The whole training and inference run finished without errors before; I'll retrain and check again. Is HuBERT currently supported, and how do I train and run inference on Chinese audio with it?

zhouzhenneng commented 2 months ago

The README says: "Similar to ER-NeRF, HuBERT is also available. Recommended for situations if the audio is not in English. Specify --audio_extractor hubert when training and testing." How exactly do I use HuBERT during training and inference?

Fictionarry commented 2 months ago

> The README says: "Similar to ER-NeRF, HuBERT is also available. Recommended for situations if the audio is not in English. Specify --audio_extractor hubert when training and testing." How exactly do I use HuBERT during training and inference?

It is set inside train_xx.sh; take a look there.

zhouzhenneng commented 2 months ago

> The README says: "Similar to ER-NeRF, HuBERT is also available. Recommended for situations if the audio is not in English. Specify --audio_extractor hubert when training and testing." How exactly do I use HuBERT during training and inference?

> It is set inside train_xx.sh; take a look there.

Thanks for the reply. I checked train_xx.sh and found it had already been modified:

dataset=$1
workspace=$2
gpu_id=$3
audio_extractor='hubert' # deepspeech, esperanto, hubert
export CUDA_VISIBLE_DEVICES=$gpu_id

So even though the bash argument was passed incorrectly during training, HuBERT was in fact used, and the inference command should be correct: python synthesize_fuse.py -S /root/share/talkingGaussian/train/data/ao_head/ -M /root/share/talkingGaussian/train/trial/ao_head/ --use_train --audio /root/audio/cosyVoice_fish_faster_hu.npy --dilate --audio_extractor hubert

As a next step, I'll increase the iterations to 100k and watch for error messages. Could you share an email address? I'd like to send you the video and audio material so you can help verify the lip-sync problem.

Fictionarry commented 2 months ago

> So even though the bash argument was passed incorrectly during training, HuBERT was in fact used, and the inference command should be correct: python synthesize_fuse.py -S /root/share/talkingGaussian/train/data/ao_head/ -M /root/share/talkingGaussian/train/trial/ao_head/ --use_train --audio /root/audio/cosyVoice_fish_faster_hu.npy --dilate --audio_extractor hubert. As a next step, I'll increase the iterations to 100k and watch for error messages. Could you share an email address? I'd like to send you the video and audio material so you can help verify the lip-sync problem.

Send them over and I'll take a look. My email is on my GitHub profile page.

Fictionarry commented 2 months ago

Follow-up: the problem turned out to be that lip motion was being fitted onto unrelated facial-expression parameters. After increasing the penalty coefficient at train_face.py line 194 to 1e-3, the output is normal.

https://github.com/Fictionarry/TalkingGaussian/blob/0792f7ebc8ce6419f2dc4394a785597aa998b86f/train_face.py#L193-L198
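
For readers hitting the same symptom, a hypothetical illustration of the shape of the fix; the authoritative code is in the permalinked lines above, and the variable name below is an assumption:

```python
import torch

def expression_leak_penalty(motion_residual: torch.Tensor,
                            weight: float = 1e-3) -> torch.Tensor:
    # L1 penalty on audio-driven deformation outside the mouth region.
    # Raising the weight to 1e-3, per the fix above, stops lip motion from
    # being absorbed by unrelated expression parameters.
    return weight * motion_residual.abs().mean()
```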