ashawkey / RAD-NeRF

Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition
MIT License

bad video quality after training based on my video. #7

Open ruanjiyang opened 1 year ago

ruanjiyang commented 1 year ago

Dear ashawkey

Thanks for your great project. I followed the process in the readme exactly; the original video is 4 minutes in total (25 fps).
I trained 200000 iters for the head plus an additional 50000 iters for fine-tuning the lips (so 250000 iters in total), but the synthetic video I finally got is shown below. Do you have any suggestions? How can I get a synthetic video of similar quality to the demo Obama video you provided? Thanks a lot!

https://user-images.githubusercontent.com/45660925/209437878-28e8a7cf-2192-41e6-a59b-54185c1e39da.mp4

ruanjiyang commented 1 year ago

As you can see, the eyes look very strange, and the speaking lips also look very strange.

ashawkey commented 1 year ago

@ruanjiyang Hi,

  1. It seems the eyes are not well learned. In this case, you could try to fix the eye movement using --fix_eye 0.25.
  2. Lip sync for non-English datasets is usually worse due to the ASR model.
  3. For the torso, it seems some semantic segmentation is wrong. Training a torso model may help.
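
For reference, suggestion 1 corresponds to a test-time flag; a possible invocation might look like the line below (the data and workspace paths are only illustrative, and only --fix_eye 0.25 comes from the suggestion above):

python main.py data/<your_id>/ --workspace trial_<your_id>/ -O --test --fix_eye 0.25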
ruanjiyang commented 1 year ago

@ruanjiyang Hi,

  1. It seems the eyes are not well learned. In this case, you could try to fix the eye movement using --fix_eye 0.25.
  2. Lip sync for non-English datasets is usually worse due to the ASR model.
  3. For the torso, it seems some semantic segmentation is wrong. Training a torso model may help.

Dear Ashawkey

Thanks for your feedback. Let me try again.

ruanjiyang commented 1 year ago

I have tried to use the Chinese version of wav2vec2; see the following line:

parser.add_argument('--model', type=str, default='ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt')

and I found that the audio_dim for this model is 21128, which is much larger than for the 'cpierse/wav2vec2-large-xlsr-53-esperanto' model, where it is only 44.

Is there anything wrong? Should I use such a large audio_dim for 'ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt'?

thanks.
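
In case it helps, the audio_dim here is just the CTC vocabulary size of the Hugging Face model, so it can be checked directly; a minimal sketch (assuming the model can be loaded as a CTC model via transformers):

from transformers import AutoModelForCTC

model = AutoModelForCTC.from_pretrained('ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt')
print(model.config.vocab_size)  # number of character classes, i.e. the audio_dim discussed above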

ashawkey commented 1 year ago

This is caused by the large number of Chinese character classes. I'm afraid this will be too large for the MLP to work well, but you could try. In fact, character labels are not very suitable for guiding the lips, since what we actually need is the sound (phonemes).

Erickrus commented 1 year ago

Many thanks for your contribution! Great work!

I have the same issue.

The dataset is around 5 minutes of data (25 fps), talking in Mandarin.

Expressive: It seems the lips can open and close based on the voice. However, the shape of the lips is not very expressive. I tried to fine-tune the lips with more iters, but the LPIPS loss doesn't improve. Do I need more training data or a different audio feature extraction method? Any comments?

Open during silence: When there's no voice, the mouth usually appears to be open. How can I close the lips during silence?

ashawkey commented 1 year ago

@Erickrus Hi, could you check the performance on the self-driven test set? Which ASR model are you using? Fine-tuning the lips mainly aims to improve sharpness, and may not help enhance lip sync.

JuneoXIE commented 1 year ago

Hi @Erickrus The latest chinese deepspeech ASR model deepspeech-0.9.3-models-zh-CN.pbmm might work. I'm trying it.

Erickrus commented 1 year ago

log_ngp.txt after --finetune_lips step

++> Evaluate at epoch 37 ...
PSNR = 26.028605
LPIPS (alex) = 0.082468

Performance on the self-driven test set:

ASR model (by default): cpierse/wav2vec2-large-xlsr-53-esperanto

# try to visualize the audio features
import numpy as np
from PIL import Image

# data: the extracted audio features, shape [num_frames, window, audio_dim]
data = np.reshape(data, [data.shape[0], data.shape[1] * data.shape[2]])  # [837, 16*44]
data = (data - np.min(data)) / (np.max(data) - np.min(data))             # normalize to [0, 1]
im = Image.fromarray((data * 255.).astype(np.uint8))
im  # display (e.g. in a notebook)

It seems the features are not very distinguishable from character to character (compared to a mel-spectrogram).

For ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt, maybe one could merge the logits based on pinyin codes.

ashawkey commented 1 year ago

Yes, the current audio processing pipeline is quite problematic for Chinese...

a312863063 commented 1 year ago

I have tried to use the Chinese version of wav2vec2; see the following line:

parser.add_argument('--model', type=str, default='ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt')

and I found that the audio_dim for this model is 21128, which is much larger than for the 'cpierse/wav2vec2-large-xlsr-53-esperanto' model, where it is only 44.

Is there anything wrong? Should I use such a large audio_dim for 'ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt'?

thanks.

In my experiments, 'jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn' works better for Chinese (3503 to 64). In contrast, the 'ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt' model makes the mouth static (21128 to 64).

And this Chinese ASR project is quite useful: https://github.com/chenkui164/FastASR (runs in real time)

JuneoXIE commented 1 year ago

@a312863063 Hi, how do you merge the original logits into a low-dimension vector?

a312863063 commented 1 year ago

@a312863063 Hi, how do you merge the original logits into a low-dimension vector?

Hi, you can see how it maps the predicted vector of any dimension to the 64-dimensional features in here. If the input dimension is too high or the predicted vector is not accurate, the effect will not be very good.

I just directly passed the ASR prediction results to AudioNet. Maybe you could do some change to the AudioNet to make it adapt to the new ASR, good luck!

Erickrus commented 1 year ago

@a312863063 Hi, how do you merge the original logits into a low-dimension vector?

Hi, you can see how it maps the predicted vector of any dimension to the 64-dimensional features in here. If the input dimension is too high or the predicted vector is not accurate, the effect will not be very good.

I just directly passed the ASR prediction results to AudioNet. Maybe you could do some change to the AudioNet to make it adapt to the new ASR, good luck!

Is there any improvement , when switching to 'jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn' ?

cmmclee commented 1 year ago

@a312863063 Hi, how do you merge the original logits into a low-dimension vector?

Hi, you can see how it maps the predicted vector of any dimension to the 64-dimensional features in here. If the input dimension is too high or the predicted vector is not accurate, the effect will not be very good. I just directly passed the ASR prediction results to AudioNet. Maybe you could do some change to the AudioNet to make it adapt to the new ASR, good luck!

Is there any improvement , when switching to 'jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn' ?

I tried but failed. So how should I change the encoder_conv module of AudioNet? The audio dim_in of 'wav2vec2-large-xlsr-53-chinese-zh-cn' is 3503, which is far more than 44. https://github.com/ashawkey/RAD-NeRF/blob/32a5aba2d102b62a2c0a7adbf4e1e6e7564e8e44/nerf/network.py#L46
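
If it helps, one low-effort workaround (a hypothetical sketch, not the repo's actual code; the class and names below are made up for illustration) is to project the 3503-dimensional logits down to the 44 dimensions the existing encoder_conv expects, before they reach AudioNet:

import torch.nn as nn

class LogitsProjector(nn.Module):
    # maps high-dimensional ASR logits down to the audio_dim AudioNet was built for
    def __init__(self, dim_in=3503, dim_out=44):
        super().__init__()
        self.proj = nn.Linear(dim_in, dim_out)

    def forward(self, x):
        # x: [B, T, dim_in] per-frame ASR logits -> [B, T, dim_out]
        return self.proj(x)

Alternatively, you could change dim_in of the first conv layer in encoder_conv to 3503 and retrain, at the cost of a much larger first layer.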

a312863063 commented 1 year ago

@a312863063 Hi, how do you merge the original logits into a low-dimension vector?

Hi, you can see how it maps the predicted vector of any dimension to the 64-dimensional features in here. If the input dimension is too high or the predicted vector is not accurate, the effect will not be very good. I just directly passed the ASR prediction results to AudioNet. Maybe you could do some change to the AudioNet to make it adapt to the new ASR, good luck!

Is there any improvement , when switching to 'jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn' ?

ASR result of model 'jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn': 大家豪我 c l瑞 月 就 c塞 今日姑远临防 防 连控机止林 开 li良 c场 西 门 发布会 在音月八日 下午三时的发布会 上 姑院 联防连控机制 将介绍第 史版防控发案 地有关情况 国家级控局 相关司局负责 同治和中国 集控中心专家 将初起 逸月期日 院临 防 连控机制以 举 办 西文发布会 介绍了农 村 地区异情流行 期间结 合病毒便意情 况 意情流 行强度 医疗 资源复合 和社会运转 情况综合评 估事时 依法采取离时 性的防控所施 皆少职元 聚集 降低一人院流动 建今感染 者段时期巨 增队社会运行 和医疗 资源等的充击 春杰 吉将莱林 怨在卖 回家 的人能够抱着评 安庸着见 康 拆着幸 福鞋 着 快 乐 漏 cá 温 馨带着田 蜜 先着 才 运麦鲁加 门进请开 心 二年二三年会 是个美好的心 开端

The composited video looks like this (not so good, with a lot of ambiguity and wrong pronunciation):

cmmclee commented 1 year ago

Did you figure out whether ASR accuracy affects lip synthesis? I have tried several Chinese ASR models, such as 'jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn' and 'TencentGameMate/chinese-wav2vec2-large', but the results have not improved significantly. What about your trials?

flyingshan commented 1 year ago

Hi @Erickrus The latest chinese deepspeech ASR model deepspeech-0.9.3-models-zh-CN.pbmm might work. I'm trying it.

Have you tried this model? I found that the pbmm file format is not compatible with currently used deepspeech model.

JuneoXIE commented 1 year ago

@flyingshan Hi, I tried this pbmm model and found the same problem... I also tried the Chinese wav2vec2 model 'jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn', but I didn't achieve the performance that @a312863063 shows above, probably because my training video is not suitable.

Erickrus commented 1 year ago

Hi @Erickrus The latest chinese deepspeech ASR model deepspeech-0.9.3-models-zh-CN.pbmm might work. I'm trying it.

Have you tried this model? I found that the pbmm file format is not compatible with currently used deepspeech model.

Please notice .pbmm is not equal to .pb, you have to convert it manually from checkpoints. Of course you can rewrite the deepspeech feature part to be compatible to .pbmm format.

You can look into deepspeech.cc

JuneoXIE commented 1 year ago

@ashawkey Hi, sorry for bothering you again... I've trained on three different videos and tried three ASR models, including the default wav2vec, deepspeech 0.6.0, and jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn. However, I got reconstructions with completely static faces, so I guess the problem is not caused by the ASR model. Please give me some suggestions. Thank you!

This is one of my training videos (about 4 min): man_1.zip

This is the reconstruction using jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn: https://user-images.githubusercontent.com/38695396/219374822-147cd71c-b979-4dbb-9bca-aacc0891db09.mp4

ashawkey commented 1 year ago

@JuneoXIE The training video looks good, and I think the default wav2vec model should be able to work (at least not totally static). Could you provide the exact command line you use?

JuneoXIE commented 1 year ago

@JuneoXIE The training video looks good, and I think the default wav2vec model should be able to work (at least not totally static). Could you provide the exact command line you use?

Hi, thank you for the response! I double-checked the training parameters and found that I had mistakenly set the frame extraction rate to 30 fps while my input video had been converted to 25 fps. The reconstruction with static lips was caused by the misaligned training data...

The reconstruction using default wav2vec is good! https://user-images.githubusercontent.com/38695396/220499119-cb13a778-d6cb-42b0-9768-8cf5329ed80f.mp4
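
In case others hit the same mismatch, a quick sanity check (the filename is just an example) is to read the frame rate of the source video before extracting frames:

import cv2

cap = cv2.VideoCapture("man_1.mp4")   # example path; use your own training video
print(cap.get(cv2.CAP_PROP_FPS))      # should match the fps used for frame extraction (25 here)
cap.release()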

boolw commented 1 year ago

@JuneoXIE Hello, we also use the 'jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn' model, but the results are not satisfactory. I see that you had similar problems. How is your Chinese training working now? Looking forward to your reply.

flyingshan commented 1 year ago

Here is an idea for everyone: the features extracted by the ASR model are probabilities over characters rather than over "sounds". Chinese has many characters and the ASR model misrecognizes them easily, so the extracted features are weak. Converting the "characters" recognized by the ASR model into "pinyin", which is more closely related to the speech, or even into initials and finals, extracts more effective features for Chinese. My implementation: code. In my experiments this improves on the original; I hope it helps.
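
A rough sketch of the idea (this is not the linked code; pypinyin, the whole-syllable granularity, and all names below are my own assumptions for illustration):

import numpy as np
from pypinyin import lazy_pinyin

def build_char_to_pinyin(vocab):
    # vocab: list of characters/tokens from the ASR tokenizer
    syllables = sorted({lazy_pinyin(ch)[0] for ch in vocab})
    index = {p: i for i, p in enumerate(syllables)}
    mapping = np.array([index[lazy_pinyin(ch)[0]] for ch in vocab])
    return mapping, len(syllables)

def merge_logits(logits, mapping, num_pinyin):
    # logits: [T, num_chars] -> [T, num_pinyin], summing the columns of characters that share a pinyin
    merged = np.zeros((logits.shape[0], num_pinyin), dtype=logits.dtype)
    for char_idx, pinyin_idx in enumerate(mapping):
        merged[:, pinyin_idx] += logits[:, char_idx]
    return merged

Splitting each syllable further into initials and finals, as described above, would shrink the feature dimension even more.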

cmmclee commented 1 year ago

Here is an idea for everyone: the features extracted by the ASR model are probabilities over characters rather than over "sounds". Chinese has many characters and the ASR model misrecognizes them easily, so the extracted features are weak. Converting the "characters" recognized by the ASR model into "pinyin", which is more closely related to the speech, or even into initials and finals, extracts more effective features for Chinese. My implementation: code. In my experiments this improves on the original; I hope it helps.

I tried the method you provided, but the results did not improve. One more point: for polyphonic characters, this method introduces new errors. How were your experimental results? Is there anything I have misunderstood?

flyingshan commented 1 year ago

Here is an idea for everyone: the features extracted by the ASR model are probabilities over characters rather than over "sounds". Chinese has many characters and the ASR model misrecognizes them easily, so the extracted features are weak. Converting the "characters" recognized by the ASR model into "pinyin", which is more closely related to the speech, or even into initials and finals, extracts more effective features for Chinese. My implementation: code. In my experiments this improves on the original; I hope it helps.

I tried the method you provided, but the results did not improve. One more point: for polyphonic characters, this method introduces new errors. How were your experimental results? Is there anything I have misunderstood?

I haven't found a way to solve the polyphonic-character problem either. In my experiments, driving with this phoneme-like representation gave better sync, but in theory this method depends heavily on ASR accuracy, and the ASR accuracy of models--jonatasgrosman--wav2vec2-large-xlsr-53-chinese-zh-cn is not very high; inaccurate recognition of some speech may degrade the results.

cmmclee commented 1 year ago

@flyingshan Could you provide a demo video?

flyingshan commented 1 year ago

@flyingshan Could you provide a demo video?

Sorry, I experimented on a video I shot myself; I didn't get the subject's permission, so it's not convenient for me to post it.

huangxin168 commented 1 year ago

I also got a blinking-eyes result...

91xiaoyang commented 1 year ago

I am running -O --iters 250000 --finetune_lips and get an error: RuntimeError: Given input size: (192x2x2x2). Calculated output size: (192x0x0). Output size is too small. Have you encountered this before?

huangxin168 commented 1 year ago

I am running -O --iters 250000 --finetune_lips and get an error: RuntimeError: Given input size: (192x2x2x2). Calculated output size: (192x0x0). Output size is too small. Have you encountered this before?

Use try/except to skip it when the error condition is encountered.
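
A rough illustration of the try/except idea (the function and names below are hypothetical, not the repo's actual training loop):

def safe_finetune_epoch(trainer, loader):
    # wrap the per-batch step so a too-small lip crop doesn't kill the whole run
    for data in loader:
        try:
            trainer.train_step(data)  # stand-in for the actual finetune_lips step
        except RuntimeError as e:
            if "Output size is too small" in str(e):
                continue              # skip this sample and keep training
            raise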

91xiaoyang commented 1 year ago

I am running -O --iters 250000 --finetune_lips and get an error: RuntimeError: Given input size: (192x2x2x2). Calculated output size: (192x0x0). Output size is too small. Have you encountered this before?

If you encounter the error condition, use try/except to skip it.

But that way I won't be able to train. Is the camera in the video I'm using too far away?

91xiaoyang commented 1 year ago

I run: python main.py data/obama/ --workspace trial_obama_torso/ -O --torso --head_ckpt .pth --iters 200000

It says I am missing parameters. Does the .pth have to be the ngp.pth obtained after training for 20000 iterations?

huangxin168 commented 1 year ago

Hi @Erickrus The latest chinese deepspeech ASR model deepspeech-0.9.3-models-zh-CN.pbmm might work. I'm trying it.

Have you tried this model? I found that the pbmm file format is not compatible with currently used deepspeech model.

Please notice .pbmm is not equal to .pb, you have to convert it manually from checkpoints. Of course you can rewrite the deepspeech feature part to be compatible to .pbmm format.

You can look into deepspeech.cc

Do you know how to convert .pbmm to .pb from checkpoints?

Update: following this guide: https://docs.openvino.ai/latest/openvino_docs_MO_DG_prepare_model_convert_model_tf_specific_Convert_DeepSpeech_From_Tensorflow.html

I got right .pb on this command: python3 DeepSpeech.py --checkpoint_dir ../deepspeech-0.9.3-checkpoint --export_dir ../

but got an error for -zh-CN: python3 DeepSpeech.py --checkpoint_dir ../deepspeech-0.9.3-checkpoint-zh-CN --export_dir ../

WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
I Exporting the model...
I Loading best validating checkpoint from ../deepspeech-0.9.3-checkpoint-zh-CN/best_dev-408475
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_6/bias
Traceback (most recent call last):
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/DeepSpeech/DeepSpeech-0.9.3/training/deepspeech_training/train.py", line 982, in run_script
    absl.app.run(main)
  File "/root/miniconda3/envs/deepspeech/lib/python3.7/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/root/miniconda3/envs/deepspeech/lib/python3.7/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/DeepSpeech/DeepSpeech-0.9.3/training/deepspeech_training/train.py", line 962, in main
    export()
  File "/DeepSpeech/DeepSpeech-0.9.3/training/deepspeech_training/train.py", line 811, in export
    load_graph_for_evaluation(session)
  File "/DeepSpeech/DeepSpeech-0.9.3/training/deepspeech_training/util/checkpoints.py", line 151, in load_graph_for_evaluation
    _load_or_init_impl(session, methods, allow_drop_layers=False)
  File "/DeepSpeech/DeepSpeech-0.9.3/training/deepspeech_training/util/checkpoints.py", line 98, in _load_or_init_impl
    return _load_checkpoint(session, ckpt_path, allow_drop_layers, allow_lr_init=allow_lr_init)
  File "/DeepSpeech/DeepSpeech-0.9.3/training/deepspeech_training/util/checkpoints.py", line 71, in _load_checkpoint
    v.load(ckpt.get_tensor(v.op.name), session=session)
  File "/root/miniconda3/envs/deepspeech/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/deepspeech/lib/python3.7/site-packages/tensorflow_core/python/ops/variables.py", line 1033, in load
    session.run(self.initializer, {self.initializer.inputs[1]: value})
  File "/root/miniconda3/envs/deepspeech/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/root/miniconda3/envs/deepspeech/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1156, in _run
    (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (256,) for Tensor 'layer_6/bias/Initializer/zeros:0', which has shape '(29,)'

Erickrus commented 1 year ago

Hi @Erickrus The latest chinese deepspeech ASR model deepspeech-0.9.3-models-zh-CN.pbmm might work. I'm trying it.

Have you tried this model? I found that the pbmm file format is not compatible with currently used deepspeech model.

Please notice .pbmm is not equal to .pb, you have to convert it manually from checkpoints. Of course you can rewrite the deepspeech feature part to be compatible to .pbmm format. You can look into deepspeech.cc

Do you know how to convert .pbmm to .pb from checkpoints?

Update: following this guide: https://docs.openvino.ai/latest/openvino_docs_MO_DG_prepare_model_convert_model_tf_specific_Convert_DeepSpeech_From_Tensorflow.html

I got right .pb on this command: python3 DeepSpeech.py --checkpoint_dir ../deepspeech-0.9.3-checkpoint --export_dir ../

...

I failed to convert the pbmm model to pb.

But there's another way to do it. Check deepspeech.cc and you will find how the features are calculated. Save the features to a binary file (of course, you need to rebuild the DeepSpeech solution), then read the binary file with numpy. The data should go all the way through the windowing process, and finally you can get the corresponding features from the latest DeepSpeech model. Hopefully the above explains the approach.
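
A possible way to read such a dump back on the Python side (the filename, dtype, and shape below are assumptions; they depend on how you write the buffer out in deepspeech.cc):

import numpy as np

feats = np.fromfile("ds_features.bin", dtype=np.float32)  # assumed dump written by the modified deepspeech.cc
feats = feats.reshape(-1, 16, 29)                          # assumed [frames, window, n_classes] layout
print(feats.shape)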

91xiaoyang commented 1 year ago

jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn Hello, your video is great! Have you gotten any results from your optimization? Can you show me specifically how to replace the ASR model?

ShiJiaying commented 1 year ago

Hi @Erickrus The latest chinese deepspeech ASR model deepspeech-0.9.3-models-zh-CN.pbmm might work. I'm trying it.

Have you tried this model? I found that the pbmm file format is not compatible with currently used deepspeech model.

Please notice .pbmm is not equal to .pb, you have to convert it manually from checkpoints. Of course you can rewrite the deepspeech feature part to be compatible to .pbmm format. You can look into deepspeech.cc

Do you know how to convert .pbmm to .pb from checkpoints? Update: following this guide: https://docs.openvino.ai/latest/openvino_docs_MO_DG_prepare_model_convert_model_tf_specific_Convert_DeepSpeech_From_Tensorflow.html I got right .pb on this command: python3 DeepSpeech.py --checkpoint_dir ../deepspeech-0.9.3-checkpoint --export_dir ../ ...

I failed to convert pbmm to pb model.

But, there's another way to do it. Please check the deepspeech.cc, you will find how features are calculated. Save features to a binary file. Of course you need re-build deepspeech solution. Then you can read the binary file with numpy. The data should go all the way through the windowing process and finally you can get the corresponding features from the latest deepspeech model. Hopefully, above explains the approach.

@Erickrus Hi, you can refer to this link https://github.com/FengYen-Chang/DeepSpeech.OpenVINO/issues/4#issue-1733282125 and get Chinese deepspeech features.