NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0
1.54k stars, 369 forks

How to do inference of Tacotron GST? #345

Closed GabrielLin closed 5 years ago

GabrielLin commented 5 years ago

I followed the steps in https://nvidia.github.io/OpenSeq2Seq/html/speech-synthesis/tacotron-2-gst.html to train Tacotron GST, but I do not know how to do inference. I encounter errors when following the steps at https://nvidia.github.io/OpenSeq2Seq/html/speech-synthesis.html#inference

Thanks.

blisc commented 5 years ago

Good catch, the docs have not been updated for Tacotron GST inference. I'll update them.

In Tacotron GST, the first parameter in your csv is actually used and must point to a wav file that you want to condition the style on. So you have to change the following line

UNUSED | UNUSED | This is an example sentence that I want to generate.

to

path/to/style.wav | UNUSED | This is an example sentence that I want to generate.
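The csv change described above can be sketched as a small helper that builds the inference rows. This is only an illustration, not part of OpenSeq2Seq: the function name `build_infer_rows` and the output file name are hypothetical, and the sketch assumes the three pipe-separated fields shown above with no header row. The thread does not say whether spaces around the `|` are required, so the rows here are written without them, as in the M-AILABS metadata format.

```python
from pathlib import Path


def build_infer_rows(style_wav, sentences):
    """Return one 'style_wav|UNUSED|sentence' row per sentence.

    Hypothetical helper: the first field is the style wav to condition on,
    the second field is a placeholder, and the third is the text to synthesize.
    """
    rows = []
    for sentence in sentences:
        # A literal pipe or newline in the text would corrupt the row layout.
        if "|" in sentence or "\n" in sentence:
            raise ValueError("sentence must not contain '|' or a newline")
        rows.append("|".join([style_wav, "UNUSED", sentence]))
    return rows


def write_infer_csv(csv_path, style_wav, sentences):
    """Write the rows to csv_path, one per line (assumes no header row)."""
    rows = build_infer_rows(style_wav, sentences)
    Path(csv_path).write_text("\n".join(rows) + "\n", encoding="utf-8")


# Example: one style wav shared by every sentence to generate.
example_rows = build_infer_rows(
    "path/to/style.wav",
    ["This is an example sentence that I want to generate."],
)
for row in example_rows:
    print(row)
```

Calling `write_infer_csv("infer.csv", ...)` (a hypothetical file name) would then give a file that the config's inference data layer can be pointed at. How that path is set in the config is not covered in this thread.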

GabrielLin commented 5 years ago

Thank you @blisc. Could you please upload a sample style wav file? I used a file from M-AILABS:

en_US/by_book/female/judy_bieber/the_sea_fairies/wavs/the_sea_fairies_08_f000056|UNUSED|It is very strange for the generated sounds.

and run python run.py --config_file=example_configs/text2speech/tacotron_gst.py --mode=infer --logdir=result/tacotron-gst-8gpu/logs --infer_output_file=unused

The generated file is very strange. Its content is totally different from the sentence 'This is an example sentence that I want to generate.'

blisc commented 5 years ago

Here is the generated sample I got using your example and the checkpoint provided in the docs.

I cannot say that the generated audio is strange. Can you upload an audio example?

GabrielLin commented 5 years ago

With the pre-trained model, I did not run into any problems. But when I trained the model myself, the generated result was very strange. Let me train it again and write down detailed steps.

blisc commented 5 years ago

See #302 for advice on training tacotron GST

GabrielLin commented 5 years ago

I trained my model on another machine and got the same result. I also read #302. I think my issue is caused by changing the batch_size to 16, since my 1080 Ti does not have enough memory. Is that the reason? Thanks.

blisc commented 5 years ago

According to #302, batch size does seem to make a difference. Try adding iter_size to the config.py and setting it to 2 to simulate batch size 32.

I would also recommend training without an eval set. Use all data as training data.
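To illustrate what "simulating" batch size 32 with iter_size 2 means, here is a minimal gradient accumulation sketch in plain Python. It is not OpenSeq2Seq's implementation. The toy model (a single weight `w` fitted with mean squared error) and the function names are assumptions made for the example. It shows that averaging the gradients of two micro-batches of 16 gives the same gradient as one batch of 32 when the loss is a per-example mean.

```python
def mse_gradient(w, batch):
    """Gradient of the mean squared error of y ~ w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(w, batch, iter_size):
    """Average the gradients of iter_size equally sized micro-batches."""
    if len(batch) % iter_size != 0:
        raise ValueError("batch size must be divisible by iter_size")
    micro = len(batch) // iter_size
    grads = [
        mse_gradient(w, batch[i * micro:(i + 1) * micro])
        for i in range(iter_size)
    ]
    # One optimizer step would be taken with this averaged gradient.
    return sum(grads) / iter_size


# Toy data: 32 examples, i.e. the batch size being simulated.
data = [(float(i), 3.0 * i + 1.0) for i in range(32)]

full_batch_grad = mse_gradient(0.5, data)              # one batch of 32
accum_grad = accumulated_gradient(0.5, data, iter_size=2)  # two batches of 16

print(abs(full_batch_grad - accum_grad) < 1e-9)
```

The equivalence only covers the gradient itself. Anything computed from per-batch statistics, such as batch normalization, would still see 16 examples at a time, so accumulated training is not guaranteed to behave exactly like a true batch of 32.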

GabrielLin commented 5 years ago

I am afraid the result is the same when training with iter_size set to 2 to simulate batch size 32. I did not use an eval set; I have used all the data as training data from the very beginning.

mrgloom commented 5 years ago

What are the requirements for style.wav? Can it contain an arbitrary sentence? What is the minimum or recommended length of the style.wav file?

blisc commented 5 years ago

Ideally, style.wav should be similar to the wavs you conditioned Tacotron GST on during training. It would be interesting to experiment with different wavs and see which parameters are most important.