NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

Nice Results from Tacotron #382

Closed SibtainRazaJamali closed 5 years ago

SibtainRazaJamali commented 5 years ago

I want to use this model to train on a multi-speaker dataset like VCTK. Please guide me on how to modify the configuration files.

oytunturk commented 5 years ago

Hi,

The first step is to include your own dataset in:

https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_configs/text2speech/tacotron_gst.py

and also prepare train/validation/test CSV files similar to the existing datasets. Make sure the FFT size and the number of spectral features make sense for your sampling rate. There are some hard-coded values which make them a bit difficult to track, but you should be able to figure them out if you check:

https://github.com/NVIDIA/OpenSeq2Seq/blob/master/open_seq2seq/data/text2speech/text2speech.py
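As a quick sanity check (outside OpenSeq2Seq, and assuming VCTK's native 48 kHz audio), something like the sketch below can help you pick an FFT size and confirm the mel feature count before editing the config. The file path and window choices are only illustrative:

```python
# Illustrative sanity check, not part of OpenSeq2Seq: confirm that the FFT size
# and number of mel features you plan to put in the config are reasonable for
# the dataset's sampling rate. VCTK is 48 kHz, while the default LJSpeech setup
# assumes 22.05 kHz.
import numpy as np
import librosa

wav_path = "VCTK-Corpus/wav48/p225/p225_001.wav"  # hypothetical example file
audio, sr = librosa.load(wav_path, sr=None)       # keep the native sampling rate
print("sampling rate:", sr)

# A window of roughly 50 ms is a common choice; round up to a power of two.
win_length = int(0.05 * sr)
n_fft = int(2 ** np.ceil(np.log2(win_length)))
hop_length = n_fft // 4

mel = librosa.feature.melspectrogram(
    y=audio, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=80)
print("n_fft:", n_fft, "frames:", mel.shape[1], "mel bins:", mel.shape[0])
```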

If your inputs are characters and you have any characters not covered by:

https://github.com/NVIDIA/OpenSeq2Seq/blob/master/open_seq2seq/test_utils/vocab_tts.txt

you'll want to create a new vocab that covers them. The easiest way would be to write a script that automatically generates the CSV files and vocab from your dataset.
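For example, a rough sketch of such a script for a VCTK-style layout (`txt/` and `wav48/` directories) could look like this. The pipe-separated `filename|transcript|transcript` rows mirror LJSpeech's metadata.csv; double-check the exact column layout that text2speech.py and vocab_tts.txt expect before relying on it:

```python
# Hypothetical helper, not part of OpenSeq2Seq: build train/val CSV files plus
# a character vocab from a VCTK-style directory of wavs and matching .txt
# transcripts.
import os
import random

VCTK_ROOT = "VCTK-Corpus"   # hypothetical dataset location
OUT_DIR = "vctk_csv"
os.makedirs(OUT_DIR, exist_ok=True)

rows, chars = [], set()
txt_root = os.path.join(VCTK_ROOT, "txt")
for speaker in sorted(os.listdir(txt_root)):
    for txt_file in sorted(os.listdir(os.path.join(txt_root, speaker))):
        utt_id = os.path.splitext(txt_file)[0]
        wav_path = os.path.join(VCTK_ROOT, "wav48", speaker, utt_id + ".wav")
        if not os.path.exists(wav_path):
            continue
        with open(os.path.join(txt_root, speaker, txt_file)) as f:
            text = f.read().strip()
        rows.append("{}|{}|{}".format(wav_path, text, text))
        chars.update(text)

random.seed(0)
random.shuffle(rows)
n_val = max(1, len(rows) // 100)
splits = {"train.csv": rows[n_val:], "val.csv": rows[:n_val]}
for name, split in splits.items():
    with open(os.path.join(OUT_DIR, name), "w") as f:
        f.write("\n".join(split) + "\n")

# One symbol per line; check the format of
# open_seq2seq/test_utils/vocab_tts.txt before using this as-is.
with open(os.path.join(OUT_DIR, "vocab.txt"), "w") as f:
    for c in sorted(chars):
        f.write(c + "\n")
```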

SibtainRazaJamali commented 5 years ago

Thank you, sir, for the quick response. But you have not given me any information on how to pass speaker information. How do we choose the number-of-speakers parameter, and how do we run inference for each different speaker?

oytunturk commented 5 years ago

During inference, you pass any wav file from the target speaker for conditioning, along with the text you want to synthesize. There is no need to pass speaker labels during training, since the training wav files are used to learn 'global style tokens', which will hopefully capture speaker identities in this case.

Please check the Tacotron GST paper for details: https://arxiv.org/abs/1803.09017
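To make the inference step concrete, here is a rough sketch. The `style_input` key and the run.py flags are assumptions from memory, not verified against the repo, so check tacotron_gst.py and the docs for the exact names:

```python
# Sketch of the inference-side config with assumed parameter names (verify
# against example_configs/text2speech/tacotron_gst.py). The infer CSV holds the
# text to synthesize; a reference wav from the target speaker conditions the
# style tokens.
infer_params = {
    "data_layer_params": {
        # text to synthesize, same pipe-separated layout as the training CSVs
        "dataset_files": ["vctk_csv/infer.csv"],
        # reference utterance from the target speaker (assumed key name)
        "style_input": "VCTK-Corpus/wav48/p226/p226_003.wav",
        "shuffle": False,
    },
}

# Inference itself is launched with the repo's run.py, for example:
#   python run.py --config_file=example_configs/text2speech/tacotron_gst.py \
#                 --mode=infer --infer_output_file=samples
```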

SibtainRazaJamali commented 5 years ago

I got it, thank you.

SibtainRazaJamali commented 5 years ago

I have tried to use VCTK with 109 speakers and the existing tacotron_gst configuration, but I am not getting nice results. [attached image]

blisc commented 5 years ago

109 speakers is a lot. I'm not surprised that it is unable to learn the variety of speaking and recording styles. If you do manage to get it to work, please open a PR, because that would be a cool addition to our speech synthesis models!