Closed: SibtainRazaJamali closed this issue 5 years ago
Hi,
First step is to include your own dataset in:
https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_configs/text2speech/tacotron_gst.py
and also prepare train/validation/test csv files similar to the existing datasets. Make sure the FFT size and the number of spectral features make sense for your sampling rate. There are some hard-coded values which make this a bit difficult to track, but you should be able to figure them out if you check:
https://github.com/NVIDIA/OpenSeq2Seq/blob/master/open_seq2seq/data/text2speech/text2speech.py
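Since the STFT parameters are usually derived from the sampling rate, a quick sanity check can help before editing the config. A minimal sketch of one common rule of thumb (a ~50 ms window and ~12.5 ms hop, with n_fft rounded up to a power of two) — these numbers are an illustrative assumption, not what text2speech.py hard-codes:

```python
def stft_params(sampling_rate, win_ms=50.0, hop_ms=12.5):
    """Suggest STFT parameters for a given sampling rate.

    Assumes a ~50 ms analysis window and ~12.5 ms hop; round n_fft up
    to the next power of two so the FFT covers the whole window.
    """
    win_length = int(sampling_rate * win_ms / 1000)
    hop_length = int(sampling_rate * hop_ms / 1000)
    n_fft = 1
    while n_fft < win_length:
        n_fft *= 2
    return n_fft, win_length, hop_length

# 22050 Hz (LJSpeech-like) vs. 48000 Hz (VCTK's native rate)
print(stft_params(22050))  # (2048, 1102, 275)
print(stft_params(48000))  # (4096, 2400, 600)
```

If you keep VCTK at 48 kHz rather than downsampling, every spectrogram-related constant in the config needs the same kind of rescaling.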
If your inputs are characters and if you have any characters not covered by:
https://github.com/NVIDIA/OpenSeq2Seq/blob/master/open_seq2seq/test_utils/vocab_tts.txt
you'd want to create a new vocab that covers them. The easiest way would be to write a script that automatically generates the csv files + vocab given your dataset.
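A minimal sketch of such a generator script, assuming your dataset is a list of (wav_path, transcript) pairs and an LJSpeech-style '|'-separated csv layout — the exact column layout and split fractions are assumptions, so adjust them to what text2speech.py expects:

```python
import csv
import random

def write_splits(pairs, out_prefix, val_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle (wav_path, transcript) pairs and write train/val/test csv files."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_test = max(1, int(len(pairs) * test_frac))
    n_val = max(1, int(len(pairs) * val_frac))
    splits = {
        "test": pairs[:n_test],
        "val": pairs[n_test:n_test + n_val],
        "train": pairs[n_test + n_val:],
    }
    for name, rows in splits.items():
        with open(f"{out_prefix}_{name}.csv", "w", newline="") as f:
            csv.writer(f, delimiter="|").writerows(rows)

def build_vocab(pairs):
    """Collect every character that appears in the transcripts."""
    return sorted({ch for _, text in pairs for ch in text})

pairs = [("wavs/p225_001.wav", "hello"), ("wavs/p226_001.wav", "world")]
print(build_vocab(pairs))  # ['d', 'e', 'h', 'l', 'o', 'r', 'w']
```

Writing the vocab out in the same format as vocab_tts.txt (and pointing the config at it) is then a one-line loop over `build_vocab`'s result.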
On Fri, Mar 15, 2019 at 9:24 AM Sibtain Raza Jamali < notifications@github.com> wrote:
I want to use this model to train on multi speaker dataset like VCTK. Please guide me to modify the configuration files.
Thank you, sir, for the quick response. But you have not given me any information on how to pass speaker information. How do we choose the number-of-speakers parameter, and how do we run inference for each different speaker?
During inference, you pass any wav file from the target speaker for conditioning, along with the text you want to synthesize. There is no need to pass speaker labels during training, since the training wav files are used to learn 'global style tokens', which will hopefully capture speaker identities in this case.
Please check the Tacotron GST paper for details: https://arxiv.org/abs/1803.09017
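Conceptually, the GST layer condenses the reference wav into a style embedding by attending over a small bank of learned tokens (the paper uses multi-head attention; the single-head dot-product version below is a simplification). A NumPy sketch where all names and shapes are illustrative assumptions, not OpenSeq2Seq's actual API:

```python
import numpy as np

def style_embedding(ref_encoding, tokens):
    """Combine learned style tokens into one embedding.

    ref_encoding: (d,) summary vector of the reference wav,
                  produced by a reference encoder (hypothetical here).
    tokens:       (num_tokens, d) bank of learned global style tokens.
    """
    scores = tokens @ ref_encoding            # similarity to each token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax attention weights
    return weights @ tokens                   # weighted sum of tokens

rng = np.random.default_rng(0)
tokens = rng.standard_normal((10, 4))         # 10 tokens, 4-dim embeddings
emb = style_embedding(rng.standard_normal(4), tokens)
print(emb.shape)  # (4,)
```

At inference time the reference wav determines the attention weights, which is why a clip from the target speaker is enough to select that speaker's style, with no explicit label.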
I got it, thank you.
I have tried using VCTK with all 109 speakers and the existing tacotron_gst configuration, but I am not getting good results.
109 speakers is a lot. I am not surprised that the model is unable to learn that variety of speaking and recording styles. If you do manage to get it to work, please open a PR, because that would be a cool addition to our speech synthesis models!