Closed by arunraja-hub 3 years ago
@arunraja-hub The mel generation models can do multi-speaker; you just have to modify the dataloader. You can see how I did it for FastSpeech2 here. This one expects a `-speakers.npy`
file for every utt id, containing `[speaker_id]`
as a NumPy int32 array, in a dedicated folder in each dump (train, val). You also have to modify the config for training and pass the appropriate speaker id at inference time. You'll have to write the script that outputs those files yourself.
There's also a multispeaker LibriTTS example here: https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/fastspeech2_libritts
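The script that writes those per-utterance speaker files might look like the sketch below. The metadata format, folder name, and speaker mapping here are assumptions for illustration, not part of TensorFlowTTS itself; only the `<utt_id>-speakers.npy` naming and int32 content come from the comment above.

```python
# Hypothetical sketch: write one "<utt_id>-speakers.npy" file per utterance,
# containing the speaker id as an int32 array, into each dump folder.
import os
import numpy as np

def write_speaker_files(metadata, dump_dir):
    """metadata: iterable of (utt_id, speaker_id) pairs.

    Writes [speaker_id] as numpy int32 into a dedicated subfolder
    of the given dump directory (e.g. ./dump/train, ./dump/valid).
    """
    out_dir = os.path.join(dump_dir, "speakers")  # folder name is an assumption
    os.makedirs(out_dir, exist_ok=True)
    for utt_id, speaker_id in metadata:
        np.save(
            os.path.join(out_dir, f"{utt_id}-speakers.npy"),
            np.array([speaker_id], dtype=np.int32),
        )

# Example: two speakers, ids 0 and 1 (utt ids here are made up).
write_speaker_files([("LJ001-0001", 0), ("BAKER-0001", 1)], "./dump/train")
```

You would run this once per dump split (train and val) before training, with your own mapping from utterance id to speaker id.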
Thank you very much! Another question: is it possible to produce multi-speaker data ourselves, or is the required compute power feasible only for Google? I just want to set my expectations right for producing multi-speaker data for the MGB-1 dataset.
@arunraja-hub By producing data I assume you mean training a model. And no, training doesn't require that much computing power: a single V100, which you can rent at cheap hourly rates from various providers, is enough.
@ZDisket @dathudeptrai I modified the processor, mixed the LJSpeech and Baker datasets as English and Chinese data, and assigned them speaker IDs (Chinese is 0, English is 1).
I modified the Tacotron2 model following the multi-speaker LibriTTS example, so the speaker id is added to the training process.
But whether I pass speaker ID [0] or [1] at inference, I only get the voice of speaker 0, and the English output is completely wrong.
What did I do wrong? Any suggestions? Any help resolving this would be greatly appreciated!
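Conceptually, speaker conditioning only changes the output if the id actually reaches the model: the id indexes a learned embedding table, and the chosen vector is combined with the encoder states. The minimal NumPy sketch below illustrates this; the shapes, the concatenation strategy, and the random table values are placeholders, not TensorFlowTTS internals. If two different speaker ids yield identical outputs, as in the symptom described above, the id is likely not being wired into the model at inference.

```python
import numpy as np

# Minimal sketch of speaker conditioning: the speaker id indexes an
# embedding table, and the selected vector is broadcast along time and
# concatenated onto the encoder states. All shapes are illustrative.
rng = np.random.default_rng(0)
n_speakers, embed_dim = 2, 4
speaker_table = rng.normal(size=(n_speakers, embed_dim))  # placeholder weights

def condition_on_speaker(encoder_states, speaker_id):
    """encoder_states: [time, hidden] -> [time, hidden + embed_dim]."""
    spk = speaker_table[speaker_id]                                   # [embed_dim]
    spk = np.broadcast_to(spk, (encoder_states.shape[0], embed_dim))  # [time, embed_dim]
    return np.concatenate([encoder_states, spk], axis=-1)

enc = rng.normal(size=(5, 8))           # fake encoder output: 5 frames, 8 dims
out0 = condition_on_speaker(enc, 0)
out1 = condition_on_speaker(enc, 1)

# Different speaker ids must produce different conditioned states;
# identical outputs for ids 0 and 1 would mean the id is ignored.
assert not np.allclose(out0, out1)
```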
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
In Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron, the authors implement multi-speaker models by conditioning on speaker identity.
However, in the E2E-TensorflowTTS demo Colab notebook, the example only gives audio output for one speaker.
How can I generate multi-speaker audio outputs using TensorflowTTS? Is there a Colab notebook that illustrates this? Any code and implementation advice would be useful.