TensorSpeech / TensorFlowTTS

TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supports English, French, Korean, Chinese, and German, and is easy to adapt to other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

Multi-speaker TTS #466

Closed arunraja-hub closed 3 years ago

arunraja-hub commented 3 years ago

In Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron, the authors have implemented multi-speaker models by conditioning on speaker identity.

However, in the E2E-TensorFlowTTS demo Colab notebook, the example only produces audio output for a single speaker.

How can I generate multi-speaker audio outputs using TensorFlowTTS? Is there a Colab notebook that illustrates this? Any code or implementation advice would be useful.

ZDisket commented 3 years ago

@arunraja-hub The mel generation models can do multi-speaker; you just have to modify the dataloader. You can see how I did it for FastSpeech2 here. That version expects a `<utt_id>-speakers.npy` file for every utterance ID, containing `[speaker_id]` as a NumPy int32, in a dedicated folder in each dump (val, train). You also have to modify the config for training and use the appropriate speaker ID at inference time. You'll have to write the script that outputs those files yourself. There's also a multi-speaker LibriTTS example here: https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/fastspeech2_libritts
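A minimal sketch of such a script (the helper name, folder layout, and utterance-to-speaker mapping here are just illustrative; only the per-utterance `<utt_id>-speakers.npy` file holding `[speaker_id]` as int32 follows what I described above):

```python
import os
import numpy as np

def write_speaker_files(dump_dir, utt_to_speaker, out_subdir="speakers"):
    """Write <utt_id>-speakers.npy for every utterance, each containing
    [speaker_id] as a NumPy int32, inside dump_dir/out_subdir."""
    out_dir = os.path.join(dump_dir, out_subdir)
    os.makedirs(out_dir, exist_ok=True)
    for utt_id, speaker_id in utt_to_speaker.items():
        np.save(
            os.path.join(out_dir, f"{utt_id}-speakers.npy"),
            np.array([speaker_id], dtype=np.int32),
        )

# Example: map each utterance ID to its speaker ID, then write files
# for the train and val dumps separately.
write_speaker_files("dump/train", {"LJ001-0001": 0, "LJ001-0002": 0, "spk2_0001": 1})
write_speaker_files("dump/valid", {"LJ050-0278": 0, "spk2_0456": 1})
```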

arunraja-hub commented 3 years ago

Thank you very much! Another question: is it possible to produce multi-speaker data ourselves, or is the required compute only feasible for Google? I just want to set my expectations right regarding producing multi-speaker data for the MGB-1 dataset.

ZDisket commented 3 years ago

@arunraja-hub By producing data I assume you mean training a model. And no, training doesn't require that much computing power; a single V100, which you can rent at cheap hourly rates from various sites, is enough.

YoLi-sw commented 3 years ago

@ZDisket @dathudeptrai I modified the processor, mixed the LJSpeech and Baker datasets as English and Chinese data, and assigned them speaker IDs (Chinese is 0, English is 1).

[screenshot: Selection_001, the modified processor]
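Roughly, the speaker tagging looks like this (a simplified sketch, not the actual TensorFlowTTS processor code; the helper and item layout are only illustrative):

```python
def tag_speakers(baker_items, ljspeech_items):
    """baker_items / ljspeech_items: lists of (utt_id, wav_path, text) tuples.
    Returns one combined list with a speaker_id attached to each utterance."""
    tagged = []
    for utt_id, wav_path, text in baker_items:
        tagged.append({"utt_id": utt_id, "wav_path": wav_path,
                       "text": text, "speaker_id": 0})  # 0 = Chinese (Baker)
    for utt_id, wav_path, text in ljspeech_items:
        tagged.append({"utt_id": utt_id, "wav_path": wav_path,
                       "text": text, "speaker_id": 1})  # 1 = English (LJSpeech)
    return tagged

# Dummy example entries, just to show the shape of the data:
items = tag_speakers(
    [("000001", "Baker/Wave/000001.wav", "这是一个中文示例。")],
    [("LJ001-0001", "LJSpeech/wavs/LJ001-0001.wav", "Printing, in the only sense.")],
)
```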

I modified the Tacotron2 model following the multi-speaker LibriTTS example, so the speaker ID is now passed into the training process. [screenshot: Selection_002, the modified training code]

But whether I pass speaker ID [0] or [1] at inference, I only get the voice of speaker ID 0, and the English output is totally incorrect.
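For reference, my inference call looks roughly like this (a sketch based on the standard Tacotron2 demo notebook; `processor` and `tacotron2` are assumed to already be loaded as my custom mixed-language processor and the trained multi-speaker model, and only `speaker_ids` is varied):

```python
import tensorflow as tf

# processor / tacotron2 are assumed to be loaded beforehand, as in the demo notebook.
input_ids = processor.text_to_sequence("this is a multi speaker test.")

decoder_output, mel_outputs, stop_token_prediction, alignment_history = tacotron2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    input_lengths=tf.convert_to_tensor([len(input_ids)], dtype=tf.int32),
    speaker_ids=tf.convert_to_tensor([1], dtype=tf.int32),  # 1 = English, but the output still sounds like speaker 0
)
```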

What did I do wrong? Any suggestions? Any help resolving this would be greatly appreciated!

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.