as-ideas / ForwardTacotron

⏩ Generating speech in a single forward pass without any attention!
https://as-ideas.github.io/ForwardTacotron/
MIT License

Multispeaker and new neural voice creation #88

Open kafan1986 opened 2 years ago

kafan1986 commented 2 years ago

I used the FastPitch model to generate TTS for a known speaker. Can I extend this model to multiple speakers by using speaker embeddings? If yes, can the solution then be extended to fine-tune and mimic a new voice on limited audio data? Has anyone experimented along this path?

cschaefer26 commented 1 year ago

Hi, just to let you know, I am currently working on a multispeaker implementation that will be live soon. Fine-tuning is possible with about 5 minutes of fresh data.

kafan1986 commented 1 year ago

@cschaefer26 I can see you are actively developing multi-speaker implementation in one of the branches. Is it at a stage where I can experiment with it or should I wait some more?

cschaefer26 commented 1 year ago

Hi, yeah, I am currently implementing it in the branch below:

https://github.com/as-ideas/ForwardTacotron/tree/feature/multispeaker

It's probably going to be ready in two weeks or so. I am currently testing it on the VCTK dataset and cannot yet guarantee it is working properly. It could be worth a try if you like: training is implemented, and inference will come soon. Use the multispeaker.yaml config; it supports vctk and a variant of the ljspeech format (selectable via preprocessing.audio_format). For the ljspeech format it expects rows as: id|speaker_id|text
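
For illustration, here is a minimal sketch of reading a metadata file in that pipe-delimited format. The file name metadata.csv and the helper function are hypothetical, not the repo's actual preprocessing code:

```python
from pathlib import Path

def read_multispeaker_metadata(path: str) -> list[tuple[str, str, str]]:
    """Return (id, speaker_id, text) tuples from a pipe-delimited metadata file."""
    rows = []
    for line in Path(path).read_text(encoding='utf-8').splitlines():
        if not line.strip():
            continue
        # Split into at most three fields so any '|' inside the text is preserved.
        file_id, speaker_id, text = line.split('|', maxsplit=2)
        rows.append((file_id, speaker_id, text))
    return rows

# Hypothetical usage:
# rows = read_multispeaker_metadata('metadata.csv')
```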

kafan1986 commented 1 year ago

@cschaefer26 Thanks for the update. I will wait another two weeks before experimenting with it; GPU time is expensive at my end. But I think you have only built the multi-speaker TTS with ForwardTacotron and not with FastPitch, is that so? In my previous experiments FastPitch gave slightly better output quality than ForwardTacotron, so could we get a FastPitch version of the same? Thanks again for all your work.

cschaefer26 commented 1 year ago

Hi, yeah, I am going to implement both (ForwardTaco first, then FastPitch). In my experience ForwardTaco actually performs better, but it may depend on the dataset...

debasish-mihup commented 1 year ago

@cschaefer26 I can see you are still experimenting across multiple branches. Could you add a provision for passing emotion as a parameter, so that in addition to the speaker embedding provided during training, I could also supply the emotion type of the audio segment? Where this emotion information is not available, it could be assumed to be "neutral".

kafan1986 commented 1 year ago

@cschaefer26 Is the multispeaker branch ready for testing? Also, can you create a branch with FastPitch?

cschaefer26 commented 1 year ago

Hi, multispeaker is merged and ready for testing. I tested it on a custom dataset, but as always with such large merges there may be bugs - please let me know if you find anything fishy. My colleague @alexteua will work on implementing FastPitch from next week.

@debasish-mihup Currently there is no plan to support emotion conditioning in the vanilla version, but it should be easy to add in a branch if you like. Hint: you can simply concatenate it to the speaker embedding. I would be curious whether you are experimenting with an annotated dataset?
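
As a rough illustration of that hint, here is a minimal sketch of concatenating a learned emotion embedding to the speaker embedding; the module name, dimensions, and label set are assumptions, not part of the repo:

```python
import torch
import torch.nn as nn

class SpeakerEmotionConditioning(nn.Module):
    """Hypothetical conditioning module: speaker embedding + emotion embedding."""

    def __init__(self, num_speakers: int, num_emotions: int,
                 speaker_dim: int = 256, emotion_dim: int = 64):
        super().__init__()
        self.speaker_embedding = nn.Embedding(num_speakers, speaker_dim)
        # Index 0 could be reserved for "neutral" when no emotion label is available.
        self.emotion_embedding = nn.Embedding(num_emotions, emotion_dim)

    def forward(self, speaker_id: torch.Tensor, emotion_id: torch.Tensor) -> torch.Tensor:
        # Concatenate both embeddings into a single conditioning vector.
        return torch.cat([self.speaker_embedding(speaker_id),
                          self.emotion_embedding(emotion_id)], dim=-1)

# Hypothetical usage:
# cond = SpeakerEmotionConditioning(num_speakers=110, num_emotions=5)
# vec = cond(torch.tensor([3]), torch.tensor([0]))  # emotion 0 = neutral
# vec.shape -> torch.Size([1, 320])
```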

rmcpantoja commented 1 year ago

Hi @cschaefer26, congratulations on finishing the multispeaker work. I would like to try this new multispeaker ForwardTacotron to build a pretrained model with more than 15 Spanish speakers. Each speaker has their own dataset, so merging them all into one would be a good idea. Each dataset ranges from a minimum of 10 minutes to a maximum of one hour and 30 minutes. How many hours are needed, at minimum, to make a decent model?
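
As a side note, here is a minimal sketch of how such single-speaker datasets could be merged into one metadata file with id|speaker_id|text rows as described above; the directory layout, file names, and the assumption of id|text rows per speaker are hypothetical:

```python
from pathlib import Path

def merge_datasets(dataset_dirs: dict[str, str], out_path: str) -> None:
    """dataset_dirs maps a speaker_id to a folder containing a metadata.csv
    with id|text rows; writes one combined id|speaker_id|text file."""
    lines = []
    for speaker_id, folder in dataset_dirs.items():
        for row in (Path(folder) / 'metadata.csv').read_text(encoding='utf-8').splitlines():
            if not row.strip():
                continue
            file_id, text = row.split('|', maxsplit=1)
            lines.append(f'{file_id}|{speaker_id}|{text}')
    Path(out_path).write_text('\n'.join(lines), encoding='utf-8')

# Hypothetical usage:
# merge_datasets({'es_speaker_01': 'data/es_01', 'es_speaker_02': 'data/es_02'},
#                'metadata_multispeaker.csv')
```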

kafan1986 commented 1 year ago

@cschaefer26 @alexteua Thanks for the multispeaker variant. Is there any progress on the FastPitch version? I could not find a working branch for it. Also, if I want the model to work decently for an unseen speaker, what would be the usual number of speakers in the training data for each gender, and how many hours per speaker? Any idea based on your experiments?

alexteua commented 1 year ago

Hi @kafan1986, the FastPitch version is coming in the following days.

alexteua commented 1 year ago

@kafan1986 Multispeaker FastPitch is ready to use (#95).