as-ideas / ForwardTacotron

⏩ Generating speech in a single forward pass without any attention!
https://as-ideas.github.io/ForwardTacotron/
MIT License
577 stars 113 forks

A couple of questions #37

Open jmasterx opened 3 years ago

jmasterx commented 3 years ago

Hi!

I have tried the latest version and I am quite pleased with the results; there is some great progress happening on this repository!

I am using 7,000 samples of my own voice at 48 kHz.

I am very happy with pronunciation.

I had a couple of questions: when I synthesize many sentences together, it does not seem to pause between them and sounds like it is rushing through the sentences. Is this normal, and is there a workaround? My current workaround is to add '...' instead of '.'.

My other question is: are there plans for controllable pitch tokens, to be able to do things like emphasize a specific word, or give a particular word a specific tone (specified in the text input, not automatically)?

Thanks!

cschaefer26 commented 3 years ago

Hi, glad you like it. How much data do you have?

  1. If you synth across multiple sentences, then you would have to produce training data with multiple sentences as well (just concat them with the desired pause, for example). Another option would be to manually mess with the mels and phoneme durations of the data (e.g. add some silence to the mel specs and make sure the dot gets a lot of duration), because the tacotron-extracted durations are not really reliable in this regard. We simply synth each sentence individually and concatenate the wavs with some hard-coded pause (see the sketch after this list); that's probably the easiest and also quality-wise the best.

  2. You could actually do this already using the pitch_function in the colab: https://colab.research.google.com/github/as-ideas/ForwardTacotron/blob/master/notebooks/synthesize.ipynb. For example pitch_func = lambda x: torch.cat([x[:, :, :6] + 1.3, x[:, :, 6:]], dim=-1) to raise the pitch for the first 6 chars.
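For point 1, the concatenation boils down to something like this (just a sketch with illustrative names, assuming numpy float arrays at a known sample rate; it is not code from the repo):

```python
import numpy as np

def concat_with_pause(wavs, sample_rate=22050, pause_seconds=0.35):
    """Join per-sentence waveforms with a fixed silent gap between them."""
    pause = np.zeros(int(sample_rate * pause_seconds), dtype=np.float32)
    pieces = []
    for i, wav in enumerate(wavs):
        pieces.append(wav.astype(np.float32))
        if i < len(wavs) - 1:  # no trailing pause after the last sentence
            pieces.append(pause)
    return np.concatenate(pieces)

# Usage: synthesize each sentence separately, then stitch the wavs together, e.g.
# full_wav = concat_with_pause([wav1, wav2, wav3], sample_rate=48000, pause_seconds=0.4)
```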

jmasterx commented 3 years ago

Thanks for the info, I will check that out!

I have around 7,000 samples of my voice and I trained at 48 kHz, totalling 11 hours 45 minutes of audio.

It is my own dataset: about 4,500 sentences from the LJSpeech corpus, 500 from Alice in Wonderland, and 2,000 questions from Wikipedia.

Here are some examples using WaveRNN https://vocaroo.com/1ox9ak6O3dHd

One thing that I find strange is that my sentences go flat toward the end: https://vocaroo.com/16ifuiJhlKVB the 'that has never gone out of style' part loses all inflection. I do not understand why...

Most sentences suffer from this. And if I repeat the sentence twice: https://vocaroo.com/1e2atNcuDSIN the voice starts to sound more and more sad. It is quite evident here: https://vocaroo.com/1esyrbcG2B8L I'm not sure why this happens.

cschaefer26 commented 3 years ago

Nice. Sounds quite good already, but imo the WaveRNN could still improve a bit (the gnarling/hissing) - how many steps is this for the vocoder and the TTS model? The hissing could also come from not-so-great durations if the tacotron attention is off.

I've seen some problems with ending pitch for some datasets, mainly male ones. Did you look at the pitch loss? Maybe it's overfitting. It could also be a problem of the trailing durations being a bit off; maybe trimming some silence would help with this (I just fixed the missing trimming functions in master preprocessing). If that doesn't help, you could try to mess around with the pitch loss function and scale it up towards the end of each sequence (e.g. multiply the loss with an increasing factor) - we tried this already and it seemed to help with the ending pitch.
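What I mean by scaling the loss is roughly this (only a sketch; the shapes and names are assumptions, not the repo's actual loss code):

```python
import torch
import torch.nn.functional as F

def weighted_pitch_loss(pitch_pred, pitch_target, max_weight=3.0):
    """L1 pitch loss whose weight ramps up linearly towards the end of each
    sequence, so errors on the ending pitch are penalized more heavily.
    pitch_pred / pitch_target: tensors of shape (batch, time)."""
    ramp = torch.linspace(1.0, max_weight, pitch_pred.size(-1),
                          device=pitch_pred.device)                    # (time,)
    per_frame = F.l1_loss(pitch_pred, pitch_target, reduction='none')  # (batch, time)
    return (per_frame * ramp).mean()
```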

jmasterx commented 3 years ago

Hi

I retrained using the latest repo and it was a bit better, but it still wound up getting the end pitches wrong by the end of training. It starts out alright, but eventually it overfits or something, I guess.

However, I did try something interesting.

I modified the scripts a bit so that the pitches for each phoneme came from the LJSpeech model, and got great results like this! I think it could be interesting to have the option to use different models for duration and pitch prediction!

This is using my pitch conditioning: https://vocaroo.com/19bny42DEx2d

Notice the endings become very monotone.

Now here is the same thing but I fed in the pitches from LJSpeech

https://vocaroo.com/19ltwQ1gBJOJ

To me it sounds much better!

I think this could have some interesting applications. It could allow high-quality voices with potentially less forced-alignment data!

In any case, I think adding the option to use a different model for duration and/or pitch prediction could be interesting!
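Conceptually, feeding in the pitches from another model amounts to something like the following (purely a sketch; the names are illustrative, not the repo's API or my exact change):

```python
import torch

def make_pitch_func(pitch_model, phoneme_ids):
    """Build a pitch_func that swaps the main model's predicted pitch for the
    pitch predicted by a second model (e.g. one trained on LJSpeech).
    pitch_model.predict_pitch is a stand-in for however that second model
    exposes its per-phoneme pitch prediction."""
    def pitch_func(own_pitch):
        with torch.no_grad():
            other_pitch = pitch_model.predict_pitch(phoneme_ids)
        # fall back to the model's own pitch if the shapes ever disagree
        return other_pitch if other_pitch.shape == own_pitch.shape else own_pitch
    return pitch_func
```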

cschaefer26 commented 3 years ago

Hi, very cool. This is something on my list; I will also try to train multispeaker models, which I hope will improve the pitch prediction. I am pretty sure that some transfer learning will benefit the pitch prediction. So far it seems to me that the pitches of male speakers are harder for the models to pick up; maybe the pitch is harder to extract in the first place (to me, the female mel specs are much clearer than the male ones).

jmasterx commented 3 years ago

Hi

One thing you mentioned was adding silence to the mel spectrogram. I thought I could add silence by playing with the duration of spaces, but it turns out most words don't actually contain 'silence' phonemes.

However, if I insert something like '...' between words, it completely messes up / changes the spectrogram.

Is there a token I can insert that adds in silence without altering the mels in any other way than to add silence? If not, would you be able to point me to the part of the code where I could inject my own silence after a phoneme?
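The kind of hook I have in mind would look roughly like this (assuming I can get at the per-phoneme duration tensor before it is expanded to mel frames; the function and argument names here are just hypothetical):

```python
import torch

def add_pause(durations, index, extra_ms, hop_length=256, sample_rate=22050):
    """Lengthen the predicted duration of one phoneme (e.g. a space or '.')
    by roughly extra_ms milliseconds, assuming durations are mel-frame counts.
    durations: tensor of shape (batch, num_phonemes)."""
    frames_per_ms = sample_rate / hop_length / 1000.0
    extra_frames = int(round(extra_ms * frames_per_ms))
    durations = durations.clone()
    durations[:, index] += extra_frames
    return durations
```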

Thanks!

I'm working on a little program that will allow me to insert pauses and alter the length and pitch of words/phonemes with a user interface, so this would be very helpful!

I was also wondering if you knew the meaning of the duration values. Specifically, is there a way to convert those values to milliseconds? For example, if I want a word to last exactly 2 seconds and I know how many milliseconds a duration of 1.0 corresponds to, I can easily figure out what constant to multiply that word's durations by so it lasts the length of time I want.

Same question for pitch: is it possible to target a specific fundamental frequency for a given phoneme? (That would require knowing the base fundamental frequency generated by the network.)
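My working assumption is that the duration values are counts of mel frames, in which case the conversion to milliseconds only depends on the hop length and sample rate (the values below are just examples):

```python
def frames_to_ms(n_frames, hop_length=600, sample_rate=48000):
    """Convert a duration in mel frames to milliseconds.
    With hop_length=600 at 48 kHz, one frame is 12.5 ms."""
    return n_frames * hop_length / sample_rate * 1000.0

# e.g. to make a word last exactly 2 seconds, rescale its durations so that
# they sum to 2000 / frames_to_ms(1) frames.
```

For pitch, my guess is that the predicted values are normalized per speaker, so targeting an absolute fundamental frequency would also need the pitch mean and standard deviation from preprocessing.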

Update: I managed to align the phonemes to a grid: https://vocaroo.com/1d2EZ8aXR8AF

joseluismoreira commented 3 years ago

Hey @jmasterx, amazing work here. Thanks for the insights; your results look great. I am wondering how your dataset was collected? I am a beginner in the TTS area, so I'm looking for some best practices. Could you describe it, please? How many samples do you think are enough? Apart from LJSpeech, I couldn't find many TTS datasets with so many hours from the same speaker. I directed the question at @jmasterx, but please, anyone feel free to contribute. Thanks

jmasterx commented 3 years ago

@joseluismoreira Hi

My dataset was collected by me speaking into a Rode NT1 microphone.

I used a tool that I wrote to make it easier to record the samples. You can find it in this repo: https://github.com/jmasterx/TextToSpeech/tree/main/TextToSpeechTools/Metadata along with the JoshCustom csv, which is my own metadata file for this corpus.

The data was recorded at 16-bit, 96 kHz, then downsampled to 48 kHz. This iteration was trained with peak-normalized samples; however, I am now getting better results without peak normalization.

However, the model you hear here is very noisy.

For the new one I am training, I have processed the same samples as follows:

  1. Noise suppression
  2. 3 dB compression
  3. Normalize to -16 LUFS
  4. EQ out all frequencies below 70 Hz

I have attached my hparams for the new way I'm training, which covers hop size, max frequency of the spectrograms, etc., for 48 kHz.

hparams.zip

joseluismoreira commented 3 years ago

@jmasterx Thank you very much for the detailed answer. It will be very useful for me :)

lukacupic commented 2 years ago

@jmasterx Have you been able to insert pauses into the text? If so, could you please point me in the right direction?