coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0
35.17k stars 4.29k forks

[Feature request] Synthesize given phonemes and durations #485

Closed turian closed 3 years ago

turian commented 3 years ago

Is your feature request related to a problem? Please describe.

I have phonemes and their start and end times, in a TextGrid file. I would like to synthesize text using these phonemes and durations.

This is particularly important when you want a particular cadence to the speech.

Describe the solution you'd like

A TTS interface where I can pass phonemes and their start and end times.

Describe alternatives you've considered

Run using text, force align, and DTW.

tuliomagalhaes commented 3 years ago

Any idea on how this could be done?

I have a similar problem. Basically I need to generate a speech that matches an original speech transcription.

erogol commented 3 years ago

It is possible using the SpeedySpeech, GlowTTS, or AlignTTS models, but the API is not exposed for this purpose. Also, the models need to be trained with phonemes instead of graphemes, for which we revoked support in version 0.0.14 until we find an alternative to espeak. You can contribute to the discussion here #492.

If all of the above is solved, then you just need to pass the durations to the model and use them instead of the duration predictor. Basically, it requires some level of coding and PRing to TTS. If you want to do that I can help you with it.
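To make the idea concrete, here is a minimal sketch of the core step: turning per-phoneme frame counts into a hard phoneme-to-frame alignment, which is roughly what a model like GlowTTS derives internally from its duration predictor. `durations_to_alignment` is a hypothetical helper for illustration, not part of the TTS codebase.

```python
import torch

def durations_to_alignment(durations):
    """Hypothetical helper: expand per-phoneme durations (in output
    frames) into a hard phoneme-to-frame alignment matrix."""
    total_frames = int(durations.sum().item())
    align = torch.zeros(len(durations), total_frames)
    start = 0
    for i, d in enumerate(durations.tolist()):
        align[i, start:start + int(d)] = 1.0  # phoneme i covers d frames
        start += int(d)
    return align

# three phonemes lasting 3, 1, and 2 output frames
align = durations_to_alignment(torch.tensor([3, 1, 2]))
```

With durations coming from a TextGrid instead of the duration predictor, a matrix like this would drive the decoder's expansion step.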

tuliomagalhaes commented 3 years ago

As I am new to almost everything in TTS, do you have any articles or papers that explain when phonemes are used instead of graphemes, or the pros and cons of each? I'm asking so I can first understand the whole thing before starting to code.

PS: Ah, and of course, I will probably need your help =)

tuliomagalhaes commented 3 years ago

One thing that is not clear to me: do I have to pass the phoneme durations at training time, or can I just pass them at inference time with a trained model?

turian commented 3 years ago

@erogol just curious, if you currently pass graphemes and their durations, will that work as expected? Where could I see some docs on that?

erogol commented 3 years ago

No, it would not work. To make it work you need to edit the code.

There is no doc for it, there is only the code.

tuliomagalhaes commented 3 years ago

I'm trying to understand this piece of code from TTS/tts/glow_tts.py, because, looking at the code, this is where it gets the durations from the DurationPredictor.

# compute output durations
w = (torch.exp(o_dur_log) - 1) * x_mask * self.length_scale  # log scale -> linear frame counts
w_ceil = torch.ceil(w)  # round each phoneme's duration up to whole frames
y_lengths = torch.clamp_min(torch.sum(w_ceil, [1, 2]), 1).long()  # total output frames per item
y_max_length = None

@erogol can you help us by giving some hints on how we should do this?

Another doubt is about passing arguments to models, because the only way I see is through the config file.

erogol commented 3 years ago

that part of the code converts log scale durations to linear scale before computing the model outputs.

probably you need to write your own inference function that takes the durations you provide instead of computing by the duration model.
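A sketch of what such a function might look like, based on the glow_tts.py snippet quoted above. The `external_durations` argument is hypothetical, not part of the TTS API; the fallback branch mirrors the existing duration computation.

```python
import torch

def compute_durations(o_dur_log, x_mask, length_scale=1.0, external_durations=None):
    """Sketch of a duration hook: use the model's predicted
    log-durations (as in glow_tts.py) unless the caller supplies
    per-phoneme frame counts directly."""
    if external_durations is not None:
        # caller-provided frame counts, masked to valid positions
        return torch.ceil(external_durations * x_mask)
    # default path: convert predicted log durations to linear scale
    w = (torch.exp(o_dur_log) - 1) * x_mask * length_scale
    return torch.ceil(w)
```

Inference code would then build the alignment from this result instead of the duration predictor's output.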

tuliomagalhaes commented 3 years ago

I'm stuck trying to understand what kind of information each element of the array w_ceil holds, whether it is seconds or something else. I understand that each element is the duration of a phoneme, but I don't see how to manipulate this array to make it match the expected text or word duration that I need.

erogol commented 3 years ago

each value in w_ceil is the duration of the corresponding phoneme. So, just edit it as you like.
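Editing it might look like this: a sketch that uniformly rescales the per-phoneme frame counts so they sum (approximately, after rounding) to a known total utterance length. `scale_durations` is a hypothetical helper, not part of TTS.

```python
import torch

def scale_durations(w_ceil, target_total_frames):
    """Sketch: rescale per-phoneme frame counts toward a target
    total number of output frames (hypothetical helper)."""
    factor = target_total_frames / w_ceil.sum()
    return torch.ceil(w_ceil * factor)

# stretch three phonemes from 6 total frames to 12
scaled = scale_durations(torch.tensor([3.0, 1.0, 2.0]), 12)
```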

tuliomagalhaes commented 3 years ago

So, my problem is a bit different from @turian's: I only have the whole speech duration. I'm thinking about how to make the w_ceil durations match mine, because I don't know exactly whether they are in seconds, milliseconds, etc.

turian commented 3 years ago

@tuliomagalhaes I would be interested even if you can only get your use case to work. Some of my utterances are longer than typical TTS output, and I would like longer TTS utterances that sound relatively natural.

erogol commented 3 years ago

Predicted durations tell how many output frames each character produces. Checking the audio parameters, each output frame corresponds to a certain amount of time, around 12 ms. From that you can estimate what each duration value corresponds to in time.

A lazy way is to produce the normal output speech, take its length, and divide by the total duration; that tells you what each duration unit corresponds to in the time domain.
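The frame-to-time conversion can be written out directly: one output frame spans hop_length audio samples. The defaults below (22050 Hz sample rate, hop length 256) are assumed values common in TTS audio configs, which give roughly 11.6 ms per frame, consistent with the "around 12 ms" figure.

```python
def frames_to_seconds(n_frames, sample_rate=22050, hop_length=256):
    # each output frame spans hop_length samples; assumed defaults
    return n_frames * hop_length / sample_rate

def seconds_to_frames(seconds, sample_rate=22050, hop_length=256):
    # inverse mapping, e.g. for turning TextGrid durations into frame counts
    return round(seconds * sample_rate / hop_length)
```

The actual sample_rate and hop_length should be read from the model's audio config (e.g. via AudioProcessor) rather than assumed.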

tuliomagalhaes commented 3 years ago

@turian this is what I have done:

def __calculate_duration(self, o_dur_log, x_mask, duration, sample_rate, hop_length):
    w = self.__transform_duration_log_to_linear(o_dur_log, x_mask, self.length_scale)
    # estimated audio length in seconds; +0.117 s compensates for the
    # observed mel/waveform mismatch described below
    inference_duration = (torch.sum(w).item() / (sample_rate / hop_length)) + 0.117
    diff_duration = (duration - inference_duration) - 0.117
    # adjust length_scale proportionally to hit the expected duration
    new_length_scale = self.length_scale + (diff_duration / inference_duration)
    return self.__transform_duration_log_to_linear(o_dur_log, x_mask, new_length_scale)

def __transform_duration_log_to_linear(self, o_dur_log, x_mask, length_scale):
    # same conversion as in glow_tts.py: log durations -> linear frame counts
    w = (torch.exp(o_dur_log) - 1) * x_mask * length_scale
    return torch.ceil(w)

Basically I am passing the expected duration, sample_rate, and hop_length (which I get from TTS.tts.utils.audio.AudioProcessor) to the inference method of GlowTTS, and I use them to estimate the mel spectrogram length in seconds, to know whether I need to increase or decrease the length_scale (it starts at 1.0).

The only problem I have right now is that I don't know why the estimated mel spectrogram length in seconds differs from the generated waveform's. In my tests the difference is about 0.117 s (maybe the vocoder is adding some extra samples?), so I add it to the mel spectrogram calculation to compensate for this difference.

@erogol do you know why we have this difference between the mel spectrogram and the waveform?

turian commented 3 years ago

@tuliomagalhaes is this a per-word or per-phone duration? I'm curious to talk more, if you don't mind emailing. I'm lastname at gmail dot com

tuliomagalhaes commented 3 years ago

Check your email to confirm that you received it.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.