Any idea on how this could be done?
I have a similar problem. Basically, I need to generate speech that matches an original speech transcription.
It is possible using the SpeedySpeech, GlowTTS or AlignTTS models, but the API is not exposed for this purpose. Also, the models need to be trained with phonemes instead of graphemes, for which we revoked support in version 0.0.14 until we find an alternative to espeak. You can contribute to the discussion here #492.
If all of the above is solved, then you just need to pass the durations to the model and use them instead of the duration predictor's output. Basically, it requires some coding and a PR to TTS. If you want to do that, I can help you with it.
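For reference, per-phoneme durations in seconds (e.g. from a TextGrid) would first have to be converted into decoder frame counts. A minimal sketch, assuming the common 22050 Hz / 256-hop audio config; the helper name is hypothetical and nothing like it exists in the current API:

import torch

def durations_to_frames(durations_sec, sample_rate=22050, hop_length=256):
    # one decoder frame covers hop_length / sample_rate seconds
    # (~11.6 ms with the values assumed here)
    frames_per_sec = sample_rate / hop_length
    frames = torch.tensor(durations_sec) * frames_per_sec
    return torch.clamp(torch.round(frames), min=1).long()

# e.g. three phonemes lasting 50 ms, 120 ms and 80 ms
print(durations_to_frames([0.05, 0.12, 0.08]))  # tensor([ 4, 10,  7])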
As I am new to almost everything in TTS, do you have any articles or papers that explain when phonemes are used instead of graphemes, or the pros and cons of each? I am asking so that I can first understand the whole thing before I start coding.
PS: Ahh.. and of course, probably I will need your help =)
One thing that is not clear to me: do I have to pass the phoneme durations at training time, or can I just pass them at inference time with a trained model?
@erogol just curious, if you currently pass graphemes and their durations, will that work as expected? Where could I see some docs on that?
No, it would not work. To make it work you need to edit the code.
There are no docs for it; there is only the code.
I'm trying to understand this piece of code from TTS/tts/glow_tts.py, because looking at the code, this is where it gets the durations from the DurationPredictor.
# compute output durations
w = (torch.exp(o_dur_log) - 1) * x_mask * self.length_scale
w_ceil = torch.ceil(w)
y_lengths = torch.clamp_min(torch.sum(w_ceil, [1, 2]), 1).long()
y_max_length = None
@erogol can you give us some hints on how we should do this?
Another question is how to pass arguments to the models, because the only way I can see is via the config file.
That part of the code converts log-scale durations to linear scale before computing the model outputs.
You probably need to write your own inference function that takes the durations you provide instead of computing them with the duration predictor.
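For illustration only, a rough sketch of what a duration-driven variant of that step could look like; the external_durations argument (per-phoneme frame counts of shape [B, 1, T]) is hypothetical and simply mirrors the snippet quoted above rather than the actual GlowTTS code:

import torch

def compute_output_lengths(o_dur_log, x_mask, length_scale, external_durations=None):
    if external_durations is not None:
        # hypothetical path: use per-phoneme frame counts supplied by the caller
        w_ceil = external_durations * x_mask
    else:
        # original path: convert log-scale predictions to linear frame counts
        w = (torch.exp(o_dur_log) - 1) * x_mask * length_scale
        w_ceil = torch.ceil(w)
    y_lengths = torch.clamp_min(torch.sum(w_ceil, [1, 2]), 1).long()
    return w_ceil, y_lengths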
I'm stuck trying to understand what kind of information each element of the w_ceil array holds, whether it is seconds or something else.
Basically, I understood that each element is the duration of a phoneme, but I don't understand how I can manipulate this array to make it match the expected text duration or word duration that I need.
Each value in w_ceil is the duration of the corresponding phoneme. So, just edit it as you like.
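For example (a tiny made-up sketch, assuming w_ceil has shape [batch, 1, n_phonemes]):

import torch

# four phonemes with frame counts 4, 10, 7 and 5
w_ceil = torch.tensor([[[4.0, 10.0, 7.0, 5.0]]])
# make the third phoneme last twice as long
w_ceil[0, 0, 2] = w_ceil[0, 0, 2] * 2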
So, my problem is a bit different from @turian's: I only have the whole speech duration. I'm thinking about how I can approach this to make the w_ceil durations match mine, because I don't know exactly whether it is in seconds, milliseconds, etc.
@tuliomagalhaes I would be interested even if you can just get your own use case to work. Some of my utterances are longer than typical TTS output, and I would like longer TTS utterances that sound relatively natural.
Predicted durations tell how many output frames each character produces. So, checking the audio parameters, each output frame corresponds to a certain amount of time, around 12 ms. You can estimate what each duration value corresponds to in time.
A lazy way is to produce the normal output speech, take its length, and divide by the total duration so that you can compute what each duration unit corresponds to in the time domain.
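A sketch of that lazy calibration, assuming wav is the waveform from a normal inference run and w_ceil the matching durations; the helper names are mine, not part of TTS:

import torch

def seconds_per_duration_unit(wav, sample_rate, w_ceil):
    # audio length in seconds divided by the summed durations gives the
    # time span that a single duration unit corresponds to
    return (len(wav) / sample_rate) / torch.sum(w_ceil).item()

def rescale_to_target(w_ceil, target_sec, sec_per_unit):
    # scale every phoneme duration so the utterance lasts roughly target_sec
    scale = target_sec / (torch.sum(w_ceil).item() * sec_per_unit)
    return torch.clamp(torch.ceil(w_ceil * scale), min=1)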
@turian this is what I have done:
def __calculate_duration(self, o_dur_log, x_mask, duration, sample_rate, hop_length):
    # durations predicted with the current length_scale
    w = self.__transform_duration_log_to_linear(o_dur_log, x_mask, self.length_scale)
    # estimated output length in seconds (frames / frames-per-second), plus the
    # 0.117 s offset I observe between the mel spectrogram and the waveform
    inference_duration = (torch.sum(w).item() / (sample_rate / hop_length)) + 0.117
    diff_duration = (duration - inference_duration) - 0.117
    # adjust length_scale proportionally to close the gap to the expected duration
    new_length_scale = self.length_scale + (diff_duration / inference_duration)
    return self.__transform_duration_log_to_linear(o_dur_log, x_mask, new_length_scale)

def __transform_duration_log_to_linear(self, o_dur_log, x_mask, length_scale):
    # same conversion as the original inference: log-scale durations to linear frame counts
    w = (torch.exp(o_dur_log) - 1) * x_mask * length_scale
    return torch.ceil(w)
Basically, I am passing the expected duration, sample_rate and hop_length (which I get from TTS.tts.utils.audio.AudioProcessor) to the inference method of GlowTTS, and I use them to estimate the mel spectrogram length in seconds so I know whether I need to increase or decrease the length_scale (it starts at 1.0).
The only problem I have right now is that I don't know why the estimated mel spectrogram length differs from the generated waveform. From some tests, the difference is about 0.117 s (maybe the vocoder is adding some extra samples?), and I add it to the mel spectrogram calculation to compensate for this difference.
@erogol do you know why we have this difference between the mel spectrogram and the waveform?
@tuliomagalhaes is this per-word or per-phone duration? I'm curious to talk more, if you don't mind if we email. I'm lastname at gmail dot com
Check your email to confirm whether you received it.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also have a look at our discussion channels.
Is your feature request related to a problem? Please describe.
I have phonemes and their start and end times in a TextGrid file. I would like to synthesize speech using these phonemes and durations.
This is particularly important when you want a particular cadence to the speech.
Describe the solution you'd like
A TTS interface where I can pass phonemes and their start and end times.
Describe alternatives you've considered
Run using text, force align, and DTW.