NVIDIA / tacotron2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference
BSD 3-Clause "New" or "Revised" License
5.07k stars 1.38k forks

Increasing the possible length of sentences. #184

Closed RoelVdP closed 5 years ago

RoelVdP commented 5 years ago

Using https://github.com/NVIDIA/tacotron2 inference.ipynb and the default max_decoder_steps=1000, I can make sentences that are about 5-10 words long with good quality.

Increasing it to 2000 seems to allow slightly longer sentences. Increasing it further makes the audio longer than it should be (e.g. 30 seconds where it should be 7) and odd: it starts off well, then the voice degrades.

The NVIDIA GPU used has more than 10 GB of memory, but only a small fraction of it is used during the process, so the issue is not hardware related.

Using longer sentences and/or a smaller max_decoder_steps also generates the "Warning! Reached max decoder steps".

How can the possible sentence length for TTS be increased? Per-sentence processing is not an option, as even slightly longer sentences already run into issues.

pravn commented 5 years ago

I think there are a few ideas if you want to synthesize longer sequences, but we might need to write extra code to set them up.

1) (This is a bit hacky and probably the easiest) Break up sentences longer than (let's say) five words and synthesize the parts separately. An improved variant would be to condition on the first five words with some sort of recurrent context, with or without attention.
2) Try the 'r' trick from the original Tacotron. This way, we can generate more than one frame per decoder timestep (although, again, we might need to adjust the number of hidden units to ensure quality, etc.).
3) Generate a lower-resolution spectrogram and then follow it up (progressively) with super-resolution. The paper by Tachibana-Uenoyama has a flavor of this in what they call the Spectrogram Super-resolution Network; this kind of hierarchical growing is used in many different settings. https://arxiv.org/pdf/1710.08969.pdf
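The first option (splitting long input into short chunks and synthesizing each chunk separately) can be sketched in a few lines. This is a hypothetical helper, not part of the repo; `chunk_text` and the `max_words` threshold are illustrative names, and the threshold would be tuned against your model's `max_decoder_steps`:

```python
import re

def chunk_text(text, max_words=10):
    """Split text into sentence-level chunks of at most max_words words,
    so each chunk stays within the decoder-step budget."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks = []
    for sent in sentences:
        words = sent.split()
        # Long sentences are further split into fixed-size word windows.
        for i in range(0, len(words), max_words):
            chunks.append(' '.join(words[i:i + max_words]))
    return chunks
```

Each chunk would then be run through `model.inference()` separately and the resulting audio segments concatenated; the seams may be audible without some crossfading or prosody handling, which is why the recurrent-context variant above is suggested as an improvement.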


pravn commented 5 years ago

Also, do you know why it breaks at a larger number of timesteps? Does the attention alignment break down beyond some length?


RoelVdP commented 5 years ago

I found that at least part of the problem seems to have been caused by this proposed fix: `git submodule update --remote --merge`

That command was suggested for the "denoiser module is missing" error (which is indeed seen in the latest code) in https://github.com/NVIDIA/tacotron2/issues/164

After pulling the tree afresh, not executing the git command above, and disabling the denoiser module in inference.py, it now works with at least 30-40 words, and there is no "odd voice".

Now the question is how to get the denoiser to work in the latest code, and how to go beyond 30-40 words.

@pravn thank you. What is the 'r' trick?

RoelVdP commented 5 years ago

Great news (and hereby resolving my own ticket). Setting max_decoder_steps=2000 in hparams on the cleanly pulled tree (i.e. without the git command mentioned as a fix in #164) and with the denoiser temporarily disabled/commented out, it works even with quite long sentences. To clarify further: the original issue above (odd voice at the end + overly long audio) was partly the result of that same git command.

@rafaelvalle your input on #164 would be welcome. @pravn thanks, let me know about that 'r' trick.
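For anyone following along, the fix described above is an inference-time change only (no retraining was done here). A stand-in `create_hparams()` is defined below so the snippet runs standalone; in the repo you would import the real one from hparams.py:

```python
from types import SimpleNamespace

def create_hparams():
    # Stand-in for the repo's hparams.create_hparams(); only the field
    # relevant to this thread is shown.
    return SimpleNamespace(max_decoder_steps=1000)

hparams = create_hparams()
hparams.max_decoder_steps = 2000  # raise the default cap of 1000 before building the model
```

The model is then constructed from these hparams as in inference.ipynb, so the higher cap takes effect without touching any checkpoints.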

pravn commented 5 years ago

@RoelVdP - The 'r' trick is from the original Tacotron, where the decoder generates several frames of speech (r of them) at every timestep. One of these frames (or all of them stacked) is used as the decoder input for the next timestep. In a sense, it is a kind of downsampling scheme to produce more frames per decoder timestep. @r9y9's Tacotron code has it as follows (https://github.com/r9y9/tacotron_pytorch/blob/master/tacotron_pytorch/tacotron.py):

```python
class Decoder(nn.Module):
    def __init__(self, in_dim, r):
        ...
        self.prenet = Prenet(in_dim * r, sizes=[256, 128])
        # (prenet_out + attention context) -> output
        ...
        self.proj_to_mel = nn.Linear(256, in_dim * r)
        self.max_decoder_steps = 200
```
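To make the reshaping concrete, here is a toy, framework-free sketch of how one decoder step's flat output becomes r frames. The sizes and the helper name are illustrative, not the repo's actual hyperparameters:

```python
# Toy sketch of the 'r' trick: one decoder step emits a flat vector of
# n_mel * r values, which is split into r mel frames.
n_mel, r = 4, 3  # tiny sizes for illustration; real models use e.g. n_mel=80

def split_into_frames(step_output, n_mel, r):
    """Reshape one decoder step's flat output into r frames of n_mel bins."""
    assert len(step_output) == n_mel * r
    return [step_output[i * n_mel:(i + 1) * n_mel] for i in range(r)]

step_output = list(range(n_mel * r))   # stand-in for proj_to_mel's output
frames = split_into_frames(step_output, n_mel, r)
# frames[-1] feeds the prenet as the next decoder input, so a fixed
# max_decoder_steps cap now covers r times as much audio.
```

This is why the trick helps with length: the decoder-step budget is spent r times more slowly for the same amount of generated speech.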


vitaly-zdanevich commented 5 years ago

@RoelVdP how did you disable denoiser?

shoegazerstella commented 5 years ago

@RoelVdP how did you disable denoiser?

In inference.ipynb just comment this out:

```python
audio_denoised = denoiser(audio, strength=0.01)[:, 0]
ipd.Audio(audio_denoised.cpu().numpy(), rate=hparams.sampling_rate)
```

vitaly-zdanevich commented 5 years ago

@pravn will this fix be merged? I tried to alter model.py according to your code example, but it looks like I need to alter some other code as well. Right now 11 seconds is the maximum; with max_decoder_steps=2000 I can get 17 seconds, but for some texts I hear a very drunk-sounding voice.

pravn commented 5 years ago

These changes are from the original Tacotron paper (not Tacotron 2), so I would think that usage would be external to the current implementation. At any rate, this sort of thing needs a lot of testing.


pravn commented 5 years ago

Also, do you have a link to your changes? What r value was used, which corpus, etc.? Did you use a postprocessing net to clean up the generated utterances?


v-nhandt21 commented 4 years ago

> I tried to alter model.py according to your code example but looks like I need to alter some other code also. Now only 11 seconds is maximum, with max_decoder_steps=2000 I can get 17 seconds, but for some texts I hear very drunk voice.

I have the same problem; can you share more about how you solved it? I wonder how NVIDIA's sample clips can be so long. Should I retrain if I change max_decoder_steps=2000, or does it only affect the inference step?

ErfolgreichCharismatisch commented 3 years ago

Tutorial: Training on GPU with Colab, Inference with CPU on Server here.