as-ideas / TransformerTTS

🤖💬 Transformer TTS: Implementation of a non-autoregressive Transformer based neural network for text to speech.
https://as-ideas.github.io/TransformerTTS/

Issues replicating the examples #51

Open Bardo-Konrad opened 4 years ago

Bardo-Konrad commented 4 years ago

My predict.py:

from utils.config_manager import ConfigManager
from utils.audio import Audio
from scipy.io.wavfile import write

config_loader = ConfigManager('ljspeech_autoregressive_transformer/standard', model_kind='autoregressive')
audio = Audio(config_loader.config)
model = config_loader.load_model()
was = 'President Trump met with other leaders at the Group of twenty conference.'
out = model.predict(was)

# Convert spectrogram to wav (with Griffin-Lim)
wav = audio.reconstruct_waveform(out['mel'].numpy().T)

samplerate = 22050
was = "".join(x for x in was if x.isalnum())  # strip non-alphanumerics for the filename
write(was + ".wav", samplerate, wav)
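One possible confound when judging quality by ear: scipy.io.wavfile.write stores float arrays as 32-bit float WAVs, which some players decode poorly or quietly. A hedged alternative (save_wav is a helper name introduced here, not part of the repo) that peak-normalizes and writes 16-bit PCM instead:

```python
import numpy as np
from scipy.io.wavfile import write

def save_wav(path, wav, samplerate=22050):
    # Peak-normalize and quantize to 16-bit PCM; many players handle
    # this more reliably than the 32-bit float WAVs scipy writes for
    # float input arrays.
    wav = np.asarray(wav, dtype=np.float32)
    peak = float(np.max(np.abs(wav)))
    if peak > 0:
        wav = wav / peak
    write(path, samplerate, (wav * 32767).astype(np.int16))
```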

I changed the ARCHITECTURE section of autoregressive_config.yaml to

# ARCHITECTURE 
decoder_model_dimension: 256
encoder_model_dimension: 512
decoder_num_heads: [4, 4, 4, 4]  # the length of this defines the number of layers
encoder_num_heads: [4, 4, 4, 4]  # the length of this defines the number of layers
encoder_feed_forward_dimension: 1024
decoder_feed_forward_dimension: 1024
decoder_prenet_dimension: 256
encoder_prenet_dimension: 512
encoder_max_position_encoding: 1000
decoder_max_position_encoding: 10000
postnet_conv_filters: 256
postnet_conv_layers: 5
with_stress: true
postnet_kernel_size: 5
encoder_dense_blocks: 4
decoder_dense_blocks: 4
normalizer: 'WaveRNN'
encoder_attention_conv_filters: 512
decoder_attention_conv_filters: 512
encoder_attention_conv_kernel: 3
decoder_attention_conv_kernel: 3

And I got the attached wave file PresidentTrumpmetwithotherleadersattheGroupoftwentyconference.zip. It does not sound like the samples at https://as-ideas.github.io/TransformerTTS/.

What is missing?

cfrancesco commented 4 years ago

Hi, did you use a pretrained model? Which version of the repo (which commit) are you using? Your audio might come from an older model file than the most recent pretrained model available.

edit: I noticed you added with_stress and normalizer to the config. These normally belong in data_config.yaml. Also, I don't think stress is available for models trained with the WaveRNN preprocessing, so you might be using a recent commit of the repo instead of the right one. Please check out the commit listed next to the model file in the table. Also, I discourage using the autoregressive model: it is unstable by nature (to make it work I had to add a comma after "Trump", which is no longer there because of the comparison with ForwardTacotron) and it has noise injected into the decoder, so results will vary with the random seed.
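To illustrate where those keys would go, a hedged sketch of the relevant fragment of data_config.yaml (key names taken from the comment above; surrounding keys and exact values depend on the commit you check out):

```yaml
# data_config.yaml (fragment; illustrative only)
normalizer: 'WaveRNN'
with_stress: false   # stress marks reportedly not available with WaveRNN preprocessing
```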

edit: listening to the audio, it might actually be that you're phonemizing with stress, but you shouldn't be.

cfrancesco commented 4 years ago

Also, if you're interested in replicating the results using our pretrained models, you can just try the Colab Notebooks.

Bardo-Konrad commented 4 years ago

> Also, if you're interested in replicating the results using our pretrained models, you can just try the Colab Notebooks.

I used https://colab.research.google.com/github/as-ideas/TransformerTTS/blob/master/notebooks/synthesize_forward_melgan.ipynb

I got only two warning messages

WARNING: could not retrieve git hash. Command '['git', 'describe', '--always']' returned non-zero exit status 128.
WARNING: could not check git hash. Command '['git', 'describe', '--always']' returned non-zero exit status 128.
restored weights from ljspeech_melgan_forward_transformer/melgan/forward_weights/ckpt-179 at step 895000

I changed the sentence to 'Hello, how are you?'

I got Herunterladen.zip.

It sounds as bad as my sample, yet this time I followed the Colab.

What improvement do you suggest, now that we are using the exact same approach?

cfrancesco commented 4 years ago

Sounds fine to me. This is inverted with the Griffin-Lim algorithm, so the sound quality is expected to be low. You need to follow the next steps in the notebook and convert the spectrogram using the vocoder.
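For context: Griffin-Lim estimates the missing phase iteratively from the magnitude spectrogram alone, which is why its output sounds metallic next to a trained neural vocoder like MelGAN. A minimal sketch of the iteration with SciPy (not the repo's Audio implementation, and operating on a linear-frequency magnitude spectrogram, whereas the repo first maps the mel spectrogram back to linear frequencies):

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=32, nperseg=512, noverlap=384):
    """Estimate a waveform from a magnitude spectrogram by iterating
    between the time and frequency domains: keep the known magnitude,
    refine the phase estimate each round."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        _, wav = istft(mag * phase, nperseg=nperseg, noverlap=noverlap)
        _, _, spec = stft(wav, nperseg=nperseg, noverlap=noverlap)
        spec = spec[:, : mag.shape[1]]          # guard against frame-count drift
        phase[:, : spec.shape[1]] = np.exp(1j * np.angle(spec))
    _, wav = istft(mag * phase, nperseg=nperseg, noverlap=noverlap)
    return wav
```

Because the phase is only ever approximated, artifacts remain no matter how many iterations run; the vocoder step in the notebook replaces this inversion entirely.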