NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License

Inference troubles on Windows #52

Open camjac251 opened 4 years ago

camjac251 commented 4 years ago

I wanted to try synthesizing a short sample using a model I've been training, before training finishes, but I think I'm running into some more issues :/

I ran conda install -c conda-forge notebook but then decided on conda install -c conda-forge jupyterlab instead, since it includes both the new lab and the classic notebook. When I opened "inference.ipynb", I started running the cells one by one.

The first block gave this error:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-7-20c49233a480> in <module>
     11 import scipy as sp
     12 from scipy.io.wavfile import write
---> 13 import pandas as pd
     14 import librosa
     15 import torch

ModuleNotFoundError: No module named 'pandas'

Just a simple missing dependency, so I ignored it and moved on to see the rest of the code.

  File "<ipython-input-8-a10e6c979de1>", line 2
    angle = np.radians(angle)
    ^
IndentationError: unexpected indent

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-11-9f5d81c0336e> in <module>
----> 1 hparams = create_hparams()

NameError: name 'create_hparams' is not defined

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-12-a2ae391905a9> in <module>
----> 1 stft = TacotronSTFT(hparams.filter_length, hparams.hop_length, hparams.win_length,
      2                     hparams.n_mel_channels, hparams.sampling_rate, hparams.mel_fmin,
      3                     hparams.mel_fmax)

NameError: name 'TacotronSTFT' is not defined

And then I stopped at the Load models cell. Are these errors not supposed to show? It feels like I'm using the wrong project or something.

CookiePPP commented 4 years ago

Just a simple missing dependency, so I ignored it and moved on to see the rest of the code.

The code stops at the missing dependency, so everything else in the first block has not been imported.

import librosa
import torch

from hparams import create_hparams
from model import Tacotron2, load_model
from waveglow.denoiser import Denoiser
from layers import TacotronSTFT
from data_utils import TextMelLoader, TextMelCollate
from text import cmudict, text_to_sequence
from mellotron_utils import get_data_from_musicxml

was never run.


I'd suggest installing any dependencies you can. Most are required to run the notebook.
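
A quick way to surface every missing package in one pass (my own sketch, not part of the repo) is to attempt the notebook's third-party imports up front:

# Hypothetical helper, not repo code: try each import so every missing
# package is reported at once instead of one failure at a time.
for name in ("pandas", "librosa", "torch", "scipy", "matplotlib"):
    try:
        __import__(name)
        print(name, "OK")
    except ImportError as err:
        print(name, "MISSING:", err)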

camjac251 commented 4 years ago

Ah, I see, that makes sense. I'll go ahead and install the missing dependency then. Hopefully it'll like the latest version. Thank you :)

camjac251 commented 4 years ago

I opted to use version 0.25.3 of pandas since it was from around the time this project was uploaded. I also had to pin the existing version of numpy, or else it would update to the latest and, I believe, cause issues: conda install pandas=0.25.3 numpy=1.16.4.

No errors anymore, except for the IndentationError: unexpected indent, which is harmless. I guess I just rename checkpoint_##### to mellotron_ljs.pt, or is there a process for converting checkpoints to the .pt extension for inference? I'm going to attempt to train waveglow next, before running the full inference code.

CookiePPP commented 4 years ago

It's easier to rename

checkpoint_path = "models/mellotron_libritts.pt"

to

checkpoint_path = "outdir/checkpoint_XXXXXXX"

inside the notebook. I don't believe any conversion is required to test the checkpoint.
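
If in doubt, you can inspect a training checkpoint directly; the .pt suffix is only a convention, and torch.load reads the file by content. A minimal check (my sketch; the key names are what Tacotron 2-style training code typically saves):

import torch

# "outdir/checkpoint_XXXXXXX" is the placeholder path from above.
ckpt = torch.load("outdir/checkpoint_XXXXXXX", map_location="cpu")
print(list(ckpt.keys()))  # typically includes 'state_dict' and 'iteration'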

camjac251 commented 4 years ago

I must be missing something about the training procedure. I followed the waveglow readme's training instructions because I thought you use a separate mellotron model and waveglow model to synthesize results. But when I try to train, I get

(condaenv): python train.py -c config.json
Traceback (most recent call last):
  File "train.py", line 39, in <module>
    from mel2samp import Mel2Samp
  File "C:\Users\camja\Desktop\mellotron\waveglow\mel2samp.py", line 38, in <module>
    from tacotron2.layers import TacotronSTFT
ModuleNotFoundError: No module named 'tacotron2.layers'

camjac251 commented 4 years ago

If I try to run it with the waveglow model available on the readme, I get this error

C:\Users\camja\anaconda3\envs\mello\lib\site-packages\torch\serialization.py:593: SourceChangeWarning: source code of class 'torch.nn.modules.container.ModuleList' has changed. you can retrieve the original source code by accessing the object's source attribute or set `torch.nn.Module.dump_patches = True` and use the patch tool to revert the changes.
  warnings.warn(msg, SourceChangeWarning)

which might be breaking it, because this is my result [attached image] and the audio it created: https://voca.ro/kFaOGGxbLAj

CookiePPP commented 4 years ago

@camjac251 The Predicted Mel is from Mellotron, therefore Mellotron is the one acting up here.

CookiePPP commented 4 years ago

@camjac251 Did you start your Mellotron model from scratch? The Source rhythm should be a diagonal line, where each text input matches a part of the output over time; however, this requires the model to be trained to a decent degree when pretrained weights are not used as the starting point.
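
For illustration, a toy plot (my own sketch, not repo code) of what a fully learned alignment looks like, i.e. a clean diagonal from encoder (text) steps to decoder (mel) steps:

import numpy as np
import matplotlib.pyplot as plt

# Fabricated "ideal" attention map: each mel frame attends to one token,
# advancing monotonically through the text.
n_text, n_mel = 40, 200
align = np.zeros((n_text, n_mel))
for t in range(n_mel):
    align[int(t * n_text / n_mel), t] = 1.0
plt.imshow(align, aspect="auto", origin="lower")
plt.xlabel("decoder step (mel frames)")
plt.ylabel("encoder step (text)")
plt.show()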

camjac251 commented 4 years ago

Yeah, I did. I started from nothing with the LJ Speech dataset. It trained initially but had to be restarted every now and then, so each time I would resume with python train.py --output_directory=outdir --log_directory=logdir --checkpoint_path outdir/checkpoint_#####. I did see another issue (#30) about quality loss when resuming, so I changed use_saved_learning_rate=True but didn't touch ignore_layers=['speaker_embedding.weight']. I trained it up to 16,000 iterations, when the predicted mel started to look good. [attached images: individualImage, individualImage2, individualImage3, individualImage4]

CookiePPP commented 4 years ago

@camjac251 Refer to the notebook: https://github.com/NVIDIA/mellotron/blob/master/inference.ipynb You can see that there is a green/yellow line in the last graph. That's your alignment, i.e. how well the model has learned to link the text/F0 to the audio. Your tensorboard output shows that the model is still learning alignment (top image in your comment).

camjac251 commented 4 years ago

Those breaks are where my machine turned off during training; that's happened at least 20 times, I think. I can only get 6 hours of uninterrupted training in a row per day, then short bursts of restarting it.

camjac251 commented 4 years ago

I let it run longer and tried again today; here is the result. [attached image]

@rafaelvalle Is this ok to ignore with waveglow?

C:\Users\camja\anaconda3\envs\mello\lib\site-packages\torch\serialization.py:593: SourceChangeWarning: source code of class 'torch.nn.modules.container.ModuleList' has changed. you can retrieve the original source code by accessing the object's source attribute or set `torch.nn.Module.dump_patches = True` and use the patch tool to revert the changes.
  warnings.warn(msg, SourceChangeWarning)

I feel like it might be why my audio is sounding like this https://voca.ro/bT4CP3It1BV

Edit: If waveglow requires PyTorch 1.0, then how come mellotron doesn't impose the same requirement in its readme, when tacotron2 does?

rafaelvalle commented 4 years ago

The issue comes from the mel-spectrogram you are producing. Your model hasn't learned to attend yet.

camjac251 commented 4 years ago

It's been a while since I last tried training, but that was after 2 weeks or so of constant training (at least uninterrupted; there were other sessions before). I thought setting the suggested settings from #30 might help, but it might've hurt instead. In my hparams I changed use_saved_learning_rate=True and ignore_layers=[].

Do commas and breaths mess up the alignment of the data during training? I have quite a few samples where the speaker repeats half of a word before saying the full word; I tried to include those repetitions in the transcripts, along with commas wherever a thought changes mid-sentence.

rafaelvalle commented 4 years ago

Can you share a screenshot of your tensorboard logs with the training and validation curves, the attention maps, and the predicted mel-spectrograms?

camjac251 commented 4 years ago

I'm not sure where to find the attention map. I went back and forth with the hparams settings, and I think I even started over with the LJ Speech model as the starting point; this might be the result of that, I can't remember. I started with a warm start, and that ran for a few weeks, I believe.

[attached screenshots: Shadow_2020-07-09_21-11-48, Shadow_2020-07-09_21-14-18]

rafaelvalle commented 4 years ago

The validation loss is going up, which is evidence that your model is overfitting.
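
If you want to pull the raw numbers out of the event files rather than eyeball the TensorBoard UI, something like this works (a sketch; it assumes the tensorboard package is installed, that the event files live under outdir/logdir from the training command above, and the exact scalar tag names may differ):

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

acc = EventAccumulator("outdir/logdir")
acc.Reload()
print(acc.Tags()["scalars"])  # find the exact loss tag names first
# then e.g.: [(e.step, e.value) for e in acc.Scalars("validation.loss")]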

camjac251 commented 4 years ago

Does it need just more time and training data?

rafaelvalle commented 4 years ago

Take a look at issues related to overfitting in the tacotron2 repo. https://github.com/NVIDIA/tacotron2

camjac251 commented 4 years ago

Okay, thank you. I'll look for answers there. I've been able to generate audio that sounded like the voice I was training on, but some words in the sentence sounded a bit slurred or were missing from the generated audio. I was worried that it might've been my training set and that more training time wouldn't have helped.

rafaelvalle commented 4 years ago

Augment your data if you can.
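
For example, simple waveform-level augmentation with librosa, which the notebook already uses (a sketch of one possible approach, not something this repo prescribes; note that pitch shifting changes F0, which matters for a model conditioned on it):

import numpy as np
import librosa

# Hypothetical augmentation pass: make noisy and pitch-shifted copies of
# a clip to grow the effective dataset size.
def augment(path, sr=22050):
    y, _ = librosa.load(path, sr=sr)
    noisy = y + 0.002 * np.random.randn(len(y))                 # light noise
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=1)  # +1 semitone
    return noisy, shifted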