as-ideas / ForwardTacotron

⏩ Generating speech in a single forward pass without any attention!
https://as-ideas.github.io/ForwardTacotron/
MIT License

Vocoder ParallelWaveGAN #39

Closed lysektomas closed 3 years ago

lysektomas commented 3 years ago

Hi, I have trained a model for our language using your repository. MelGAN as the vocoder was learning very slowly, so I used https://github.com/kan-bayashi/ParallelWaveGAN

Have you tried to use it with this vocoder?

I have trained a new model using your default configuration and this configuration from ParallelWaveGAN. Did I miss something?

The final audio is here ("Hello how are you"). When using Griffin-Lim as the vocoder, it works fine.

I don't see fft_bins or bits in the ParallelWaveGAN configuration; is this the problem?

Thanks a lot for your work! Tom

lysektomas commented 3 years ago

I have tried the default MelGAN vocoder (from https://github.com/seungwonpark/melgan). I synthesized "hello world" four times using my model.

The output from the MelGAN vocoder sounds reversed: https://soundcloud.com/tom-lysek-832824742/hello-world-melgan/s-liaMpH3Ijx5

The output from the Griffin-Lim vocoder is great.

This is the plot of the mel spectrogram:

The transformation from mels to wav was performed using this code:

```python
m = torch.tensor(mels).unsqueeze(0)
with torch.no_grad():
    if len(m.shape) == 2:
        m = m.unsqueeze(0)
    wav = model.inference(m).cpu().numpy()

ipd.Audio(wav, rate=hp.sample_rate)
```

and `mels` is the `m` from the synthesize function in notebook_utils/synthesize.py.

Do you have any thoughts? Tom

cschaefer26 commented 3 years ago

Hi, the synthesize.py from notebook_utils already outputs mels in the correct format for MelGAN; maybe you are expanding too much? In the Colab notebook it uses the default MelGAN model for LJSpeech. PWGAN should actually work similarly, I believe (it has the same preprocessing as MelGAN). Also, if you call gen_forward.py, use the melgan flag to produce .mel files that are already in the correct format for MelGAN (see the README).

That's the synthesize.py mel processing for MelGAN:

    m = torch.tensor(m).unsqueeze(0).cuda()
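To illustrate the "expanding too much" point: if `m` is already 2-D `(n_mels, T)`, one `unsqueeze(0)` yields the `(1, n_mels, T)` batch MelGAN expects, and a second one over-expands it to 4-D. A minimal torch-free sketch (the `shape` and `unsqueeze0` helpers below are hypothetical stand-ins for `tensor.shape` and `tensor.unsqueeze(0)`):

```python
def shape(x):
    """Return the nested-list shape, e.g. an (80, 5) mel -> (80, 5)."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)

def unsqueeze0(x):
    """Add a leading batch dimension, like tensor.unsqueeze(0)."""
    return [x]

mel = [[0.0] * 5 for _ in range(80)]  # a fake (80, 5) mel spectrogram
m = unsqueeze0(mel)                   # (1, 80, 5): correct batch for the vocoder
m_bad = unsqueeze0(m)                 # (1, 1, 80, 5): unsqueezed once too often

print(shape(m))      # (1, 80, 5)
print(shape(m_bad))  # (1, 1, 80, 5)
```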

lysektomas commented 3 years ago

I get the same result when I use gen_forward.py to produce the .mel file and then convert it to wav with inference.py from melgan as when I use the Jupyter notebook.

I have tried to use your synthesize function with this setup:

[image]

And the result is the same.

If I swap this bad vocoder for the default vocoder (from your example), the result is this: https://soundcloud.com/tom-lysek-832824742/hello-world-default-melgan/s-zpnjVYgxrAu

So I assume that this vocoder was trained with some error.

I was using these hyperparameters (the defaults):

[image]

Left is hparams from ForwardTacotron, right is default.yaml from the melgan repository.

Did I miss something? Is there some misconfiguration? The only difference is filter_length vs. fft_bins, but I don't think that's the problem :/
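A quick way to confirm the two configs really agree is to diff the audio parameters programmatically, treating fft_bins and filter_length as the same setting (they are likely just two names for the FFT size). A minimal sketch; the dicts and values below are illustrative placeholders, not the actual contents of hparams.py or default.yaml:

```python
# Illustrative stand-ins for the ForwardTacotron hparams and melgan default.yaml.
forward_taco = {"sample_rate": 22050, "num_mels": 80,
                "hop_length": 256, "win_length": 1024, "fft_bins": 1024}

melgan = {"sample_rate": 22050, "num_mels": 80,
          "hop_length": 256, "win_length": 1024, "filter_length": 1024}

# Map parameter names that differ between the two repos but mean the same thing.
aliases = {"fft_bins": "filter_length"}

mismatches = {}
for key, value in forward_taco.items():
    other = melgan.get(aliases.get(key, key))
    if other != value:
        mismatches[key] = (value, other)

print(mismatches)  # {} -> the configs agree
```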

I used sox to check whether there is a problem with the input files, but they have the same parameters:

[image]

1.wav is from my own dataset, LJXXX.wav is from LJSpeech.

cschaefer26 commented 3 years ago

Hmm, I don't see a mismatch. Does the pretrained ForwardTacotron model work for you with the standard MelGAN?

lysektomas commented 3 years ago

I have used VocGAN and everything was fine. Maybe the model was trained badly.