fatchord / WaveRNN

WaveRNN Vocoder + TTS
https://fatchord.github.io/model_outputs/
MIT License

About future updates #92

Closed: CorentinJ closed this issue 5 years ago

CorentinJ commented 5 years ago

I've been working on reproducing arXiv:1806.04558 as part of my master's thesis. I don't know if you're familiar with the paper, but it describes a framework based on Tacotron 2 plus an additional speaker encoder model that derives a speaker embedding from a short utterance; that embedding is used to condition Tacotron and achieve voice cloning.
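For anyone unfamiliar with the paper, here's a rough PyTorch sketch of the conditioning step as I understand it: the speaker encoder maps a short reference utterance to a fixed-size embedding, which is broadcast and concatenated onto the Tacotron encoder outputs at every timestep. The function and tensor names below are illustrative, not taken from any of the repos discussed here.

```python
# Rough sketch of SV2TTS-style conditioning; names are illustrative,
# not from this repo or the paper's reference implementation.
import torch

def condition_on_speaker(encoder_outputs, speaker_embedding):
    """Concatenate a per-utterance speaker embedding onto every
    timestep of the Tacotron encoder outputs.

    encoder_outputs:   (batch, time, channels)
    speaker_embedding: (batch, embed_dim)
    """
    emb = speaker_embedding.unsqueeze(1)               # (batch, 1, embed_dim)
    emb = emb.expand(-1, encoder_outputs.size(1), -1)  # (batch, time, embed_dim)
    return torch.cat((encoder_outputs, emb), dim=-1)   # (batch, time, channels + embed_dim)
```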

When I started working on this project, the best Tacotron 2 implementation I could find was https://github.com/Rayhane-mamah/Tacotron-2. I wasn't too happy about having to integrate a TensorFlow model into my PyTorch project, but in the end that's what I rolled with. I wanted the final product to run in real time or near real time, so I looked into vocoders faster than WaveNet. Someone directed me to this repo, and I've been using your vocoder since. I'm planning to open-source the project in just a few days.

But now I see that you've also included Tacotron and that you're planning to do more than that. My question is whether you intended to work on that very same paper I linked. If so, we could possibly collaborate and merge my work into your repo, depending on whether you like what I've done. If you had something else entirely in mind, then I will just keep my implementation going in my own repo. Note that if your Tacotron implementation is doing well (I haven't taken the time to look at it yet; I'm busy trying to finish my work in time), I will likely use it instead of the TensorFlow one I currently have, so as to have a full PyTorch repo.

fazlekarim commented 5 years ago

I’m actually really excited to see your implementation of that paper. Let me know when you open-source your code. It would definitely help the community.

KonstantinosMarko commented 5 years ago

@CorentinJ I was also using Tacotron-2 from Rayhane. I was trying to train WaveRNN on the mel specs produced by Rayhane's preprocessing, and also on the GTA (ground-truth aligned) mel specs, with no success. This implementation normalizes the waveform in preprocessing before turning it into a mel spectrogram. If you find any useful transformation for the mels produced by Rayhane's repo that makes it possible to combine Tacotron-2 and WaveRNN, I would like to know. Thank you, and good luck with your project!
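In case it helps anyone debugging the same mismatch, a quick sanity check I'd suggest (the helper below is my own, not from either repo): compare the shape and value range of mels from the two pipelines, since differing normalization schemes put them on very different scales.

```python
# My own helper, not from either repo: mels from two pipelines should have
# matching per-frame shapes and similar value ranges before Tacotron's
# output can be fed to WaveRNN.
import numpy as np

def describe_mel(name, mel):
    print(f"{name}: shape={mel.shape}, min={mel.min():.3f}, "
          f"max={mel.max():.3f}, mean={mel.mean():.3f}")

# e.g. describe_mel("rayhane", np.load("rayhane_mel.npy"))
#      describe_mel("fatchord", np.load("fatchord_mel.npy"))
```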

CorentinJ commented 5 years ago

I did find a few things (some of which I mention in #90). I'm currently running 8 jobs on 8 GPUs with different options to see what works and what doesn't.
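The sweep itself is nothing fancy; roughly one training process per GPU, each pinned with CUDA_VISIBLE_DEVICES. The command line below is a placeholder, not this repo's actual CLI.

```python
# Placeholder launcher: the train.py flags are hypothetical, but pinning
# each process to one GPU via CUDA_VISIBLE_DEVICES is the standard trick.
import os
import subprocess

commands = [f"python train.py --config configs/variant_{i}.yaml"  # hypothetical CLI
            for i in range(8)]

procs = []
for gpu_id, cmd in enumerate(commands):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))  # one GPU per job
    procs.append(subprocess.Popen(cmd.split(), env=env))

for p in procs:
    p.wait()  # block until every sweep job has finished
```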

KonstantinosMarko commented 5 years ago

Thank you for your answer.

I just observed that in preprocess.py, fatchord does the following:

```python
def convert_file(path):
    y = load_wav(path)
    peak = np.abs(y).max()
    if hp.peak_norm or peak > 1.0:
        y /= peak
```

and after this transformation he extracts the mel spectrogram.

CorentinJ commented 5 years ago

So does Rayhane; it's just volume (peak) normalization. You're supposed to normalize before you generate the mel, then save the normalized audio (with Rayhane's code). Then you won't need to normalize again with fatchord's code.
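Roughly, the intended order of operations looks like the sketch below, with librosa/soundfile standing in for the repos' own I/O helpers; the sr and n_mels values are illustrative, not this repo's exact defaults.

```python
# Sketch of the workflow described above: peak-normalize FIRST, extract the
# mel from the normalized signal, and save the normalized audio so the
# vocoder trains on the same waveform the mel came from.
import librosa
import numpy as np
import soundfile as sf

def preprocess_utterance(in_path, out_path, sr=22050, n_mels=80):
    y, _ = librosa.load(in_path, sr=sr)
    peak = np.abs(y).max()
    if peak > 0:
        y = y / peak                                   # normalize before the mel
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    sf.write(out_path, y, sr)                          # keep the normalized wav
    return mel                                         # mel matches the saved audio
```

The point is just that the vocoder must train on audio that matches the mels, so normalize once, up front, and keep both in sync.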

KonstantinosMarko commented 5 years ago

awww i see. ok thank you :)

CorentinJ commented 5 years ago

Here is my project: https://github.com/CorentinJ/Real-Time-Voice-Cloning

fazlekarim commented 5 years ago

@CorentinJ, just went through your YouTube video on it. AMAZING WORK! great job!