fatchord / WaveRNN

WaveRNN Vocoder + TTS
https://fatchord.github.io/model_outputs/
MIT License

About future updates #92

Closed: CorentinJ closed this issue 5 years ago

CorentinJ commented 5 years ago

I've been working on reproducing arXiv:1806.04558 as part of my master's thesis. I don't know if you're familiar with the paper, but it describes a framework based on Tacotron 2 plus an additional speaker encoder model that derives a speaker embedding from a short utterance; that embedding is used to condition Tacotron and achieve voice cloning.
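For anyone unfamiliar with the paper, here's a rough PyTorch sketch of the conditioning step as I understand it: the speaker encoder maps a short reference utterance to a fixed-size embedding, which is broadcast and concatenated onto the Tacotron encoder outputs at every timestep. The function and tensor names below are illustrative, not taken from any of the repos discussed here.

```python
# Rough sketch of SV2TTS-style conditioning; names are illustrative,
# not from this repo or the paper's reference implementation.
import torch

def condition_on_speaker(encoder_outputs, speaker_embedding):
    """Concatenate a per-utterance speaker embedding onto every
    timestep of the Tacotron encoder outputs.

    encoder_outputs:   (batch, time, channels)
    speaker_embedding: (batch, embed_dim)
    """
    emb = speaker_embedding.unsqueeze(1)               # (batch, 1, embed_dim)
    emb = emb.expand(-1, encoder_outputs.size(1), -1)  # (batch, time, embed_dim)
    return torch.cat((encoder_outputs, emb), dim=-1)   # (batch, time, channels + embed_dim)
```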

When I started working on this project, the best Tacotron 2 implementation I could find was https://github.com/Rayhane-mamah/Tacotron-2. I wasn't too happy about having to integrate a TensorFlow model into my PyTorch project, but in the end that's what I rolled with. I wanted the final product to run in real time or near real time, so I looked into vocoders faster than WaveNet. Someone directed me to this repo, and I've been using your vocoder since. I'm planning to open-source the project in just a few days.

But now I see that you've also included Tacotron and that you're planning to do more than that. My question is whether you intended to work on that very same paper I linked. If so, we could possibly collaborate and merge my work into your repo, depending on whether you like what I've done. If you had something else entirely in mind, then I will just keep my implementation going in my own repo. Note that if your Tacotron implementation is doing well (I haven't taken the time to look at it yet; I'm busy trying to finish my work in time), I will likely use it instead of the TensorFlow one I currently have, so as to have a full PyTorch repo.

fazlekarim commented 5 years ago

I’m actually really excited to see your implementation of that paper. Let me know when you open-source your code. It would definitely help the community.

KonstantinosMarko commented 5 years ago

@CorentinJ I was also using Tacotron-2 from Rayhane. I was trying to train WaveRNN on the mel specs produced by Rayhane's preprocessing, and also on the GTA (ground-truth aligned) mel specs, with no success. This implementation normalizes the waveform in preprocessing before turning it into a mel spectrogram. If you find any useful transformation for the mels produced by Rayhane's repo that makes it possible to combine Tacotron-2 and WaveRNN, I would like to know. Thank you, and good luck with your project!
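In case it helps anyone debugging the same mismatch, a quick sanity check I'd suggest (the helper below is my own, not from either repo): compare the shape and value range of mels from the two pipelines, since differing normalization schemes put them on very different scales.

```python
# My own helper, not from either repo: mels from two pipelines should have
# matching per-frame shapes and similar value ranges before Tacotron's
# output can be fed to WaveRNN.
import numpy as np

def describe_mel(name, mel):
    print(f"{name}: shape={mel.shape}, min={mel.min():.3f}, "
          f"max={mel.max():.3f}, mean={mel.mean():.3f}")

# e.g. describe_mel("rayhane", np.load("rayhane_mel.npy"))
#      describe_mel("fatchord", np.load("fatchord_mel.npy"))
```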

CorentinJ commented 5 years ago

I did find a few things (some of which I mention in #90). I'm currently running 8 jobs on 8 GPUs with different options to see what works and what doesn't.
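The sweep itself is nothing fancy; roughly one training process per GPU, each pinned with CUDA_VISIBLE_DEVICES. The command line below is a placeholder, not this repo's actual CLI.

```python
# Placeholder launcher: the train.py flags are hypothetical, but pinning
# each process to one GPU via CUDA_VISIBLE_DEVICES is the standard trick.
import os
import subprocess

commands = [f"python train.py --config configs/variant_{i}.yaml"  # hypothetical CLI
            for i in range(8)]

procs = []
for gpu_id, cmd in enumerate(commands):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))  # one GPU per job
    procs.append(subprocess.Popen(cmd.split(), env=env))

for p in procs:
    p.wait()  # block until every sweep job has finished
```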

KonstantinosMarko commented 5 years ago

Thank you for your answer.

I just observed that in preprocess.py, fatchord does the following:

```python
def convert_file(path):
    y = load_wav(path)
    peak = np.abs(y).max()
    if hp.peak_norm or peak > 1.0:
        y /= peak
```

and after this transformation he extracts the mel spectrogram.

CorentinJ commented 5 years ago

So does Rayhane; it's just volume (peak) normalization. You're supposed to normalize before you generate the mel, then save the normalized audio (with Rayhane's code). Then you won't need to normalize again with fatchord's code.
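Roughly, the intended order of operations looks like the sketch below, with librosa/soundfile standing in for the repos' own I/O helpers; the sr and n_mels values are illustrative, not this repo's exact defaults.

```python
# Sketch of the workflow described above: peak-normalize FIRST, extract the
# mel from the normalized signal, and save the normalized audio so the
# vocoder trains on the same waveform the mel came from.
import librosa
import numpy as np
import soundfile as sf

def preprocess_utterance(in_path, out_path, sr=22050, n_mels=80):
    y, _ = librosa.load(in_path, sr=sr)
    peak = np.abs(y).max()
    if peak > 0:
        y = y / peak                                   # normalize before the mel
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    sf.write(out_path, y, sr)                          # keep the normalized wav
    return mel                                         # mel matches the saved audio
```

The point is just that the vocoder must train on audio that matches the mels, so normalize once, up front, and keep both in sync.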

KonstantinosMarko commented 5 years ago

awww i see. ok thank you :)

CorentinJ commented 5 years ago

Here is my project: https://github.com/CorentinJ/Real-Time-Voice-Cloning

fazlekarim commented 5 years ago

@CorentinJ, just went through your YouTube video on it. AMAZING WORK! great job!