CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Pytorch synthesizer #447

Closed ghost closed 4 years ago

ghost commented 4 years ago

Splitting this off from #370, which will remain for tensorflow2 conversion. I would prefer this route if we can get it to work. Asking for help from the community on this one.

One example of a pytorch-based tacotron is: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2

Another option is to manually convert the code and pretrained models, which would be extremely time-consuming, but also an awesome learning experience.

ZeroCool940711 commented 4 years ago

@blue-fish I'm trying to run your PyTorch branch and I'm getting some errors when running demo_toolbox.py. I'm not sure if it's something I messed up on my computer, but I would appreciate some help. demo_cli.py seems to work, although I'm not completely sure since I haven't been able to test it fully; I'm still downloading the datasets to test it later, but it does start, which is more than I can say for the original version using Tensorflow xD. This is the error I'm getting when running demo_toolbox.py:

D:\ZeroCool\Projects\Python\Otros\Machine Learning\Voice Cloning\Real-Time-Voice-Cloning_pytorch>python demo_toolbox.py
Arguments:
    datasets_root:    None
    enc_models_dir:   encoder\saved_models
    syn_models_dir:   synthesizer\saved_models
    voc_models_dir:   vocoder\saved_models
    low_mem:          False
    seed:             None

Traceback (most recent call last):
  File "demo_toolbox.py", line 39, in <module>
    Toolbox(**vars(args))
  File "D:\ZeroCool\Projects\Python\Otros\Machine Learning\Voice Cloning\Real-Time-Voice-Cloning_pytorch\toolbox\__init__.py", line 63, in __init__
    self.ui = UI()
  File "D:\ZeroCool\Projects\Python\Otros\Machine Learning\Voice Cloning\Real-Time-Voice-Cloning_pytorch\toolbox\ui.py", line 455, in __init__
    fig, self.umap_ax = plt.subplots(figsize=(3, 3), facecolor="#F0F0F0")
  File "D:\Python\lib\site-packages\matplotlib\cbook\deprecation.py", line 451, in wrapper
    return func(*args, **kwargs)
  File "D:\Python\lib\site-packages\matplotlib\pyplot.py", line 1271, in subplots
    fig = figure(**fig_kw)
  File "D:\Python\lib\site-packages\matplotlib\pyplot.py", line 677, in figure
    **kwargs)
  File "D:\Python\lib\site-packages\matplotlib\pyplot.py", line 299, in new_figure_manager
    return _backend_mod.new_figure_manager(*args, **kwargs)
  File "D:\Python\lib\site-packages\matplotlib\backend_bases.py", line 3494, in new_figure_manager
    return cls.new_figure_manager_given_figure(num, fig)
  File "D:\Python\lib\site-packages\matplotlib\backend_bases.py", line 3499, in new_figure_manager_given_figure
    canvas = cls.FigureCanvas(figure)
TypeError: 'NoneType' object is not callable
Error in sys.excepthook:
Traceback (most recent call last):
  File "D:\ZeroCool\Projects\Python\Otros\Machine Learning\Voice Cloning\Real-Time-Voice-Cloning_pytorch\toolbox\__init__.py", line 70, in excepthook
    self.ui.log("Exception: %s" % exc_value)
AttributeError: 'Toolbox' object has no attribute 'ui'

Original exception was:
Traceback (most recent call last):
  File "demo_toolbox.py", line 39, in <module>
    Toolbox(**vars(args))
  File "D:\ZeroCool\Projects\Python\Otros\Machine Learning\Voice Cloning\Real-Time-Voice-Cloning_pytorch\toolbox\__init__.py", line 63, in __init__
    self.ui = UI()
  File "D:\ZeroCool\Projects\Python\Otros\Machine Learning\Voice Cloning\Real-Time-Voice-Cloning_pytorch\toolbox\ui.py", line 455, in __init__
    fig, self.umap_ax = plt.subplots(figsize=(3, 3), facecolor="#F0F0F0")
  File "D:\Python\lib\site-packages\matplotlib\cbook\deprecation.py", line 451, in wrapper
    return func(*args, **kwargs)
  File "D:\Python\lib\site-packages\matplotlib\pyplot.py", line 1271, in subplots
    fig = figure(**fig_kw)
  File "D:\Python\lib\site-packages\matplotlib\pyplot.py", line 677, in figure
    **kwargs)
  File "D:\Python\lib\site-packages\matplotlib\pyplot.py", line 299, in new_figure_manager
    return _backend_mod.new_figure_manager(*args, **kwargs)
  File "D:\Python\lib\site-packages\matplotlib\backend_bases.py", line 3494, in new_figure_manager
    return cls.new_figure_manager_given_figure(num, fig)
  File "D:\Python\lib\site-packages\matplotlib\backend_bases.py", line 3499, in new_figure_manager_given_figure
    canvas = cls.FigureCanvas(figure)
TypeError: 'NoneType' object is not callable

Edit: In case someone finds this comment because they are getting the same error: it is caused by a problem with matplotlib when it is installed via pip install matplotlib. It can be fixed by installing matplotlib with conda install matplotlib instead. It doesn't matter which version you install with pip, it will always give the same error, but any version installed with conda just works.

ghost commented 4 years ago

@ZeroCool940711 Thanks for testing the pytorch branch and reporting the error, along with a solution. I also see that error when running the toolbox on Python 3.8, but I don't get it on a different computer that has Python 3.7.

I added it to the list of known issues in #472 since I didn't want to get bogged down troubleshooting while testing the changes:

Toolbox doesn't launch on Python 3.8 due to error drawing the GUI. Let's address that in a different PR.

Searching on the error message suggests it may be a matplotlib backend issue. I'm going to try setting up the current toolbox on the machine that has this issue and confirm that it is not related to the pytorch synthesizer changes.
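
If it is a backend problem, a quick diagnostic (not part of the toolbox code, just a sketch) is to check which backend matplotlib selected and force a Qt backend before pyplot is imported:

# Quick diagnostic sketch, not toolbox code: check which backend matplotlib picked.
import matplotlib
print(matplotlib.get_backend())   # e.g. "agg" means no GUI backend was found

# Force the Qt backend (requires PyQt5, which the toolbox GUI also uses).
matplotlib.use("Qt5Agg")
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(3, 3))   # should no longer raise the TypeError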

ZeroCool940711 commented 4 years ago

@blue-fish I'm using Python 3.7 on Windows 10. I got demo_toolbox.py to run properly and tested it for a while, but then I stupidly decided to upgrade pip, which broke my whole Python installation. I'm going to sleep soon as it's late where I live, but when I wake up I will test everything from scratch with a clean installation of Python 3.7 using Anaconda and let you know how things go. It would probably be really useful to know how the code behaves with a clean installation.

ghost commented 4 years ago

@ZeroCool940711 Any information you can provide on this issue would be greatly appreciated. I get the same error on Python 3.8 and Ubuntu 20.04 but it seems specific to that computer.

ZeroCool940711 commented 4 years ago

@blue-fish Here is something I tried before going to sleep: I downgraded matplotlib to version 2.2.4 and it seems to work for me now, at least the GUI shows. Give it a try and let me know if that works. I will try more things later and see what else can fix it, of course after my beauty sleep; I actually need it, I've been scaring too many people recently.

ghost commented 4 years ago

@ZeroCool940711 Thank you for reporting that; unfortunately I get an error when downgrading to 2.2.4. Please let us know if you find other solutions that work.

@Ananas120 Do you have anything new to report on your model training? I would like to hear wavs at some point. I suggest you open a new issue (call it "Tensorflow 2.x implementation of SV2TTS" or similar) that you can use for sharing results and discussion pertaining to your models.

Have you continued to train your model or not? And if yes, do you have better results?

I am optimistic we can resolve the gap issues (#53) by training on a better dataset with fatchord's tacotron. At a low number of synthesizer steps I can already synthesize the entire "Welcome to the toolbox!..." text without any gaps. This is trained on a curated VCTK dataset, preprocessed to trim silences and with punctuation removed from the transcripts.
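
For reference, a minimal sketch of that kind of preprocessing (the exact trimming parameters and punctuation rule used for the curated VCTK set are not given in this thread, so these are assumptions):

# Illustrative preprocessing only; top_db and the punctuation regex are assumptions.
import re
import librosa

def preprocess_utterance(wav_path, transcript, sr=16000, top_db=30):
    # Load the audio and trim leading/trailing silence.
    wav, _ = librosa.load(wav_path, sr=sr)
    wav, _ = librosa.effects.trim(wav, top_db=top_db)
    # Strip punctuation from the transcript, keeping word characters, spaces and apostrophes.
    text = re.sub(r"[^\w\s']", "", transcript).strip()
    return wav, text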

If the latest experiment goes well (reverting to Corentin's encoder) then I will attempt to train a synthesizer on the full VCTK dataset. Then if that gets good results we can release it as a baseline model to get #472 merged.

Ananas120 commented 4 years ago

@blue-fish I am continuing my tests but the current model is not really good... I get an OOM error after 1 epoch and don't understand why, so training is slower and less efficient because I have to relaunch it.

I have tried different parameters for my loss, but I think the attention can't be learned because of my 20-frames-per-optimization-step training (most optimization steps see a 0 gate target), so I suppose it learns to generate only 0 and never 1 (hence bad audio, bad inference and no attention learned).

Another possibility is that it simply hasn't trained long enough, but I don't know how long it could take. I tried to embed my dataset with the encoder of this repo, but it takes so long that I stopped it.

I think I will look at your pytorch model and see if I can recode it in TF 2.0 and transfer your weights; I think it could be a better pretrained model than a partial transfer (like I'm doing now).

Edit: I just looked at your pytorch model, but it is so different from mine... I can't convert it easily, they are too different ^^' But you gave me a really good idea to avoid my partial transfer learning! I will create a Dense layer after the concatenation of the encoder output and the speaker embedding, so my attention mechanism will keep the full pretrained weights; I hope it will help! I also think I will train only the encoder and the Dense layer for a few steps (like 1 epoch or 1k steps), so that I don't "damage" the pretrained attention mechanism while pre-training the new Dense layer.
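
As a rough illustration of that idea (this is not Ananas120's actual code; all names and dimensions are assumptions):

# TF 2.x sketch: project the concatenated encoder output and speaker embedding
# back to the size the pretrained attention expects. Names/shapes are illustrative.
import tensorflow as tf

class EncoderWithSpeakerProjection(tf.keras.layers.Layer):
    def __init__(self, text_encoder, proj_units=512):
        super().__init__()
        self.text_encoder = text_encoder               # pretrained text encoder
        self.proj = tf.keras.layers.Dense(proj_units)  # new layer, trained first

    def call(self, text_inputs, speaker_embedding):
        enc_out = self.text_encoder(text_inputs)                 # (B, T, D_enc)
        spk = tf.tile(speaker_embedding[:, tf.newaxis, :],
                      [1, tf.shape(enc_out)[1], 1])              # (B, T, D_spk)
        return self.proj(tf.concat([enc_out, spk], axis=-1))     # (B, T, proj_units)

Freezing the pretrained decoder/attention for the first epoch or ~1k steps would then amount to passing only the encoder and projection variables to the optimizer.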

One really strange thing about your model is that it has no "gate projection", which is really useful for predicting when the model should stop during inference.

ghost commented 4 years ago

One really strange thing about your model is that it has no "gate projection", which is really useful for predicting when the model should stop during inference.

@Ananas120 please see: https://github.com/fatchord/WaveRNN/issues/74 for an explanation. Predicting empty frames actually works well enough in practice with a good model. With a bad or insufficiently trained model, it results in premature synthesis termination, which I have experienced during testing.
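
For context, here is a rough sketch of what a "stop on empty frames" criterion can look like (illustrative only, not the exact code in fatchord's repo): generation halts once the most recent decoder frames all sit near the silence floor.

# Illustrative stopping rule; the threshold, lookback and silence value are assumptions.
import numpy as np

def should_stop(mel_frames, silence_value=-4.0, threshold=0.2, lookback=10):
    if len(mel_frames) < lookback:
        return False
    recent = np.stack(mel_frames[-lookback:])                 # last few generated frames
    return bool(np.all(recent < silence_value + threshold))   # all close to silence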

Edit:

I just looked at your pytorch model, but it is so different from mine... I can't convert it easily, they are too different

fatchord's synthesizer is Tacotron 1, not Tacotron 2. Take a look at this repo; it is written in tensorflow 1.x and might be a closer match: https://github.com/keithito/tacotron

Ananas120 commented 4 years ago

Oh ok, now I understand better: yours is Tacotron 1 and mine is version 2...

I also understand your "no-gate" approach, but it will not work with my vocoder (I think?), which is based only on convolutions with no RNN. But this is not a problem; I will keep my idea to add a Dense layer and train the encoder without training the decoder for a few steps. I think it will be really interesting to see if it works better!

ZeroCool940711 commented 4 years ago

@blue-fish Sorry for taking so long to answer. I tried multiple things and it seems the matplotlib fix is not limited to version 2.2.4: I found that any version before 3.1.2 will work, at least on my computer; maybe it's different on Ubuntu. I will keep trying other things and keep you informed. If you can, try matplotlib version 3.1.2 and let me know if that works.

ghost commented 4 years ago

@ZeroCool940711 Thank you for the suggestion but matplotlib 3.1.2 doesn't work for me, and for some reason I am unable to install any version earlier than that.

I think we should troubleshoot the error and figure out what is causing it, instead of working around it with different matplotlib versions. If you have a working setup for the toolbox why don't you just go ahead and have fun with it, and come back to this later if you are still interested. I really appreciate your willingness to help out, but until several more people report this issue it might not be worth our time fixing it.

ghost commented 4 years ago

@ZeroCool940711 Let's continue the discussion about this error in #504 as it is most likely unrelated to the pytorch synthesizer.

ghost commented 4 years ago

There's a pytorch version of Rayhane's tacotron-2. Wish I found this earlier! https://github.com/begeekmyfriend/tacotron2

As a last resort we could add support for this tacotron2, and transfer over the pretrained weights from tensorflow.

Ananas120 commented 4 years ago

@blue-fish Actually, why do you want to convert the model to pytorch? Just for convenience, or because it can run faster on GPU? Also, when you run the toolbox, do all 3 models run on the GPU, or only the pytorch (or tf) ones?

Because when I try to run my model (tf) with my vocoder (pt), it always raises an OOM error, but when I run both models in pytorch there is no OOM, so I suppose pt and tf can't run in parallel, or have I missed something?

PS: If you want to see the new results of my tf implementation, I update the first message in #507 every epoch. I am currently training the model with embeddings produced by the encoder of this repo and have a loss of 1.14 at epoch 4, step 500, so not as good as with my encoder, but it decreases more slowly because the embedding is size 264 (my embeddings were size 64). So I hope this model trains more slowly but can ultimately reach a lower loss.

Ananas120 commented 4 years ago

I just thought about this: my tensorflow-to-pytorch converter is based on variable names (I think, or it's really easy to adapt it to be), so you can just take the checkpoint, extract the variables with my code above, pass them to the converter along with the pytorch model from the repo you found, and then it should be good ;)
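
For anyone following along, here is a minimal sketch of name-based weight transfer (the converter code referenced above is not reproduced here; the TF-to-PyTorch name mapping is model-specific and hypothetical):

# Hedged sketch: read checkpoint variables by name and copy them into a PyTorch model.
# `name_map` (TF variable name -> PyTorch parameter name) must be built per model.
import tensorflow as tf
import torch

def transfer_weights(tf_ckpt_path, torch_model, name_map):
    reader = tf.train.load_checkpoint(tf_ckpt_path)
    state_dict = torch_model.state_dict()
    for tf_name, pt_name in name_map.items():
        tensor = reader.get_tensor(tf_name)
        # Dense kernels usually need transposing between TF (in, out) and PyTorch (out, in).
        value = torch.from_numpy(tensor)
        state_dict[pt_name] = value.t() if value.ndim == 2 else value
    torch_model.load_state_dict(state_dict)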

ghost commented 4 years ago

@blue-fish Actually, why do you want to convert the model to pytorch?

@Ananas120 The synthesizer is written in tensorflow 1.x. There are 2 big problems with this:

  1. The tensorflow 1.15 released binaries require CUDA 10.0 for GPU acceleration. This is an old version and only a narrow range of NVIDIA drivers will support it.
  2. TF 1.15 is not available for Python 3.8+.

Toolbox setup is getting harder over time because of these 2 issues. We have the choice of upgrading to TF 2.x (#370) or using a pytorch synthesizer (this issue). There is a preference for pytorch since the encoder and vocoder already use it. The vocoder is the slowest part by far so speed doesn't matter as much for the synth.

Ananas120 commented 4 years ago

Ok, I see, I didn't know you had already tried to convert the model to tf 2.x. The easiest way, if you want to do this, is to copy the pytorch implementation you found and rewrite it in tf 2.x, because the 2.x API is really similar to pytorch (layer names, arguments, functions, model subclassing, ...), so it can be really easy to convert. (Converting the original 1.x model is really hard because tf 1.x is really different.)

The current checkpoint should work with the 2.x implementation if you manage to use the same layer names (or you can convert the checkpoint with my converter code).

ghost commented 4 years ago

Thank you for that suggestion. There is a tensorflow 2.x version of our synthesizer already (without the SV2TTS modifications for speaker embedding): https://github.com/TensorSpeech/TensorFlowTTS

That should be a better starting point. The model is actually the easy part; getting it properly interfaced with the rest of our code is a lot of work. I'm not up to the task of rewriting the supporting infrastructure in TF 2.x, but I'll take a look to get a better understanding of how much work it would entail.

Ananas120 commented 4 years ago

I think it will not be as big a job as you think, because tf 2.x is really similar to pytorch for inference. For the existing code you should just modify the inference methods by replacing the sess.run(...) call with a model(...) call (like in pytorch); I think it can work with just those changes (for inference).
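
A minimal before/after sketch of that change (the names are illustrative, not from the repo):

# TF 1.x session-based inference, roughly what the current synthesizer does:
#   mels = session.run(model.mel_outputs, feed_dict={text_placeholder: texts, ...})
# TF 2.x eager equivalent, assuming a subclassed Keras model that takes text sequences
# and speaker embeddings as inputs:
mels = model([text_sequences, speaker_embeddings], training=False)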

For training it's a little more complicated, I think, but you can take inspiration from my GitHub to see how an optimization step works in tf 2.x; it's not that difficult (it's really similar to a pytorch optimization step, in fact):

import tensorflow as tf

# `model`, `optimizer` and `loss_fn` are assumed to be defined elsewhere.
def optimize_step(inputs, target):
    with tf.GradientTape() as tape:
        # Forward pass in training mode.
        y_pred = model(inputs, training=True)
        loss = loss_fn(target, y_pred)

    # Backward pass: compute gradients and apply them.
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

    return loss

For the loss function, you can copy the one from the NVIDIA repo (or I can share mine if you want, it's inspired by NVIDIA's); it's just the sum of an MSE on the mel and postnet outputs and a BCE on the gate.

You can have something like this:

def loss_fn(y_true, y_pred):
    mel, gate = y_true
    mel_pred, mel_postnet_pred, gate_pred = y_pred

    # MSE on the decoder output and on the postnet-refined output.
    mel_loss = tf.reduce_mean(tf.square(mel - mel_pred))
    mel_postnet_loss = tf.reduce_mean(tf.square(mel - mel_postnet_pred))
    # BCE on the stop-token ("gate") prediction, reduced to a scalar.
    gate_loss = tf.reduce_mean(tf.keras.losses.binary_crossentropy(gate, gate_pred))

    return mel_loss + mel_postnet_loss + gate_loss
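
Wiring the two snippets above together would look roughly like this (the dataset format is an assumption):

# Hypothetical training loop using optimize_step and loss_fn from above.
# `dataset` is assumed to yield (inputs, (mel_target, gate_target)) batches.
for epoch in range(num_epochs):
    for inputs, targets in dataset:
        loss = optimize_step(inputs, targets)
    print(f"epoch {epoch}: last batch loss = {float(loss):.4f}")
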
Ananas120 commented 4 years ago

Oh shit! I just discovered that my model was not as bad as I thought... In fact, when I made predictions during training, I forgot to remove the padding, so the output of my vocoder was bad, not because the spectrogram was bad, but because the padding "damages" the whole output audio...

So I removed the padding and the audio quality is actually not so bad with a loss around 1.05, so... I think my old models with a 0.9 loss could have been really good (but I can't test them, I deleted them...).

The main issue now (also present in the old models) is the gate loss and the attention mechanism: the gate loss doesn't decrease below 0.01 and the attention mechanism doesn't seem to be learned, so in the end the model is bad at inference. If you have ideas for learning the gate part faster, you are welcome!

Some ideas I will test this week:

My vacation ends in less than 2 weeks, so I will not be able to run many tests, but I hope one of these ideas will work!
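
One common remedy for the gate imbalance described above (a standard trick, not something proposed in this thread) is to weight the rare positive gate frames more heavily in the BCE term:

# Class-weighting sketch; pos_weight and the variable names are assumptions.
# Note: this op expects raw (pre-sigmoid) gate logits rather than probabilities.
import tensorflow as tf

gate_loss = tf.reduce_mean(
    tf.nn.weighted_cross_entropy_with_logits(labels=gate_target,
                                             logits=gate_logits,
                                             pos_weight=10.0))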

ghost commented 4 years ago

It's good you found the problem instead of throwing away your model! I had a similar experience where my in-progress synthesizer was scaling to the wrong range ([-1, 1] instead of [-4, 4]) and padding mels with the wrong value (0 instead of -4). When starting out, it is helpful to invert with a basic algorithm like Griffin-Lim to work out these kinds of issues.
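
A quick sanity check of that kind could look like this (the STFT parameters below are assumptions, not the repo's hparams, and the mel is assumed to already be denormalized back to dB):

# Hedged Griffin-Lim sanity-check sketch; sr, n_fft and hop_length are illustrative.
import librosa

def griffin_lim_check(mel_db, sr=16000, n_fft=800, hop_length=200):
    mel_power = librosa.db_to_power(mel_db)          # dB -> power mel spectrogram
    # Invert mel -> audio with Griffin-Lim to listen for gaps or padding artifacts.
    return librosa.feature.inverse.mel_to_audio(mel_power, sr=sr, n_fft=n_fft,
                                                hop_length=hop_length)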

Could you share some wavs of your synthesizer in #507 when you get a chance?

Ananas120 commented 4 years ago

Oooh, maybe it's because I pad mels with 0 that the vocoder is affected by them; maybe I should pad with another value... I will investigate this because it could really speed up my inference! (Currently I do all vocoder inference with 1 audio per batch, so not really optimal.)

Yes, of course. My training ends tomorrow morning, so I will share the 4 last predictions if you want (I will see whether inference is interesting; if not, I will post predictions with teacher forcing and the target so you can compare original, prediction and inference). I hope my gate loss will decrease below 0.01; it has never gone below that until now... (except in one old model where it got down to around 0.0095).

Edit: It's strange, I just looked at the latest prediction (step 12k) and the attention mechanism seems to work well with teacher forcing, but during inference... nothing (and the gate loss is still at 0.011). I suppose the cause is my custom training procedure, with many steps where the gate is 0 and only a few steps (the last ones) where the gate is 1, so it just predicts 0 every time... but that doesn't explain why inference is so bad.

Ananas120 commented 4 years ago

@blue-fish I tried some things but nothing works; the prediction is not so bad but inference is always bad (only noise or silence)... Do you think it may be because I don't trim silences?

ghost commented 4 years ago

Closing this issue (can follow progress in #472)