DigitalPhonetics / IMS-Toucan

Controllable and fast Text-to-Speech for over 7000 languages!
Apache License 2.0

New ToucanTTS model gives far worse results after finetuning #134

Closed. Ca-ressemble-a-du-fake closed this issue 3 months ago.

Ca-ressemble-a-du-fake commented 1 year ago

Hi,

I tried to finetune the new Meta model on an 87-sample French dataset that I have used several times already, but now the results are very bad.

I did not change any of the parameters (so batch size 12, 5000 steps, only lang = "fr"). I listened to the samples after 1k steps and after 5k steps, but the results can barely be understood, and sometimes not at all.

I tried both Avocodo and BigVGAN without any difference (they are both bad).

The dataset is 16 kHz: could that be causing this issue?

Thanks in advance for your help :smile:

Flux9665 commented 1 year ago

Hmm, that is strange. In my testing, finetuning worked very well. The 16 kHz is not the problem; the spectrograms are in 16 kHz anyway. What do the spectrograms look like? Do they look ok? And how are the losses looking? Are they going down? Especially the L1 loss (spectrogram reconstruction) and the glow loss (the normalizing flow that post-processes the outputs) are interesting.

Ca-ressemble-a-du-fake commented 1 year ago

I looked at the spectrograms: the tips are moving from left to right, but the horizontal stripes are there. I haven't looked at the losses yet; I'll let you know. I noticed that the lr was not specified in the finetuning example, so the default 1e-3 was used, whereas previously it used 1e-5. And it sounded like catastrophic forgetting. So I will try again with 1e-5.

By the way, should I reduce the warmup steps to 0 while finetuning so that it does not forget anything?

Flux9665 commented 1 year ago

Yes, this sounds a lot like catastrophic forgetting due to too high a learning rate. I updated the finetuning script to use 1e-5 as the default learning rate, just to be safe. For larger amounts of data, it will still work with 1e-5; it will just not be as fast as it could be, but I want to prioritise the chances of finetuning working.

Warmup is just a measure to prevent drastic weight changes at the beginning of training, while the variance in the weights is still high. It's not really necessary for finetuning, but I think it might provide a little bit of safety against sudden model collapse, so I think 500 steps is still appropriate.
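As a toy illustration of these two knobs (not Toucan's actual training loop), a low finetuning learning rate combined with a short linear warmup could look like this in plain PyTorch:

```python
import torch

# Toy illustration only: a stand-in model, a conservative learning rate, and a
# linear warmup over the first 500 steps, as discussed above.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
warmup_steps = 500
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))

for step in range(1000):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 10)).pow(2).mean()  # dummy loss
    loss.backward()
    optimizer.step()
    scheduler.step()  # learning rate ramps from ~0 up to 1e-5 over warmup_steps
```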

Ca-ressemble-a-du-fake commented 1 year ago

To answer your previous questions, the L1 loss, glow loss, and spectrogram (before or after postprocessing, I don't remember for sure) look like the following:

[Screenshots: L1 loss curve, glow loss curve, and spectrogram]

The L1 loss is decreasing, which is what you want, right? The glow loss looks chaotic, but it is zoomed in a lot. What is essential to look at in these graphs to tell whether everything is good, or whether it could be improved slightly / tremendously?

thoraxe commented 1 year ago

Where do I find the loss images? I've just tried a training run with the defaults on the ToucanTTS model with a new dataset, and it sounds nothing like the original voice. I haven't gone back to try the previous version (v2.4) with PortaSpeech yet. I built a PortaSpeech model with a comparatively tiny (different) dataset and it produced fairly convincing results.

Note: I have a lot of very short audio files, so I'm not sure if that contributes to the problem.
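If the very short clips turn out to be a problem, one generic workaround (not something built into Toucan, as far as this thread goes) is to drop them from the path-to-transcript dict before corpus preparation; a minimal sketch using soundfile, with an arbitrary one-second threshold:

```python
import soundfile as sf

def filter_short_clips(path_to_transcript, min_seconds=1.0):
    """Keep only clips that are at least min_seconds long (threshold chosen arbitrarily)."""
    kept = {}
    for path, text in path_to_transcript.items():
        info = sf.info(path)
        if info.frames / info.samplerate >= min_seconds:
            kept[path] = text
    return kept
```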

Ca-ressemble-a-du-fake commented 1 year ago

@thoraxe the loss images are from wandb. You have to set up an account and then pass the parameter --wandb.
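If you would rather pull the curves programmatically than use the wandb web UI, something along these lines should work; the run path "my-entity/my-project/abc123" is a placeholder for your own run, and the metric names are simply whatever the training loop logged:

```python
import matplotlib.pyplot as plt
import wandb

run = wandb.Api().run("my-entity/my-project/abc123")  # placeholder run path from the wandb UI
history = run.history()  # pandas DataFrame with one column per logged metric

loss_columns = [c for c in history.columns if "loss" in c.lower()]
history[loss_columns].plot(subplots=True, figsize=(8, 2 * len(loss_columns)))
plt.savefig("loss_curves.png")
```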

thoraxe commented 1 year ago

I noticed that my dataset was unclean and contained quite a number of voices that were not from the same speaker. I just fixed my dataset and am running training again with v2.4. When that is finished, I'll evaluate the performance and then try to train again with v2.5 on the fixed dataset to see if the issue goes away.

Flux9665 commented 1 year ago

> The L1 loss is decreasing, which is what you want, right? The glow loss looks chaotic, but it is zoomed in a lot. What is essential to look at in these graphs to tell whether everything is good, or whether it could be improved slightly / tremendously?

Yes, exactly: if the L1 loss is decreasing, everything is good. And the glow loss is not changing much because this part generally does not change much during finetuning; it only decreases steadily during the initial training from scratch. What can happen is that either of the two losses suddenly spikes up to a value much greater than any previous value. In that case the model has collapsed and the output becomes meaningless. The model usually cannot recover from such a failure, and it is not easy to figure out why it happened, especially when it happens in the glow loss.

https://api.wandb.ai/links/flux9665/bcxedjg5

Here the training was going well.

https://api.wandb.ai/links/flux9665/6220k51u

And here the glow loss suddenly spiked and the outputs became meaningless.
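As an illustration of that failure criterion, here is a small sketch that flags the step at which a logged loss jumps far above anything seen before; the factor of 5 is an arbitrary threshold, not something taken from the Toucan code:

```python
def detect_collapse(losses, factor=5.0):
    """Return the first step whose loss is far above every previous value, else None."""
    running_max = losses[0]
    for step, value in enumerate(losses[1:], start=1):
        if value > factor * running_max:
            return step  # likely model collapse around this step
        running_max = max(running_max, value)
    return None

# Example: a glow loss that suddenly explodes at step 4.
print(detect_collapse([1.2, 1.1, 1.0, 0.9, 37.0]))  # -> 4
```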

The spectrogram you attached looks very nice; I think it should sound pretty good. Are the finetuning problems now resolved with the lowered learning rate? Or is the performance still worse than it was for you? I am experimenting with smaller model configurations, which should in theory be easier to adapt and require fewer datapoints, because they consist of fewer parameters.

thoraxe commented 1 year ago

OK, I am running a fine-tuning job with v2.5, and the initial results are atrociously bad. I more or less have a reproducer, with the dataset. Fine-tuning the PortaSpeech model with v2.4 works great. Fine-tuning the ToucanTTS model with v2.5 produces something that sounds absolutely nothing like the original source.

Let me know what troubleshooting details/information you would like. Here's kinda-sorta the reproducer:

https://github.com/OpenShiftDemos/ToucanTTS-RHODS-voice-cloning

That repo contains a metadata generator script which is used against https://gitlab.com/mr_belowski/CrewChiefV4/-/tree/master to build the dictionary and clean out unwanted audio files.
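For context, here is a hypothetical sketch of the LJSpeech-style layout such a metadata generator might produce, and which build_path_to_transcript_dict_generic_ljspeech (used in the script below) would then consume; the exact metadata format (a metadata.csv of "id|transcript" lines next to a wavs/ folder) is an assumption on my part:

```python
import os

def build_ljspeech_style_dict(root_path):
    """Map absolute wav paths to transcripts from an LJSpeech-style metadata.csv (assumed layout)."""
    path_to_transcript = {}
    with open(os.path.join(root_path, "metadata.csv"), encoding="utf8") as metadata:
        for line in metadata:
            clip_id, transcript = line.strip().split("|", maxsplit=1)
            path_to_transcript[os.path.join(root_path, "wavs", clip_id + ".wav")] = transcript
    return path_to_transcript
```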

It also contains the representative fine-tuning example script used against the dataset.

Here is the fine-tuning script used for Porta and v2.4:

"""
Example script for fine-tuning the pretrained model to your own data.

Comments in ALL CAPS are instructions
"""

import os
import random
import time

import torch
import wandb
from torch.utils.data import ConcatDataset

from TrainingInterfaces.Text_to_Spectrogram.PortaSpeech.PortaSpeech import PortaSpeech
from TrainingInterfaces.Text_to_Spectrogram.PortaSpeech.portaspeech_train_loop_arbiter import train_loop
from Utility.corpus_preparation import prepare_fastspeech_corpus
from Utility.path_to_transcript_dicts import *
from Utility.storage_config import MODELS_DIR
from Utility.storage_config import PREPROCESSING_DIR

def run(gpu_id, resume_checkpoint, finetune, model_dir, resume, use_wandb, wandb_resume_id):
    if gpu_id == "cpu":
        os.environ["CUDA_VISIBLE_DEVICES"] = ""
        device = torch.device("cpu")

    else:
        os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
        os.environ["CUDA_VISIBLE_DEVICES"] = f"{gpu_id}"
        device = torch.device("cuda")

    torch.manual_seed(131714)
    random.seed(131714)
    torch.random.manual_seed(131714)

    # IF YOU'RE ADDING A NEW LANGUAGE, YOU MIGHT NEED TO ADD HANDLING FOR IT IN Preprocessing/TextFrontend.py

    print("Preparing")

    if model_dir is not None:
        save_dir = model_dir
    else:
        save_dir = os.path.join(MODELS_DIR, "Crewchief_Jim")  # RENAME TO SOMETHING MEANINGFUL FOR YOUR DATA
    os.makedirs(save_dir, exist_ok=True)

    all_train_sets = list()  # YOU CAN HAVE MULTIPLE LANGUAGES, OR JUST ONE. JUST MAKE ONE ConcatDataset PER LANGUAGE AND ADD IT TO THE LIST.

    english_datasets = list()
    english_datasets.append(prepare_fastspeech_corpus(transcript_dict=build_path_to_transcript_dict_generic_ljspeech("../CrewChiefV4/CrewChiefV4/sounds/"),
                                                      corpus_dir=os.path.join(PREPROCESSING_DIR, "Jim"),
                                                      lang="en"))

    all_train_sets.append(ConcatDataset(english_datasets))

    model = PortaSpeech()
    if use_wandb:
        wandb.init(
            name=f"{__name__.split('.')[-1]}_{time.strftime('%Y%m%d-%H%M%S')}" if wandb_resume_id is None else None,
            id=wandb_resume_id, resume="must" if wandb_resume_id is not None else None)
    print("Training model")
    train_loop(net=model,
               datasets=all_train_sets,
               device=device,
               save_directory=save_dir,
               batch_size=12,  # YOU MIGHT GET OUT OF MEMORY ISSUES ON SMALL GPUs, IF SO, DECREASE THIS.
               eval_lang="en",  # THE LANGUAGE YOUR PROGRESS PLOTS WILL BE MADE IN
               lr=0.00005,
               warmup_steps=500,
               # DOWNLOAD THESE INITIALIZATION MODELS FROM THE RELEASE PAGE OF THE GITHUB OR RUN THE DOWNLOADER SCRIPT TO GET THEM AUTOMATICALLY
               path_to_checkpoint=os.path.join(MODELS_DIR, "PortaSpeech_Meta",
                                               "best.pt") if resume_checkpoint is None else resume_checkpoint,
               path_to_embed_model=os.path.join(MODELS_DIR, "Embedding", "embedding_function.pt"),
               fine_tune=True if resume_checkpoint is None else finetune,
               resume=resume,
               phase_1_steps=5000,
               phase_2_steps=1000,
               use_wandb=use_wandb)
    if use_wandb:
        wandb.finish()

I definitely downloaded the models ahead of time with the downloader script:

ls -l Models/ToucanTTS_Meta/
total 181240
-rw-r--r--. 1 1002460000 1002460000 185582127 Apr 18 13:42 best.pt

The above is during the Toucan-based training run showing that the Toucan meta model was downloaded previously.

This folder contains the two models made from the same dataset. porta is v2.4 and toucan is v2.5: https://drive.google.com/drive/folders/1hw1b86Geqjt7PH4KFnoxx1631RPT6kQQ?usp=sharing

Flux9665 commented 1 year ago

It sounds like it may be a sampling rate issue?

Version 2.5 adds the increased compatibility mode to the read_to_file and read_aloud methods, which outputs 48 kHz 16-bit int PCM audio, because this has the best compatibility across devices and operating systems. The usual audio without the increased compatibility mode is 24 kHz 32-bit float. Did you at any point manually specify sampling rates somewhere?
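One quick way to check which of the two formats a generated file actually has is to inspect it with soundfile; "output.wav" below is a placeholder for a file produced by read_to_file:

```python
import soundfile as sf

info = sf.info("output.wav")  # placeholder path to a generated file
print(info.samplerate, info.subtype)
# Expected per the description above: 48000 / PCM_16 with the increased compatibility mode,
# 24000 / FLOAT without it.
```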

I am now training a model with a greatly reduced parameter count, so that small datasets can be used more easily for finetuning, at the expense of some quality; this will become the new default for version 2.5.

But in my testing, version 2.5 never had any problems finetuning on my little self-recorded 100-sample German testing dataset. So I'm wondering where all these issues come from. For now I turned the increased compatibility mode off by default, in case this has something to do with it (although I don't see how, unless someone manually sets an incorrect sampling rate rather than using the file metadata).

thoraxe commented 1 year ago

> Did you at any point manually specify sampling rates somewhere?

You have the exact training code I used in both cases. When I run the audio generation, I don't set any parameters either; I just use read_texts with the model id, sentence, and output file.
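For reference, a rough sketch of the kind of call described above, assuming the read_texts helper from run_text_to_file_reader.py; the argument names and values here are placeholders and may differ between Toucan versions:

```python
from run_text_to_file_reader import read_texts  # assumed import path for the helper script

read_texts(model_id="Crewchief_Jim",           # placeholder model id
           sentence="Box this lap, box box.",  # placeholder test sentence
           filename="audios/test.wav",         # output file
           device="cuda",
           language="en")
```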

You could even reproduce everything with the scripts that I've shared. The dataset is in the git repo and the training scripts are shared.

Flux9665 commented 1 year ago

I unfortunately don't have time to debug or reproduce this.

I was just wondering whether you opened and saved the file again after it was generated, which is where a sampling rate mistake could have occurred. Since that doesn't seem to be the case, it looks like it's actually the model acting like this and not just a simple bug.

I made a new version, which has finished training, but I haven't had time to test it yet. I hope I can find some time in the next few days to try it out, both for out-of-the-box use and for finetuning. I will update the release with the new code and checkpoint if it works as intended. I still don't see a reason why it suddenly fails like this, so the issue might still be present somewhere in the new version.

Ca-ressemble-a-du-fake commented 1 year ago

Quick feedback from my side: after pausing the cloning for one week and restarting the computer, it works great (v2.5). I will try to improve the dataset with the Adobe API that you advised somewhere else, and try again to see if it improves further.

As explained in another post, the quality degradation problem may have stemmed from the Python virtual environment not being updated cleanly. So now everything looks in order.