DigitalPhonetics / IMS-Toucan

Controllable and fast Text-to-Speech for over 7000 languages!
Apache License 2.0

Train a Finnish checkpoint from scratch #164

Closed Annie-Zhou1997 closed 5 months ago

Annie-Zhou1997 commented 5 months ago

If I want to train a Finnish checkpoint from scratch using Toucan, the dataset I am using is css10fi, which has about 10.5 hours of data. Approximately how many steps should I train to achieve good results? I have already trained up to 280k steps, but the quality is still bad and can't match that of the Finnish produced by the pre-trained checkpoint. I look forward to your reply!

Flux9665 commented 5 months ago

280k steps is already a lot; training from scratch on this amount of data should usually be done after 100k steps. The data is also of decent quality, so I don't think that is the issue either. Have you changed any settings, like the batch size or the learning rate?

Annie-Zhou1997 commented 5 months ago

Thank you very much for your prompt reply! I haven't changed any settings, but I'm a bit confused about the training part and I'm not sure whether I made the changes correctly. First, I modified the build_path_to_transcript_dict_css10fi function to point to my actual dataset path:

def build_path_to_transcript_dict_css10fi():
    # maps each audio file's absolute path to its transcript for the CSS10 Finnish data
    path_to_transcript = dict()
    language = "finnish"  # not used below, kept from the original helper
    with open("/scratch/s5480698/fi/transcript.txt", encoding="utf8") as f:
        transcriptions = f.read()
    trans_lines = transcriptions.split("\n")
    for line in trans_lines:
        if line.strip() != "":
            # first pipe-separated field is the relative audio path, third is the transcript
            path_to_transcript[f"/scratch/s5480698/fi/{line.split('|')[0]}"] = \
                line.split("|")[2]
    return limit_to_n(path_to_transcript)
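
For context, the lines in my transcript.txt are pipe-separated the way the function expects; made-up filenames and text, and only the first and third columns are actually read:

sample_0001.wav|Alkuperäinen lause.|Alkuperäinen lause.|4.21
sample_0002.wav|Toinen esimerkkilause.|Toinen esimerkkilause.|3.87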

Then I copied the finetune_example, only changing the dataset part:

finnish_datasets = list()
finnish_datasets.append(prepare_fastspeech_corpus(transcript_dict=build_path_to_transcript_dict_css10fi(),
                                                  corpus_dir=os.path.join(PREPROCESSING_DIR, "CSS10fi"),
                                                  lang="fi", fine_tune_aligner=False, ctc_selection=False))

all_train_sets.append(ConcatDataset(finnish_datasets))

model = ToucanTTS()
if use_wandb:
    wandb.init(
        name=f"{__name__.split('.')[-1]}_{time.strftime('%Y%m%d-%H%M%S')}" if wandb_resume_id is None else None,
        id=wandb_resume_id, resume="must" if wandb_resume_id is not None else None)
print("Training model")
train_loop(net=model,
           datasets=all_train_sets,
           device=device,
           save_directory=save_dir,
           batch_size=12,  # YOU MIGHT GET OUT OF MEMORY ISSUES ON SMALL GPUs, IF SO, DECREASE THIS.
           eval_lang="fi",  # THE LANGUAGE YOUR PROGRESS PLOTS WILL BE MADE IN
           warmup_steps=500,
           lr=1e-5,  # if you have enough data (over ~1000 datapoints) you can increase this up to 1e-3 and it will still be stable, but learn quicker.
           # DOWNLOAD THESE INITIALIZATION MODELS FROM THE RELEASE PAGE OF THE GITHUB OR RUN THE DOWNLOADER SCRIPT TO GET THEM AUTOMATICALLY
           # path_to_checkpoint=os.path.join(MODELS_DIR, "ToucanTTS_Meta", "best.pt") if resume_checkpoint is None else resume_checkpoint,
           path_to_embed_model=os.path.join(MODELS_DIR, "Embedding", "embedding_function.pt"),
           fine_tune=False,  # if resume_checkpoint is None and not resume else finetune,
           resume=resume,
           steps=300000,
           use_wandb=use_wandb)
if use_wandb:
    wandb.finish()

Since I wanted to train from scratch, I commented out the line that loads the meta pre-trained checkpoint. Then I added

from TrainingInterfaces.TrainingPipelines.ToucanTTS_Finnish import run as finnish

and used this command to train:

python3 run_training_pipeline.py finnish --gpu_id 0

I'm not sure exactly where the problem is. Thank you very much!
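
(For completeness: using the name finnish on the command line also means mapping that name to the imported run function next to the other pipelines in run_training_pipeline.py. Roughly like this; the exact name of the mapping in that file may differ:)

# in run_training_pipeline.py, next to the existing entries (sketch)
pipeline_dict = {
    # ... existing pipelines ...
    "finnish": finnish,  # the run function imported above
}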

Flux9665 commented 5 months ago

You did everything correctly, that's all good :)

The problem is most likely that some of the settings in the finetune_example are set up specifically for finetuning.

Try a higher learning rate and more warmup steps when training from scratch. For the learning rate, something between 0.001 and 0.0005 usually works well, and for the warmup steps I would go with a few thousand, maybe 4000.
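
Plugged into the train_loop call from your pipeline above, that would look roughly like this (just a sketch with those ranges, not values I have verified for CSS10-fi):

train_loop(net=model,
           datasets=all_train_sets,
           device=device,
           save_directory=save_dir,
           batch_size=12,
           eval_lang="fi",
           warmup_steps=4000,  # a few thousand instead of 500 when training from scratch
           lr=1e-3,  # anything between 1e-3 and 5e-4 should be stable from scratch
           path_to_embed_model=os.path.join(MODELS_DIR, "Embedding", "embedding_function.pt"),
           fine_tune=False,
           resume=resume,
           steps=300000,
           use_wandb=use_wandb)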

I'm working on a new version that will be released in a few weeks; it might be worth trying again once that's done.

Annie-Zhou1997 commented 5 months ago

Thank you very much for your reply! This afternoon I tried training from scratch on the LJSpeech English dataset, and by 60k steps the results were already quite good. Following your suggestion in the code comments, I changed the learning rate to lr=1e-3 and used the default warmup steps and total training steps in the train_loop. I think there might be some issues with my Finnish dataset, as I noticed many errors in the transcription files. I enabled the CTC selection and manually corrected the dataset, hoping it will be successful this time. I look forward to your new version and wish you success in your work!
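
(For the record, enabling the CTC selection just means flipping this flag in my corpus preparation above; as far as I understand it, this lets the aligner's CTC scores filter out samples where the transcript doesn't match the audio well:)

finnish_datasets.append(prepare_fastspeech_corpus(transcript_dict=build_path_to_transcript_dict_css10fi(),
                                                  corpus_dir=os.path.join(PREPROCESSING_DIR, "CSS10fi"),
                                                  lang="fi", fine_tune_aligner=False,
                                                  ctc_selection=True))  # was False before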

Flux9665 commented 5 months ago

Thanks! I'll close this issue for now and assume that the mislabelled transcripts in the data are the cause. If you find that's not the problem, feel free to re-open the issue.