Not getting alignment properly

hongseoi commented 1 month ago

Hi! I trained Tacotron 2 for more than 60,000 steps, but I still cannot get a proper alignment. The alignment graph is shown below. Does anyone know the cause of this?

[Image: alignment chart]

I'm training on 100 samples of elderly voice data selected from the Common Voice dataset.

Training performance was not good in previous attempts, so I looked through other issues and tried the suggestions I found there.

If the batch size is reduced, the learning rate must also be reduced:

    use_saved_learning_rate=False,
    learning_rate=0.25*1e-3,
    weight_decay=1e-6,
    grad_clip_thresh=1.0,
    batch_size=16, #64
    mask_padding=True  # set model's padded outputs to padded values
)
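
The numbers in the snippet above are consistent with scaling the learning rate linearly with the batch size. The sketch below shows that arithmetic. The base values (learning rate 1e-3, batch size 64) are inferred from the `0.25*1e-3` expression and the `#64` comment, and `scale_learning_rate` is a hypothetical helper, not part of the repository. Linear scaling is a common heuristic, not a guarantee of good training.

```python
def scale_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Scale the learning rate in proportion to the batch size (heuristic)."""
    return base_lr * new_batch_size / base_batch_size


# Assumed base settings, inferred from the hparams snippet above.
base_lr = 1e-3
base_batch_size = 64

# Batch size used in this issue.
new_batch_size = 16

lr = scale_learning_rate(base_lr, base_batch_size, new_batch_size)
print(lr)
```

Going from a batch size of 64 to 16 is a factor of 0.25, which matches the `0.25*1e-3` written in the snippet.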

padding_idx=0 should be passed as an argument to the embedding layer.

But sadly, neither change worked.

hongseoi commented 1 month ago

Use SoX to remove silence from the audio files. It's not a complete success yet, but it did bring some improvement.

import subprocess
import os
import glob

def remove_silence(input_file, output_file):
    try:
        # SoX: trim silence at the start, reverse, trim again (the original tail), reverse back
        subprocess.run([
            'sox', input_file, output_file, 'silence', '2', '0.1', '1%', 'reverse', 'silence', '2', '0.1', '1%', 'reverse'
        ], check=True)
        print(f'Successfully removed silence from {input_file} and saved to {output_file}')
    except subprocess.CalledProcessError as e:
        print(f'Error occurred: {e}')

def process_folder(input_folder, output_folder):
    # mkdir output folder
    os.makedirs(output_folder, exist_ok=True)

    # process all of the wav files in the input_folder
    for wav_file in glob.glob(os.path.join(input_folder, '*.wav')):
        file_name = os.path.basename(wav_file)
        output_wav = os.path.join(output_folder, file_name)
        remove_silence(wav_file, output_wav)

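# '~' is not expanded by glob or os.makedirs, so expand it explicitly
input_folder = os.path.expanduser('~/data/train')
output_folder = os.path.expanduser('~/data/processed_train')

process_folder(input_folder, output_folder)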

hongseoi commented 3 weeks ago

[Image: screenshot]

It turned out to be a really simple problem:

Resample your audio to 22050 Hz, because the data used for the pretrained model has a sample rate of 22050 Hz and the pretrained model is tuned to that rate.
Check your audio bit depth, and change max_wav_value in hparams.py accordingly when you change the sample rate.
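
The two checks above can be sketched with the standard-library `wave` module. This is a minimal sketch, assuming PCM WAV input with signed integer samples (16-bit or wider). The helper names `inspect_wav`, `max_wav_value_for`, and `sox_resample_command` are hypothetical. The SoX command is only built here, not executed, and uses the `-r` (sample rate) and `-b` (bits per sample) output options.

```python
import wave

TARGET_SAMPLE_RATE = 22050  # sample rate named in the comment above


def inspect_wav(path):
    """Return (sample_rate, bit_depth) of a PCM WAV file."""
    with wave.open(path, 'rb') as wav:
        return wav.getframerate(), wav.getsampwidth() * 8


def max_wav_value_for(bit_depth):
    """Full-scale magnitude of signed integer PCM, e.g. 32768.0 for 16-bit."""
    return float(2 ** (bit_depth - 1))


def sox_resample_command(input_file, output_file,
                         sample_rate=TARGET_SAMPLE_RATE, bit_depth=16):
    """Build a SoX command that resamples and sets the output bit depth."""
    return ['sox', input_file,
            '-r', str(sample_rate), '-b', str(bit_depth), output_file]


def needs_resampling(path):
    """True if the file's sample rate differs from the target rate."""
    sample_rate, _ = inspect_wav(path)
    return sample_rate != TARGET_SAMPLE_RATE
```

For example, `max_wav_value_for(16)` gives 32768.0, so 16-bit audio would pair with `max_wav_value=32768.0`. The list returned by `sox_resample_command` can be passed to `subprocess.run(..., check=True)` in the same way as the silence-removal script earlier in this thread.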

hongseoi commented 3 weeks ago

https://www.semanticscholar.org/reader/57c38167e0fa7c045c7fa6d9783216c7d725f6ad

NVIDIA / tacotron2

Not getting alignment properly #628