NVIDIA / tacotron2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference
BSD 3-Clause "New" or "Revised" License

Tacotron2 Issues with Inference and using a Custom Dataset #542

Closed conceptofmind closed 4 months ago

conceptofmind commented 2 years ago

I am running into an issue when training both from scratch and from the pre-trained Tacotron 2 model.

I have collected 14 to 17 hours of pre-processed wav files of Obama speaking. Each file was first normalized with ffmpeg-normalize and then resampled to the recommended 22050 Hz.
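Since mismatched audio formats are a common cause of bad training runs, it may help to verify that every file actually matches what the default Tacotron 2 hparams expect. Below is a minimal sketch using only the Python standard library; the expected sample rate (22050 Hz) comes from the paragraph above, and mono audio is an assumption based on the repo's defaults.

```python
import wave

# Assumed target format: 22050 Hz mono, per the resampling step described above.
EXPECTED_SR = 22050

def check_wav(path, expected_sr=EXPECTED_SR):
    """Return (ok, sample_rate, channels) for a PCM wav file.

    ok is True only if the file is mono and at the expected sample rate.
    """
    with wave.open(path, "rb") as f:
        sr = f.getframerate()
        ch = f.getnchannels()
    return (sr == expected_sr and ch == 1), sr, ch
```

To scan a whole dataset directory, one could loop over `glob.glob("wavs/*.wav")` and print any file where `ok` is False.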

I have ensured that:

Here is a link to a drive containing the wav files for inspection:

https://drive.google.com/drive/folders/17RoPoNhcU6ovW0BBkONt3WEXf6ZvuUwF?usp=download

Here is a link to both of the formatted .txt files (train and val):

Train .txt file: https://drive.google.com/file/d/1dxTkagpAT43jP06QAeODWS92GmuqdPqz/view?usp=sharing Validation .txt file: https://drive.google.com/file/d/1dtaHPWTFdXLM1QdOVb2V9H2a_VMKVWRg/view?usp=sharing

I formatted the .txt files in the same way as the LJSpeech dataset. I used wav2vec 2.0 to generate the transcriptions. I made sure that any spaces at the start and end of each transcription are removed, and that a period is added to the end of each transcript. Each entry is on its own line.
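The cleanup steps described above can be sketched as a small helper. This assumes the LJSpeech-style filelist format used by this repo's `filelists/*.txt` files, i.e. `<wav path>|<transcript>` with one entry per line; the function name and example path are hypothetical.

```python
def make_filelist_line(wav_path, transcript):
    """Format one LJSpeech-style filelist entry: <wav path>|<transcript>.

    Strips leading/trailing whitespace and appends a period if the
    transcript has no terminal punctuation, as described above.
    """
    text = transcript.strip()
    if not text.endswith((".", "!", "?")):
        text += "."
    return f"{wav_path}|{text}"

# Example (hypothetical filename):
# make_filelist_line("wavs/obama_0001.wav", "  hello world ")
# -> "wavs/obama_0001.wav|hello world."
```

Writing one such line per utterance with `"\n".join(...)` reproduces the train/val .txt layout described above.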

The train.py script will run. The directory paths and naming conventions are correct.

This is what the inference plots look like during training at epochs 0, 50, 100, and 250:

Epoch 0:

[inference plot image]

Epoch 50:

[inference plot image]

Epoch 100:

[inference plot image]

Epoch 250:

[inference plot image]

Is this how the charts should look? Any help would be appreciated!

@CookiePPP Any input on this?