NVIDIA / tacotron2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference
BSD 3-Clause "New" or "Revised" License

Training on corpus other than LJ Dataset #121

Closed yxt132 closed 5 years ago

yxt132 commented 5 years ago

This is great work! I was able to obtain amazing results using the LJ Speech Dataset. I used the default hyperparameter settings and WaveGlow.

However, I am curious how it performs on other training corpora, so I tested it on the VCTK corpus (https://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html) and failed to get good results. All I am getting is noise, even after 100,000+ epochs. I am wondering if it is because each speaker reads far fewer sentences in this corpus (about 400 sentences from each of 109 speakers). I tried a single speaker as well as all speakers, and all the results are noise (I also trained the WaveGlow model for the same speaker). I am relatively new to this field, so maybe I am missing something. Would you mind sharing some thoughts on what could improve the results? Do I need to change the code to make it work for multiple speakers? What are your recommendations for the hyperparameters?

Or should I use another publicly available synthesis training corpus? Any recommendation? Thank you very much!

MotorCityCobra commented 5 years ago

400 is probably way too small. I've been training on an M-AILABS Speech Dataset subset of audiobooks read by Judy_Bieber, which is 15,700 utterances, and it's not working out for me.
I thought I had matched the LJ setup, but after 90,000 iterations it is gibberish, just like it was at 1,000 iterations. (The LJ set could speak fairly well at 90,000.)
Has anyone gotten this to train on a different voice? Please post your checkpoint.

rafaelvalle commented 5 years ago

When training on other datasets, and assuming you have at least, say, 14 hours of audio: make sure the audio matches the transcription, trim the silence at the beginning and end of each audio file, and make sure the symbols used in the language are in the list of symbols.

With smaller datasets one will need to make modifications to the model to prevent it from overfitting. For example, reducing the number of dimensions of the prenet layers can help.
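
A minimal sketch of that kind of change, assuming the stock hparams.py in this repo (which exposes create_hparams() and a prenet_dim entry); the value 128 below is illustrative, not a tuned recommendation:

# sketch: build hyperparameters with a narrower prenet for a small dataset
from hparams import create_hparams

hparams = create_hparams()
hparams.prenet_dim = 128   # stock value is 256; fewer parameters means less capacity to overfit
# ...then pass hparams to the usual training entry point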

MotorCityCobra commented 5 years ago

make sure the audio matches the transcription

I did spot checks on this. The M-AILABS sets come with CSVs, which I put through pandas to combine and randomly shuffle, then save to txt files. The random checks I did on the final txt files and the final wav paths used by the model check out.

I didn't know you would see this here. I opened another issue for this with more detail; I will check the silence before and after each utterance and close that issue if that turns out to be the remedy. This is what I will spend tomorrow doing.

Lastly: can you clarify what you mean by the symbols matching? The transcripts of the utterances in the M-AILABS txts are pretty clean, and I made sure to keep the '|' separator and otherwise match the txt format.
I wonder if there is some kind of encoding difference going on.

MotorCityCobra commented 5 years ago

(screenshot, 2019-02-06: waveforms of M-AILABS and LJ wavs compared side by side)

Yep. The M-AILABS wavs are the ones in the back window and LJ is in the front.
I'll have to write a function to trim 0.5 seconds off the front of a few tens of thousands of wav files.
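
A rough sketch of such a batch trim, assuming 16-bit PCM wavs readable by scipy; the folder names are illustrative and the 0.5 s figure is the one from the comment above:

# sketch: chop a fixed 0.5 s off the front of every wav in a folder
import os
from scipy.io import wavfile

SRC = "wavs"              # hypothetical input folder
DST = "wavs_trimmed"      # hypothetical output folder
TRIM_SECONDS = 0.5

os.makedirs(DST, exist_ok=True)
for name in os.listdir(SRC):
    if not name.endswith(".wav"):
        continue
    rate, data = wavfile.read(os.path.join(SRC, name))
    start = int(rate * TRIM_SECONDS)                      # samples to drop from the front
    wavfile.write(os.path.join(DST, name), rate, data[start:])

A fixed cut only works if the leading silence really is the same length everywhere; the amplitude-based ffmpeg approach suggested in the next comment is more robust.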

rafaelvalle commented 5 years ago

Use ffmpeg instead, because it trims based on amplitude. Also, it looks like there's some amplitude (DC) offset in your M-AILABS wavs. Notice that the silence sits above or below 0...

find * -type f -name "*.wav" -exec ffmpeg -i "{}" -af silenceremove=1:0:-80dB "trimmed_{}" \;
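
For reference, and as I understand ffmpeg's silenceremove filter, the positional arguments 1:0:-80dB correspond to start_periods=1, start_duration=0, start_threshold=-80dB, i.e. audio below -80 dB is stripped from the start of each file; trailing silence is left untouched.
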
rafaelvalle commented 5 years ago

It means making sure that the symbols used in the language are present in the list of symbols. If you're using English, the current symbol list should be fine.
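
A minimal sketch of that check, assuming the repo's text/symbols.py (which exposes a symbols list) and a pipe-separated path|transcript filelist; the file name is illustrative, and note this looks at the raw text before any cleaner runs:

# sketch: report transcript characters that are not in the model's symbol set
from text.symbols import symbols   # symbol list shipped with this repo

allowed = set(symbols)
unknown = set()
with open("metadata_train.txt", encoding="utf-8") as f:   # hypothetical filelist
    for line in f:
        fields = line.rstrip("\n").split("|")
        transcript = fields[-1]                            # last field is the text
        unknown.update(ch for ch in transcript if ch not in allowed)

print("characters missing from the symbol list:", sorted(unknown))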

MotorCityCobra commented 5 years ago

amplitude offset in your M-AILABS wavs.

Oh, I didn't notice that. I wonder whether that is a problem and whether it is fixable, since the offset doesn't look like it is off by a consistent amount.
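
If the offset really is just a constant bias per file, a rough sketch of centering it, assuming 16-bit PCM wavs; the file names are illustrative:

# sketch: subtract the per-file mean so silence sits at 0 again
import numpy as np
from scipy.io import wavfile

rate, data = wavfile.read("utterance.wav")                 # hypothetical file
samples = data.astype(np.float32)
centered = samples - samples.mean()                        # remove constant (DC) offset
centered = np.clip(centered, -32768, 32767).astype(np.int16)
wavfile.write("utterance_centered.wav", rate, centered)

Mean subtraction only handles a constant offset; if the bias drifts within a file, a gentle high-pass filter is the more usual fix.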

You mean the set of characters in the txt transcripts of the wav files? I brought everything in with pandas using

all1 = pd.read_fwf('/path/to/txt/metadata', header=None)

and then concatenated and shuffled. The field before the '|' is read as the path to the wav each time (otherwise it would throw an error), so I assume the text after it is what the wav says.
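
A small sketch of building one combined, shuffled filelist from several pipe-delimited metadata files; pd.read_csv with sep='|' is likely a better fit than read_fwf (which is for fixed-width columns), and the paths below are illustrative:

# sketch: merge several path|text metadata files, shuffle, write one filelist
import csv
import glob
import pandas as pd

frames = [
    pd.read_csv(path, sep="|", header=None, quoting=csv.QUOTE_NONE)
    for path in glob.glob("metadata/*.csv")                # hypothetical location
]
combined = pd.concat(frames, ignore_index=True)
combined = combined.sample(frac=1, random_state=0)         # random shuffle
combined.to_csv("train_filelist.txt", sep="|", header=False, index=False,
                quoting=csv.QUOTE_NONE)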

adimukewar commented 5 years ago

I am trying to fine-tune the model on a relatively small dataset [NEU Emotional Speech Corpus]. Is it possible to share the checkpoint file from the complete LJSpeech training? Thanks.

rafaelvalle commented 5 years ago

@adimukewar we updated the code and repo so that people can train their models starting from our pre-trained models.
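
For reference, a rough sketch of that kind of warm start in plain PyTorch, assuming the repo's Tacotron2 class and create_hparams(), and assuming the released checkpoint stores its weights under a 'state_dict' key (the path below is illustrative; train.py also has its own warm-start option):

# sketch: initialize Tacotron 2 from a pre-trained checkpoint before fine-tuning
import torch
from hparams import create_hparams
from model import Tacotron2

hparams = create_hparams()
model = Tacotron2(hparams).cuda()

ckpt = torch.load("tacotron2_statedict.pt", map_location="cpu")
state = ckpt.get("state_dict", ckpt)        # handle either packaging convention
model.load_state_dict(state)
# ...then continue training on the new dataset, typically with a lowered learning rate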

feddybear commented 5 years ago

I would like to reopen this issue to ask about other languages that use a different character set. What are the things we have to double-check?

For example, I noticed that by default the text cleaner is set to english_cleaners, but there is also transliteration_cleaners, which caters to non-English text. Can someone elaborate or point me to an existing discussion on this? Thanks.

rafaelvalle commented 5 years ago

@feddybear see https://github.com/NVIDIA/tacotron2/issues/106. You can also create text cleaners for your own data.
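
For a language with a different character set, a minimal sketch of a custom cleaner in the style of text/cleaners.py; the function name is hypothetical, unidecode is the package transliteration_cleaners already relies on, and every character the cleaner can emit still has to exist in text/symbols.py:

# sketch: a custom cleaner to add to text/cleaners.py
import re
from unidecode import unidecode

_whitespace_re = re.compile(r"\s+")

def my_language_cleaners(text):
    """Hypothetical pipeline: transliterate to ASCII, lowercase, collapse whitespace."""
    text = unidecode(text)
    text = text.lower()
    text = _whitespace_re.sub(" ", text)
    return text

It would then be selected by naming it in the text_cleaners hyperparameter (e.g. text_cleaners=['my_language_cleaners']).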