Question about custom dataset

NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data

BSD 3-Clause "New" or "Revised" License


Closed by LucasRotsen 4 years ago

LucasRotsen commented 4 years ago

Hi everyone!

Firstly, thank you for the great implementation.

I don't yet understand how I should prepare my data for training, so I'd appreciate it if someone could clarify that for me. My assumptions are:

If I have data from 10 speakers, I should divide it into 2 files in the "filelist" directory (train and val)
Each of those files should contain a representative sample of all speakers
The txt file format should be: path_to_audio|transcripts|speaker_id

Are my assumptions correct?
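The layout described in the assumptions above can be sketched in a few lines of Python. This is only an illustration, not code from the Mellotron repository: the helper names (`split_by_speaker`, `format_line`, `write_filelist`), the `val_fraction` parameter, and the example paths are all hypothetical. It splits each speaker's utterances separately so that both files contain every speaker, then writes one `path_to_audio|transcript|speaker_id` line per utterance.

```python
import random
from collections import defaultdict


def split_by_speaker(entries, val_fraction=0.1, seed=0):
    """Split (audio_path, transcript, speaker_id) tuples into train and val lists.

    Each speaker is split separately, so any speaker with at least two
    utterances appears in both lists. A speaker with a single utterance
    goes entirely to train.
    """
    by_speaker = defaultdict(list)
    for entry in entries:
        by_speaker[entry[2]].append(entry)

    rng = random.Random(seed)
    train, val = [], []
    for speaker in sorted(by_speaker):
        items = list(by_speaker[speaker])
        rng.shuffle(items)
        # Hold out at least one utterance per speaker, but never all of them.
        n_val = max(1, int(len(items) * val_fraction))
        n_val = min(n_val, len(items) - 1)
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val


def format_line(entry):
    """Format one entry as path_to_audio|transcript|speaker_id."""
    path, text, speaker = entry
    if "|" in path or "|" in text:
        raise ValueError("'|' is the field separator and cannot appear in a field")
    return f"{path}|{text}|{speaker}"


def write_filelist(filename, entries):
    """Write one formatted line per entry to a text file."""
    with open(filename, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(format_line(entry) + "\n")


# Hypothetical data: 10 speakers with 20 utterances each.
entries = [
    (f"wavs/speaker{s}_{i:03d}.wav", f"transcript {i}", s)
    for s in range(10)
    for i in range(20)
]
train, val = split_by_speaker(entries)
print(len(train), "training lines,", len(val), "validation lines")
```

Calling `write_filelist` once with `train` and once with `val` would then produce the two files; the output file names are up to you and should match whatever the training configuration points to.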

rafaelvalle commented 4 years ago

Yes, that's a good start! Make sure you trim silences at the beginning and end of each audio file, and that each transcript matches its audio file.
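As a rough illustration of the trimming step, here is a minimal sketch that removes leading and trailing samples below an amplitude threshold. It is not the method used by the Mellotron authors: the function name `trim_silence` and the `threshold` value are hypothetical, and it assumes the audio has already been loaded as a sequence of floats in the range [-1, 1]. In practice an energy-based tool such as `librosa.effects.trim` is a more robust choice than a per-sample threshold.

```python
def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose absolute value is below threshold.

    Quiet samples in the middle of the signal are kept untouched.
    """
    start = 0
    end = len(samples)
    # Advance past quiet samples at the beginning.
    while start < end and abs(samples[start]) < threshold:
        start += 1
    # Step back past quiet samples at the end.
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]


# Hypothetical signal: silence, some audio with a short pause, silence.
signal = [0.0, 0.001, 0.5, -0.3, 0.0, 0.2, 0.0, 0.0]
print(trim_silence(signal))
```

The threshold needs tuning per dataset: too low leaves background noise in place, too high clips quiet onsets and word endings.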

LucasRotsen commented 4 years ago

Thanks for the quick reply, @rafaelvalle!