acetylSv / GST-tacotron

Reproducing Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis (https://arxiv.org/pdf/1803.09017.pdf)

questions about the datasets. #2

Open · zyj008 opened this issue 6 years ago

zyj008 commented 6 years ago

Hello! I have some questions about the BZ dataset. Did you do any data preprocessing on the BZ dataset before training the model, such as breaking long sentences into smaller segments? Some sentences in the BZ dataset are much longer than the sentences in LJ.

acetylSv commented 6 years ago

Hi, I plotted the character length of each line in the transcription as a histogram (plot not shown here). Based on it, I decided to discard sentences whose character length is > 300.
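
For reference, a minimal sketch of that kind of filtering. The `metadata.csv` file name and the pipe-separated `filename|transcription` format are assumptions for illustration, not the repo's actual preprocessing code:

```python
import matplotlib.pyplot as plt

MAX_CHARS = 300  # cutoff chosen from the histogram

# Assumed format: one "filename|transcription" pair per line.
with open("metadata.csv", encoding="utf-8") as f:
    lines = [l.strip().split("|", 1) for l in f if "|" in l]

lengths = [len(text) for _, text in lines]

# Histogram of character lengths per line, as described above.
plt.hist(lengths, bins=50)
plt.xlabel("characters per transcription line")
plt.ylabel("count")
plt.savefig("char_length_hist.png")

# Discard sentences longer than the cutoff.
kept = [(fname, text) for fname, text in lines if len(text) <= MAX_CHARS]
print(f"kept {len(kept)} / {len(lines)} lines")
```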

zyj008 commented 6 years ago

Thanks for your reply! I still have some other questions. How many hours of the BZ dataset did you use for training? I found it hard to converge well when training on a dataset of about 100 hours; the alignment is usually a fuzzy slash.
[attention alignment plot] This image shows the alignment at 140k steps, batch_size=24, num_gpu=4. Do you have any ideas or advice for me? Could you share the tfevents file for your pretrained model? Thank you!

acetylSv commented 6 years ago

I used only the segmented part of the Blizzard-2013 dataset, which contains 9742 files, about 20 hrs in total, so I'm not sure what will happen when switching to the bigger one. In my experience, the attention plot somehow "suddenly" learns to align well at around 40K steps (batch_size=32). Maybe the maximum length of your training pairs is set too long?
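
If it helps, here is a hedged sketch of capping training-pair lengths before batching. The cap names and values (`MAX_TEXT_LEN`, `MAX_MEL_FRAMES`) are hypothetical, not this repo's actual hyperparameters:

```python
import numpy as np

# Hypothetical caps; the repo's real hyperparameter names/values may differ.
MAX_TEXT_LEN = 300     # characters, matching the cutoff above
MAX_MEL_FRAMES = 1000  # mel-spectrogram frames; an assumed value

def keep_pair(text, mel):
    """Keep only pairs short enough for attention to align within memory."""
    return len(text) <= MAX_TEXT_LEN and mel.shape[0] <= MAX_MEL_FRAMES

# Toy example: (transcription, mel spectrogram) pairs.
pairs = [("short sentence.", np.zeros((400, 80))),
         ("x" * 500,         np.zeros((800, 80)))]
filtered = [p for p in pairs if keep_pair(*p)]
print(f"kept {len(filtered)} / {len(pairs)} pairs")
```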