Tomiinek / Blizzard2013_Segmentation

Transcripts and segmentation for the Blizzard 2013 audiobooks, also known as the Lessac or Blizzard 2013 dataset.

What kinds of algorithms have you used to segment such long audios? #1

Open sungjae-cho opened 4 years ago

sungjae-cho commented 4 years ago

What kinds of algorithms have you used to segment such long audio files? A forced aligner could have limitations when segmenting a long audio file at once.

Tomiinek commented 4 years ago

Hello! :slightly_smiling_face:

All books were already split at the chapter level. Sentences of these chapters were then aligned using the Aeneas forced aligner. However, the output alignments were not precise, so I extracted (very short) silent intervals using ffmpeg and shifted the start and end of each alignment to match them (allowing only shifts shorter than a threshold).
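The snapping step was roughly this idea (a minimal sketch, not my exact script; the thresholds and file names are just illustrative):

```python
import re
import subprocess

def detect_silences(audio_path, noise_db=-35, min_dur=0.05):
    """Run ffmpeg's silencedetect filter and return (start, end) pairs in seconds."""
    cmd = [
        "ffmpeg", "-i", audio_path,
        "-af", f"silencedetect=noise={noise_db}dB:d={min_dur}",
        "-f", "null", "-",
    ]
    # silencedetect reports its findings on stderr
    out = subprocess.run(cmd, capture_output=True, text=True).stderr
    starts = [float(m) for m in re.findall(r"silence_start: ([\d.]+)", out)]
    ends = [float(m) for m in re.findall(r"silence_end: ([\d.]+)", out)]
    return list(zip(starts, ends))

def snap_boundary(t, silences, max_shift=0.5):
    """Move a sentence boundary to the midpoint of the nearest silence,
    but only if the required shift is below a threshold."""
    candidates = [(s + e) / 2.0 for s, e in silences]
    if not candidates:
        return t
    nearest = min(candidates, key=lambda c: abs(c - t))
    return nearest if abs(nearest - t) <= max_shift else t
```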

The alignments are still not 100% precise, especially because of short sentences such as "Yes!" etc. Actually, I have never used the data for training, but I would validate them, for example by running an ASR model on each segment and checking the Character Error Rate between the expected text and the ASR output.
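Something along these lines (a sketch only; `transcribe` stands in for whatever ASR model you use, and the threshold is an arbitrary placeholder):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance normalized by reference length."""
    r, h = reference, hypothesis
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(r)][len(h)] / max(len(r), 1)

def flag_suspicious(segments, transcribe, threshold=0.3):
    """Return (audio_path, text) pairs whose ASR transcript disagrees too much
    with the expected text. `transcribe` is any ASR function (placeholder)."""
    return [
        (audio, text)
        for audio, text in segments
        if cer(text.lower(), transcribe(audio).lower()) > threshold
    ]
```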

sungjae-cho commented 4 years ago

Thank you for your reply! :)

What I wanted to know was what inputs were given to Aeneas. Did you feed a whole chapter audio file with its script to Aeneas? Or how did you split the chapter audio and scripts?

Tomiinek commented 4 years ago

Oh, I see. To be honest, I am not sure as it was quite a long time ago, but I think I really fed it whole chapter audio files. There are not many chapters, so the chapter-level splitting can be done manually. I also remember that I removed some parts of the chapters of The Jungle Book, as they often contained very expressive speech or singing and the alignment was totally bad. The scripts were prepared with a bunch of regexps and a lot of manual work.
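For reference, chapter-level alignment with the Aeneas Python API looks roughly like this (the paths and the configuration string are placeholders, not the exact setup I used):

```python
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# one text fragment per line of the script file
config_string = "task_language=eng|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config_string)
task.audio_file_path_absolute = "/abs/path/chapter_01.wav"      # whole chapter audio
task.text_file_path_absolute = "/abs/path/chapter_01.txt"       # chapter script
task.sync_map_file_path_absolute = "/abs/path/chapter_01.json"  # output alignment

ExecuteTask(task).execute()   # run the alignment
task.output_sync_map_file()   # write the sentence-level sync map to JSON
```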

sungjae-cho commented 4 years ago

I would naively expect that feeding a whole chapter's audio and script at once could be demanding and difficult for forced aligners.

I have used your segmentation and trained Tacotron 2 on the resulting segmented data. I found that for some samples it was impossible to learn a monotonically increasing attention alignment. I looked into those samples and found that their audio had completely mismatched scripts; they amounted to at least 9 hours of data. I just wanted to let you know.

I really wanted to get a clean, large Blizzard Challenge 2013 dataset, but it seems inevitable to involve human annotators. Don't you agree?

Tomiinek commented 4 years ago

I am sorry to hear that, but I would expect something like this. A cross-validation of the data is really needed in this case. I am a little bit busy now and focused on a different research field, but if you would like to improve the alignments etc., you are welcome to!

You can contact @mueller91; he validated the data and successfully used it to train a GST model. His samples are great!

Yes, I think so. I used some multi-lingual TTS datasets last year and their quality was really low. There were problems with alignments, script normalization, etc. In the end, I had to clean them heavily to make them usable (see this repo). I think that data size is generally not so critical in TTS. However, the advantage of the large Blizzard 2013 dataset is that it is single-speaker and includes a huge variety of expressive speech. IMHO, even if you remove the odd samples and cross-validate the dataset, you will end up with a dataset that is better for expressive TTS research than, for example, LJ Speech.

a-froghyar commented 3 years ago

@sungjae-cho @Tomiinek I have experimented with Aeneas on very long audio before, and I've realised that alignment works almost perfectly as long as the audio is at most 60 minutes. I actually need to segment the Blizzard dataset for my current work and will open a PR here or implement the segmentation tool in a fork. Thanks for your work @Tomiinek!
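For example, one could cut each chapter into sub-hour chunks before running Aeneas, roughly like this (paths and chunk length are placeholders, and the segment muxer cuts at fixed times rather than at silences):

```python
import subprocess

def split_audio(audio_path, out_pattern="chunk_%03d.wav", chunk_seconds=3000):
    """Cut a long recording into ~50-minute pieces so each one stays well
    under the 60-minute limit; for cleaner cuts, combine with the silence
    detection shown earlier in this thread."""
    subprocess.run([
        "ffmpeg", "-i", audio_path,
        "-f", "segment", "-segment_time", str(chunk_seconds),
        "-c", "copy", out_pattern,
    ], check=True)
```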

Tomiinek commented 3 years ago

Cool! You are welcome, do not hesitate to contribute :slightly_smiling_face:

sungjae-cho commented 3 years ago

@a-froghyar Looking forward to it!