MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

Creating chunks of Buckeye Corpus #775

Closed shreeshailgan closed 3 months ago

shreeshailgan commented 4 months ago

In the 2017 Interspeech paper, Section 3.1 (Datasets) includes the following sentence:

We thus broke up Buckeye into chunks bounded by non-speech (pauses, noise, interviewer speech) of >150 msec... 

I am confused here. What is the >150 ms filtering applied to?

1] the length of the chunks (end_time of last token - start_time of first token in the chunk), i.e., chunks with duration < 150 msec are discarded

OR

2] the non-speech tokens that are used to separate the chunks, i.e., non-speech tokens with duration < 150 msec are not used to split the ongoing chunk; instead, they are absorbed into it and the chunk continues.

mmcauliffe commented 3 months ago

The second. Non-speech labels (pauses, noise, interviewer speech, etc.) are used as boundaries for utterances if they're greater than 150 ms in duration; otherwise the surrounding speech is combined into a single utterance. These labels aren't passed along, because MFA does the silence modeling without needing the explicit labels.

You can see the current version of the script for creating the benchmark dataset here: https://github.com/MontrealCorpusTools/mfa-models/blob/main/scripts/alignment_benchmarks/data_prep/create_buckeye_benchmark.py
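
As a rough illustration of the rule described above (this is not the linked script), here is a minimal sketch. It assumes tokens arrive as `(label, start, end)` tuples with times in seconds, and that the set of non-speech labels is supplied by the caller. The function name `chunk_tokens`, the token layout, and the placeholder labels `PAUSE` and `NOISE` are all hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical token layout: (label, start_sec, end_sec)
Token = Tuple[str, float, float]


def chunk_tokens(
    tokens: List[Token],
    non_speech: Set[str],
    min_gap: float = 0.150,
) -> List[List[Token]]:
    """Split tokens into utterances at non-speech tokens longer than min_gap.

    Shorter non-speech tokens do not split the utterance. In both cases the
    non-speech label itself is dropped from the output.
    """
    chunks: List[List[Token]] = []
    current: List[Token] = []
    for label, start, end in tokens:
        if label in non_speech:
            # Only a long enough non-speech token closes the ongoing chunk.
            if end - start > min_gap and current:
                chunks.append(current)
                current = []
            continue  # non-speech labels are never passed along
        current.append((label, start, end))
    if current:
        chunks.append(current)
    return chunks


# Placeholder labels and timings, for illustration only.
example = [
    ("hello", 0.0, 0.4),
    ("PAUSE", 0.4, 0.5),   # 100 ms: too short, does not split
    ("there", 0.5, 0.9),
    ("NOISE", 0.9, 1.4),   # 500 ms: splits here
    ("okay", 1.4, 1.8),
]
utterances = chunk_tokens(example, non_speech={"PAUSE", "NOISE"})
```

With this input, `hello` and `there` end up in one utterance and `okay` in a second one. The sketch treats each non-speech token on its own, so several short non-speech tokens in a row would not be merged into one longer gap; whether the benchmark script does that is not stated in this thread.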