CorentinJ / librispeech-alignments

Word alignments generated by the Montreal Forced Aligner for the Librispeech dataset

LibriSpeech Alignments

This repo contains links to download word alignments for LibriSpeech, generated using the Montreal Forced Aligner.

The following datasets are included:

Download links

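The prepared alignments come in two formats: a simple TXT format (the first file) and the raw TextGrid files (the second file). Both are described in the format sections below.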

Once downloaded, merge the extracted LibriSpeech directory into the original LibriSpeech dataset. Only the directory structure is merged; no existing files should be overwritten in the process.
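The merge step can be sketched in Python as follows. This is a minimal illustration and not part of this repo: the function name `merge_alignments` and both path arguments are hypothetical, and it assumes the archive has already been extracted. It copies each alignment file into the matching location of the dataset and stops if a file of the same name already exists, so nothing is overwritten.

```python
import shutil
from pathlib import Path


def merge_alignments(alignments_root: Path, librispeech_root: Path) -> int:
    """Copy every file under alignments_root to the same relative path under
    librispeech_root, refusing to overwrite. Returns the number of files copied."""
    copied = 0
    for src in sorted(alignments_root.rglob("*")):
        if src.is_dir():
            continue
        dst = librispeech_root / src.relative_to(alignments_root)
        if dst.exists():
            # The merge should only add files, never replace existing ones.
            raise FileExistsError(f"Refusing to overwrite {dst}")
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        copied += 1
    return copied
```

For example, `merge_alignments(Path("downloads/LibriSpeech"), Path("data/LibriSpeech"))`, where both paths are placeholders for your own locations.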

Warning: both archives contain a set of unaligned utterances (see unaligned.txt). No alignment is present for these files, so take that into account in your parsers. There are 46 unaligned utterances in the raw TextGrid alignments (from failures of the model) and 127 in the simple format alignments (46 carried over from the first and 81 whose text could not be matched to the ground truth text). I wouldn't bother trying to annotate those in another way, as they still account for less than ~0.05% of the whole dataset.

Format (TXT alignments)

[If you downloaded the first file]: In each chapter directory you will find a .alignment.txt file, e.g.:

LibriSpeech/dev-clean/84/121123:
 - 84-121123.alignment.txt
 - 84-121123.trans.txt
 - 84-121123-0000.flac
 - 84-121123-0001.flac
 - ...

Each line starts with the utterance id, followed by the ground truth words and finally the end time for each word. E.g.:

84-121123-0000 ",GO,,DO,YOU,HEAR," "0.490,0.890,1.270,1.380,1.490,1.890,2.09" 
84-121123-0001 ",BUT,IN,LESS,THAN,FIVE,MINUTES,THE,STAIRCASE,GROANED,BENEATH,AN,EXTRAORDINARY,WEIGHT," "0.270,0.420,0.530,0.730,0.870,1.100,1.460,1.580,2.080,2.490,2.780,2.860,3.470,3.830,3.99" 
...

If an utterance was not aligned (see unaligned.txt), there will simply be no line for it.

The list of words and the list of end times are each surrounded by double quotes, and the items are separated by commas. Silences are represented as empty words. For example, in the first sentence there is a silence from 0s to 0.49s, and the word 'GO' is pronounced from 0.49s to 0.89s. For parsing convenience, each sentence is guaranteed to start and end with a silence, even if its duration is 0.
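A line in this format could be parsed along the following lines. This is a minimal sketch, not code from this repo: the `Word` tuple and the `parse_alignment_line` function are hypothetical names, and the start time of each word is assumed to be the end time of the previous one (0 for the first), as in the example above.

```python
from typing import List, NamedTuple, Tuple


class Word(NamedTuple):
    text: str     # empty string means silence
    start: float  # seconds
    end: float    # seconds


def parse_alignment_line(line: str) -> Tuple[str, List[Word]]:
    """Split one line of a .alignment.txt file into (utterance id, words)."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"Expected 3 space-separated fields, got {len(fields)}")
    utterance_id, words_field, times_field = fields

    # Both lists are wrapped in double quotes and comma-separated.
    texts = words_field.strip('"').split(",")
    end_times = [float(t) for t in times_field.strip('"').split(",")]
    if len(texts) != len(end_times):
        raise ValueError("Number of words and number of end times differ")

    # Assumption: each word starts where the previous one ends.
    start_times = [0.0] + end_times[:-1]
    words = [Word(t, s, e) for t, s, e in zip(texts, start_times, end_times)]
    return utterance_id, words
```

To drop the silences, keep only the entries whose `text` is non-empty. Utterances listed in unaligned.txt have no line at all, so a lookup by utterance id should tolerate missing keys.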

Format (TextGrid alignments)

[If you downloaded the second file]: For each utterance you will find a .TextGrid file, e.g.:

LibriSpeech/dev-clean/84/121123:
 - 84-121123.trans.txt
 - 84-121123-0000.flac
 - 84-121123-0000.TextGrid
 - 84-121123-0001.flac
 - 84-121123-0001.TextGrid
 - ...

If an utterance was not aligned (see unaligned.txt), the corresponding .TextGrid file will be missing.
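Because some .TextGrid files are missing, a loader should pair each audio file with its alignment only when the alignment exists. Here is a minimal sketch, assuming the directory layout shown above; the function name `iter_aligned_utterances` is hypothetical, and the sketch only locates the files without parsing the TextGrid contents.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_aligned_utterances(chapter_dir: Path) -> Iterator[Tuple[Path, Path]]:
    """Yield (flac path, TextGrid path) for each utterance that has an alignment."""
    for flac in sorted(chapter_dir.glob("*.flac")):
        # The alignment shares the utterance id and differs only by extension.
        textgrid = flac.with_suffix(".TextGrid")
        if textgrid.exists():
            yield flac, textgrid
        # Unaligned utterances are skipped silently.
```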

You can read about the TextGrid format here.