Clarity regarding vocab_file parameter in the config

NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP

https://nvidia.github.io/OpenSeq2Seq

Apache License 2.0


Clarity regarding vocab_file parameter in the config #445

Status: Closed (shiv6146 closed this issue 5 years ago)

shiv6146 commented 5 years ago

@blisc @borisgin What exactly does vocab_file mean? Should it be changed when different datasets are used or does it remain the same (27 symbols) irrespective of the dataset used? Does it make sense to use the trie_vocab.txt obtained after download_lm.sh for Librispeech to get better results?

vsl9 commented 5 years ago

vocab_file is actually an alphabet, that is, the set of characters that a model can emit at each time step. Normalized LibriSpeech has 28 symbols: 26 English letters + apostrophe + space. A CTC acoustic model has one additional symbol, the CTC blank, so it has 29 outputs in total. Of course, you can extend this alphabet for other datasets (see for example https://github.com/NVIDIA/OpenSeq2Seq/issues/443). trie_vocab.txt is something different: it is the set of all words in LibriSpeech, which is required for building a prefix tree (trie). It is a word list, not an alphabet.
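To make the distinction concrete, here is a minimal sketch in plain Python. It is not OpenSeq2Seq code: the helper names (build_alphabet, build_trie, encode) are hypothetical, the one-symbol-per-line file layout is an assumption, and placing the CTC blank at the last index is just one common convention. It only illustrates that the alphabet is a small character-to-index table, while the trie is built from whole words.

```python
import string


def build_alphabet(lines):
    """Map each symbol to an integer index, in the order it appears.

    Assumes one symbol per line (hypothetical layout). Only the trailing
    newline is stripped, so a line holding a single space is kept.
    """
    char_to_idx = {}
    for line in lines:
        symbol = line.rstrip("\n")
        if symbol == "":
            continue  # skip empty lines
        if symbol not in char_to_idx:
            char_to_idx[symbol] = len(char_to_idx)
    return char_to_idx


def encode(text, char_to_idx):
    """Turn a normalized transcript into a list of character indices."""
    return [char_to_idx[ch] for ch in text]


def build_trie(words):
    """Build a prefix tree over whole words using nested dicts."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["<end>"] = True  # marks that a complete word ends here
    return root


# Alphabet: space, apostrophe, and 26 lowercase letters (28 symbols).
alphabet_lines = [" \n", "'\n"] + [c + "\n" for c in string.ascii_lowercase]
alphabet = build_alphabet(alphabet_lines)

# Assumption: the CTC blank takes the index right after the alphabet.
blank_idx = len(alphabet)
num_outputs = len(alphabet) + 1

# Word list: a tiny stand-in for the contents of a file like trie_vocab.txt.
trie = build_trie(["the", "then", "there"])

encoded = encode("a b", alphabet)
```

For a dataset with other characters (digits, accented letters, a different script), only the alphabet lines change, and the number of model outputs grows with them. The word list that feeds the trie is a separate input and changes with the text used for decoding.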

shiv6146 commented 5 years ago

@vsl9 Thanks for your insights :+1: This helps!