flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Relationship between Lexicon and LM vocab #913

Open abhinavkulkarni opened 3 years ago

abhinavkulkarni commented 3 years ago

Hi,

Perhaps this is a naive question, but do we need to retain non-Lexicon words in the KenLM file? I haven't read the KenLM paper, but do backoff and smoothing require retaining non-Lexicon words in the LM?

For the streaming inference examples, the files downloaded from the wiki include a tokens file, a lexicon, and a KenLM file. The LexiconDecoder can only decode words present in the Lexicon (200k in number). During decoding, partial beams are scored based on the prefixes of incomplete words (using max smearing). Whenever a full word is completed, the max-smeared score is removed and the word's score from the LM is added.
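To make my mental model concrete, here is a toy sketch of the max smearing I'm describing (the class and function names are mine for illustration, not flashlight's actual API):

```python
# Toy sketch of max smearing over a lexicon trie: every prefix node
# caches the best unigram LM score among all words reachable below it.
import math

class TrieNode:
    def __init__(self):
        self.children = {}        # token -> TrieNode
        self.word_score = None    # unigram LM log-prob if a word ends here
        self.smeared = -math.inf  # max score over all completions below

def insert(root, spelling, unigram_logprob):
    node = root
    for tok in spelling:
        node = node.children.setdefault(tok, TrieNode())
    node.word_score = unigram_logprob

def smear_max(node):
    """Post-order pass: each prefix gets the best score it can still reach."""
    best = node.word_score if node.word_score is not None else -math.inf
    for child in node.children.values():
        best = max(best, smear_max(child))
    node.smeared = best
    return best

root = TrieNode()
insert(root, list("the"), -1.2)
insert(root, list("they"), -2.5)
smear_max(root)
# A beam sitting on the prefix "th" carries the smeared score -1.2; once
# "they" completes, that -1.2 is replaced by the true LM score for "they".
```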

Given this, does the LM need to store any other words that are not in Lexicon?

Can you train the LM on a bigger corpus to correctly capture n-gram probabilities and then retain only Lexicon words to reduce its size?

Thanks!

tlikhomanenko commented 3 years ago

For smearing we use only unigram scores. An unknown word will just use the unk score from KenLM (KenLM stores the unk word as a normal word that accumulates all of the remaining probability mass). For an unseen n-gram, it backs off to the known (n-1)-gram plus a prior on the n-th word, and so on down to the 1-gram, where you have the full pdf.
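You can see both effects directly with the kenlm Python bindings; a minimal sketch ("lm.bin" is a placeholder path):

```python
# Minimal sketch with the kenlm Python bindings (pip install kenlm);
# "lm.bin" is a placeholder for your binarized model.
import kenlm

model = kenlm.Model("lm.bin")

# full_scores yields (log10 prob, matched n-gram length, is_oov) per word.
# An OOV word comes back with is_oov=True and the <unk> unigram score; a
# known word with a short matched length was scored by backing off toward
# the unigram distribution.
for logprob, ngram_len, oov in model.full_scores("the quick zorblax runs"):
    print(logprob, ngram_len, oov)
```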

So you can train KenLM with the same lexicon but on a larger corpus to capture more n-grams and get more precise probabilities (without doing backoff). However, the size of the KenLM model will also grow, so pay attention here: you can end up with a >100GB bin file (for example, if you take Common Crawl and use only a 200k dict for a 4-gram).
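As a sketch of such a build (file names are placeholders; `--limit_vocab_file` and `--prune` are standard `lmplz` flags, and both tools are assumed to be on PATH):

```python
# Sketch: build a 4-gram KenLM on a large corpus, restricted to the
# lexicon vocab. --limit_vocab_file drops n-grams containing words not in
# the given (whitespace-separated) vocab file; --prune here also drops
# 3-/4-grams seen only once, to keep the model size in check.
import subprocess

with open("corpus.txt") as stdin, open("lm.arpa", "w") as stdout:
    subprocess.run(
        ["lmplz", "-o", "4",
         "--limit_vocab_file", "lexicon_words.txt",
         "--prune", "0", "0", "1", "1"],
        stdin=stdin, stdout=stdout, check=True)

# Pack the ARPA into a binary for fast loading at decode time.
subprocess.run(["build_binary", "lm.arpa", "lm.bin"], check=True)
```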

abhinavkulkarni commented 3 years ago

Thanks, @tlikhomanenko. I understand your comment about smearing using only unigram scores and the <unk> token holding non-zero probability mass inside KenLM.

Let us say I want to train an LM on the Google News corpus, which is quite huge. Say it has 100M unique words (post-normalization), but I only want to maintain a lexicon of the top 200k most frequent words due to runtime resource constraints.

In this case, do you first train a huge ARPA LM on the whole corpus and then prune it so that:

  1. any unigram that is NOT in the top 200k list is removed?
  2. any n-gram that contains a word outside the top 200k list is removed?

I took a glance at the source of LexiconDecoder.cpp, and I can see this should work. This way, I get to keep my lexicon and LM small while still benefiting from the distributional semantics of Google News.
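For concreteness, here is a rough hand-rolled sketch of step 2 on an ARPA file (KenLM also ships a filter tool for this; note that naively dropping n-grams leaves the backoff weights slightly unnormalized):

```python
# Rough sketch: drop every n-gram containing a word outside the kept
# vocab, then rewrite the header counts. Assumes standard tab-separated
# ARPA lines: logprob <TAB> w1 w2 ... <TAB> optional backoff.

def filter_arpa(in_path, out_path, vocab):
    sections = {}  # order -> surviving lines
    order = None
    with open(in_path) as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("\\") and line.endswith("-grams:"):
                order = int(line[1:line.index("-")])
                sections[order] = []
            elif order is not None and line and not line.startswith("\\"):
                fields = line.split("\t")
                words = fields[1].split(" ")
                # Keep special tokens like <s>, </s>, <unk> unconditionally.
                if all(w in vocab or w.startswith("<") for w in words):
                    sections[order].append(line)

    with open(out_path, "w") as f:
        f.write("\\data\\\n")
        for n in sorted(sections):
            f.write(f"ngram {n}={len(sections[n])}\n")
        for n in sorted(sections):
            f.write(f"\n\\{n}-grams:\n")
            f.write("\n".join(sections[n]) + "\n")
        f.write("\n\\end\\\n")

vocab = {line.split()[0] for line in open("lexicon_words.txt") if line.strip()}
filter_arpa("big.arpa", "pruned.arpa", vocab)
```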

Let me know.

Thanks!

trangtv57 commented 3 years ago

Hi @abhinavkulkarni, you can easily build a KenLM language model with a limited vocab, like the preprocessing scripts in the lexicon-free recipe: the script to build the limited vocab for KenLM and the script to build KenLM with the limited-vocab option. I understand this to be the question you were asking, but I'm not sure. If you want to discuss further, just comment. Thanks.

tlikhomanenko commented 3 years ago

@trangtv57 yep, this is the correct comment: KenLM has an option to restrict the vocab. This restriction works as option 2 above from @abhinavkulkarni: it is applied on the fly during computation, so any n-gram (even a unigram) containing a word not in the provided file will not be counted. For my use case I used exactly that. KenLM doesn't have a mode where you keep all unigrams and prune only the higher-order n-grams with the restricted vocab.
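A quick sanity check of the result with the kenlm Python bindings (paths are placeholders):

```python
# Sanity check: after restricting the vocab, lexicon words should be
# known to the model and everything else should map to <unk>.
import kenlm

model = kenlm.Model("lm.bin")
words = [line.split()[0] for line in open("lexicon_words.txt") if line.strip()]

print(all(w in model for w in words))  # expected: True
print("zorblax" in model)              # expected: False -> scored as <unk>
```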
