abhinavkulkarni opened this issue 3 years ago
For smearing we use only unigram scores. An unknown word just gets the unk score from KenLM (KenLM treats the unk word as a normal word that accumulates all of the remaining probability mass). For an unknown n-gram, KenLM backs off: it uses the known (n-1)-gram plus a prior on the n-th word, and so on down to the unigram level, where you have the full pdf.
So you can train KenLM with the same lexicon but on a larger corpus to capture more n-grams and get more precise probabilities (without backing off). However, the size of the KenLM model will also grow, so pay attention here: you can end up with a >100GB binary file (for example, if you take Common Crawl and use only a 200k dictionary for a 4-gram).
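To illustrate the back-off chain described above, here is a toy sketch with made-up log probabilities and back-off weights (this is only an illustration, not KenLM's actual implementation):

```python
# Toy illustration of the back-off chain: if an n-gram is missing, fall back
# to a shorter history (adding that history's back-off weight) until the
# unigram level, where every word -- including <unk> -- has a probability.

# Hypothetical log10 probabilities and back-off weights for a tiny 3-gram LM.
LOGPROB = {
    ("the", "cat", "sat"): -0.5,
    ("cat", "sat"): -0.9,
    ("sat",): -2.1,
    ("<unk>",): -4.0,   # <unk> absorbs the leftover unigram mass
}
BACKOFF = {
    ("the", "cat"): -0.3,
    ("cat",): -0.2,
}

def score(history, word):
    """Return log10 P(word | history) with back-off."""
    ngram = tuple(history) + (word,)
    if ngram in LOGPROB:
        return LOGPROB[ngram]
    if not history:
        # Unigram level: unknown words get the <unk> score.
        return LOGPROB[("<unk>",)]
    # Pay the back-off weight of the history we failed to match,
    # then retry with the shorter (n-1)-gram history.
    return BACKOFF.get(tuple(history), 0.0) + score(history[1:], word)

print(score(["the", "cat"], "sat"))   # full trigram hit: -0.5
print(score(["the", "dog"], "sat"))   # backs off down to the unigram "sat"
print(score([], "zzz"))               # unknown word -> <unk> score
```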
Thanks @tlikhomanenko, I understand your comment about smearing using only unigram scores and the <unk> token having a non-zero probability mass inside KenLM.
Let's say I want to train an LM on the Google News corpus, which is quite huge. Suppose it has 100M unique words (post-normalization), but I only want to maintain a lexicon of the 200k most frequent words due to runtime resource constraints.
In this case, do you first train a huge ARPA LM on the whole corpus and then prune it so that:
I took a glance at the source of LexiconDecoder.cpp, and I can see this should work. This way I get to keep the size of my lexicon and LM small while still benefiting from the distributional semantics of Google News.
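For concreteness, here is a rough sketch of how I would extract the 200k most frequent words to use as both the decoder lexicon and the restricted KenLM vocab (file names and the cutoff are just placeholders):

```python
# Minimal sketch: pick the 200k most frequent words from an already
# normalized corpus to serve as both the decoder lexicon and the
# restricted KenLM vocabulary. File names here are placeholders.
from collections import Counter

TOP_K = 200_000
counts = Counter()

with open("corpus_normalized.txt", "r", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

with open("vocab_200k.txt", "w", encoding="utf-8") as out:
    for word, _ in counts.most_common(TOP_K):
        out.write(word + "\n")
```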
Let me know.
Thanks!
Hi @abhinavkulkarni, you can easily handle a KenLM language model built with a limited vocab, like the preprocessing script in the lexicon-free model: there is a script to build the limited vocab for KenLM and a script to build KenLM using KenLM's limited-vocab option. I understand this is the question you wanted answered, but I'm not sure. If you have anything else to discuss, just comment. Thanks.
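For example, a rough sketch of that flow (paths are placeholders, and you should check `lmplz --help` in your KenLM build to confirm the limited-vocab option discussed in this thread is available):

```python
# Rough sketch of building a restricted-vocab KenLM model by shelling out
# to the KenLM binaries. Paths and the --limit_vocab_file flag should be
# checked against your local KenLM build (`lmplz --help`).
import subprocess

CORPUS = "corpus_normalized.txt"   # placeholder paths
VOCAB = "vocab_200k.txt"
ARPA = "lm_4gram_200k.arpa"
BIN = "lm_4gram_200k.bin"

with open(CORPUS, "rb") as stdin, open(ARPA, "wb") as stdout:
    subprocess.run(
        ["lmplz", "-o", "4", "--limit_vocab_file", VOCAB],
        stdin=stdin, stdout=stdout, check=True,
    )

# Convert the ARPA file to a smaller, faster-to-load binary model.
subprocess.run(["build_binary", ARPA, BIN], check=True)
```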
@trangtv57 yep, this is the correct comment: KenLM has an option to restrict the vocab. This restriction works as in option 2 pointed out by @abhinavkulkarni: it is applied on the fly during computation, so any n-gram (even a unigram) containing a word not in the provided file is not counted. For my use case I used exactly that. KenLM doesn't have a mode where you keep all unigrams and prune only the higher orders with the restricted vocab.
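A toy illustration of that on-the-fly behavior (not KenLM internals), just to make the semantics explicit:

```python
# Toy illustration of the restricted-vocab behavior described above:
# any n-gram containing a word outside the vocab file is simply not
# counted -- there is no mode that keeps all unigrams and prunes only
# the higher orders.
vocab = {"the", "cat", "sat"}

ngrams = [
    ("the", "cat"),
    ("cat", "sat"),
    ("the", "dinosaur"),   # "dinosaur" is out of vocab -> dropped entirely
    ("dinosaur",),         # even the unigram is dropped
]

kept = [g for g in ngrams if all(w in vocab for w in g)]
print(kept)   # [('the', 'cat'), ('cat', 'sat')]
```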
Options:
Hi,
Perhaps this is a naive question, but do we need to retain non-lexicon words in the KenLM file? I haven't read the KenLM paper, but do backoff and smoothing require retaining non-lexicon words in the LM?
For the streaming inference examples, the files downloaded from the wiki include a token file, a lexicon, and a KenLM file. The LexiconDecoder can only decode words present in the lexicon (200k in number). In it, partial beams are scored based on the prefixes of incomplete words (using max smearing). Whenever a full word is completed, the max-smeared score is removed and the word score from the LM is added. Given this, does the LM need to store any words that are not in the lexicon?
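To make the smearing part concrete, here is a toy sketch of max smearing over a lexicon trie (simplified, with made-up unigram scores; not the actual LexiconDecoder code):

```python
# Toy sketch of max smearing over a lexicon trie: each node caches the
# maximum unigram LM score of any complete word below it, which is used
# to score partial (incomplete) words during beam search.
class Node:
    def __init__(self):
        self.children = {}
        self.word_score = None          # unigram LM score if a word ends here
        self.smeared = float("-inf")    # max word score in this subtree

def insert(root, word, score):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, Node())
    node.word_score = score

def smear_max(node):
    """Propagate the max word score of each subtree up to its root node."""
    best = node.word_score if node.word_score is not None else float("-inf")
    for child in node.children.values():
        best = max(best, smear_max(child))
    node.smeared = best
    return best

# Hypothetical unigram scores for a tiny lexicon.
root = Node()
insert(root, "cat", -1.2)
insert(root, "cats", -2.5)
insert(root, "car", -0.8)
smear_max(root)

# The partial word "ca" is scored with the smeared max (-0.8, from "car");
# once a full word is emitted, this proxy is replaced by the LM word score.
node = root
for ch in "ca":
    node = node.children[ch]
print(node.smeared)   # -0.8
```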
Can we train the LM on a bigger corpus to correctly capture n-gram probabilities and then retain only lexicon words to reduce its size?
Thanks!