danpovey / pocolm

Small language toolkit for creation, interpolation and pruning of ARPA language models

Feature request #62

Closed danpovey closed 8 years ago

danpovey commented 8 years ago

There is a feature that I need for Kaldi reasons: it will allow us to model <unk> with a phone language model and have that model appear only once in the resulting graph. Basically, what I want is that <unk> should never be preceded by anything in the history.

So suppose that in get-text-counts we are getting counts for the text: a b <unk> c d. Assuming 4-gram, the sequence of n-grams then consists of (in their natural order):

<s> -> a
<s> a -> b
<s> a b -> <unk>
<unk> -> c
<unk> c -> d

So <unk> should behave a little like <s> in how it appears in the history (i.e. nothing ever precedes it). You will need kUnkSymbol. This should be done as an option to get-text-counts. Currently the usage is get-text-counts <ngram-order>. Now it should be:

get-text-counts [--limit-unk-history] <ngram-order>
This program reads lines of integerized text and outputs raw n-grams in
text form, one per line, in the format 
<reversed-history> <predicted-word>
e.g.
6     5      7 
See comments in code for more details, and get_counts.py for examples.
If the option --limit-unk-history is given, then any history greater
than bigram history that is to the left of <unk> (symbol number 3) 
will be truncated (this relates to keeping decoding graphs compact
for Kaldi purposes).
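
To make the requested behaviour concrete, here is a minimal Python sketch of the history truncation, written to reproduce the n-gram listing for a b <unk> c d given earlier. It is only an illustration and not the get-text-counts implementation: the function names and the limit_unk_history parameter are hypothetical, words are kept as strings rather than integerized symbols for readability, and, like the listing, it does not emit a prediction for the end-of-sentence symbol.

```python
BOS = "<s>"
UNK = "<unk>"


def ngrams_with_limited_unk_history(words, ngram_order, limit_unk_history=True):
    """Return (history, predicted_word) pairs for one sentence.

    Each history is a tuple of at most ngram_order - 1 words in natural
    (left-to-right) order. With limit_unk_history set, anything to the
    left of the most recent <unk> in the history is dropped, so <unk>
    is always the first word of any history it appears in.
    """
    padded = [BOS] + list(words)
    pairs = []
    for i in range(1, len(padded)):
        start = max(0, i - (ngram_order - 1))
        history = padded[start:i]
        if limit_unk_history:
            # Scan from the right for the most recent <unk> and cut there.
            for j in range(len(history) - 1, -1, -1):
                if history[j] == UNK:
                    history = history[j:]
                    break
        pairs.append((tuple(history), padded[i]))
    return pairs


def format_count_line(history, predicted_word):
    """Format one n-gram as '<reversed-history> <predicted-word>'."""
    return " ".join(list(reversed(history)) + [predicted_word])


if __name__ == "__main__":
    for history, word in ngrams_with_limited_unk_history("a b <unk> c d".split(), 4):
        print(" ".join(history), "->", word)
```

Run on a b <unk> c d with order 4, this prints the five n-grams from the listing in natural order. format_count_line shows how one of them would look in the reversed-history output format; for example, the n-gram <unk> c -> d becomes the line c <unk> d.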