Words specified in the vocabulary but not in training data do not appear in the language model

GoogleCodeExporter commented 8 years ago

I'm not sure if this is a feature or a bug, but when using estimate-ngram
with the -v option, words that are specified in the vocabulary but not seen
in the training data do not appear in the resulting LM. It would be nice if
there were a way to apply some discounting so that unigram probabilities are
also estimated for unseen words (as SRILM's ngram-count does).
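
To make the request concrete, here is a minimal sketch of one way unseen
vocabulary words can receive unigram probability: interpolating the
maximum-likelihood estimate with a uniform distribution over the vocabulary.
The function name `unigram_probs` and the fixed weight `lam` are assumptions
for illustration only; this is not MITLM's or SRILM's actual code, and real
toolkits derive the redistributed mass from the discounting method.

```python
from collections import Counter


def unigram_probs(tokens, vocab, lam=0.9):
    """Interpolate ML unigram estimates with a uniform distribution.

    `lam` is a hypothetical fixed interpolation weight, chosen only to
    keep the illustration short.
    """
    # Count only in-vocabulary tokens.
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values())
    uniform = 1.0 / len(vocab)
    if total == 0:
        # No usable training data: fall back to the uniform distribution.
        return {w: uniform for w in vocab}
    return {
        w: lam * counts[w] / total + (1.0 - lam) * uniform
        for w in vocab
    }


vocab = {"a", "b", "c"}          # "c" never occurs in the training tokens
probs = unigram_probs(["a", "a", "b"], vocab)
print(sorted(probs.items()))     # "c" gets a small non-zero probability
```

With a plain maximum-likelihood estimate, "c" would have probability zero and
would typically be left out of the written LM, which is the behaviour
described above.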

I'm using MITLM from SVN.

Original issue reported on code.google.com by alu...@gmail.com on 12 Dec 2008 at 3:22

GoogleCodeExporter commented 8 years ago

Yes, I am aware of this general issue.  When -v is specified, I need to set the
vocabulary to that set and only that set.  Neither the expansion nor the
filtering has been fully implemented or tested.  The issue becomes trickier when
specifying vocabularies for joint smoothing/interpolation optimization
(supported, but not exposed as executables yet).

Please let me know when this becomes a blocking issue for your use of MITLM.
I'll try to make the necessary changes before then.  In the meantime, I will be
working on fixing the implementation of count merging.

If you don't mind, can you please update to the latest version, which contains
various performance optimizations and binary LM improvements (smaller file,
faster save time, but slower load time)?  If you had to pick between (10s load
time, 838MB file) and (1.5s load time, 1710MB file), which would you prefer?  Do
you feel the toolkit should support both options?

Thanks.

Paul

Original comment by bojune...@gmail.com on 12 Dec 2008 at 3:38

Changed state: Accepted

GoogleCodeExporter commented 8 years ago

It is kind of a blocking issue for me now, but I'm trying to think of some
workarounds. Also, I would need the equivalent of the ngram-count -unk option,
but this is probably even trickier for you to implement.

As for the second question, I'm not sure which one to prefer. Probably smaller
files, but it's not very important.

Original comment by alu...@gmail.com on 12 Dec 2008 at 3:55

GoogleCodeExporter commented 8 years ago

An easy temporary workaround for -unk is to use SRILM to build the counts file
and then build the LM using MITLM.  I'll try to add both options over the
weekend.

Original comment by bojune...@gmail.com on 12 Dec 2008 at 4:01

GoogleCodeExporter commented 8 years ago

Features
--------
- Added support for --use-unknown to map all n-grams containing OOV words to
  <unk>.
- Unigrams back off to a uniform distribution across the entire vocabulary
  (including <unk>).
- Verified that --read-vocab filters the text/count input to only those n-grams
  made up entirely of in-vocabulary (non-OOV) words.
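
As a rough illustration of the two vocabulary behaviours listed above, the
sketch below counts bigrams once with OOV words mapped to `<unk>` and once with
any n-gram containing an OOV word dropped. The helper names (`map_oov`,
`count_with_unk`, `count_filtered`) are hypothetical and written only for this
example; they are not MITLM functions, and the real implementation may differ
in details such as sentence-boundary handling.

```python
from collections import Counter

UNK = "<unk>"


def ngrams(tokens, order):
    """Return all contiguous n-grams of the given order as tuples."""
    return [tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1)]


def map_oov(tokens, vocab):
    """Replace every out-of-vocabulary token with <unk>."""
    return [t if t in vocab else UNK for t in tokens]


def count_with_unk(tokens, vocab, order):
    """Counts in the spirit of --use-unknown: OOV words become <unk>."""
    return Counter(ngrams(map_oov(tokens, vocab), order))


def count_filtered(tokens, vocab, order):
    """Counts in the spirit of --read-vocab alone: drop n-grams with any OOV."""
    return Counter(
        g for g in ngrams(tokens, order) if all(w in vocab for w in g)
    )


tokens = "the cat sat on the mat".split()
vocab = {"the", "cat", "on", "mat"}   # "sat" is OOV

with_unk = count_with_unk(tokens, vocab, order=2)
filtered = count_filtered(tokens, vocab, order=2)
print(with_unk)   # keeps all five bigrams, two of them containing <unk>
print(filtered)   # keeps only the three bigrams with no OOV word
```

Mapping to `<unk>` preserves the total number of n-grams, whereas filtering
shrinks the count set, so the two options lead to different probability
estimates even for in-vocabulary words.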

Cleanup
-------
- Replaced BeginningOfSentence with EndOfSentence to reduce special cases.

Notes
-----
- The --read-vocab filter behaves differently from SRILM.  SRILM appears to
  first filter out n-grams with an OOV target word, and only later, when
  writing out the LM, remove those with an OOV word in the n-gram history.
  Thus, the count statistics used to estimate the probabilities are slightly
  different.
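
A small sketch of why the order of filtering matters, assuming the SRILM
behaviour is as guessed in the note above (it has not been checked against
SRILM's source). The example compares the number of distinct histories seen
before each word, a statistic used by Kneser-Ney style smoothing, under the two
filtering strategies. All function names here are hypothetical.

```python
from collections import defaultdict


def bigrams(tokens):
    """Return all adjacent word pairs as (history, target) tuples."""
    return list(zip(tokens, tokens[1:]))


def target_only_filter(grams, vocab):
    """Drop n-grams whose target (last) word is OOV; histories are kept."""
    return [g for g in grams if g[-1] in vocab]


def full_filter(grams, vocab):
    """Drop n-grams containing an OOV word in any position."""
    return [g for g in grams if all(w in vocab for w in g)]


def continuation_counts(grams):
    """Number of distinct histories observed before each target word."""
    histories = defaultdict(set)
    for history, target in grams:
        histories[target].add(history)
    return {w: len(hs) for w, hs in histories.items()}


tokens = "the cat sat on the mat".split()
vocab = {"the", "cat", "on", "mat"}   # "sat" is OOV

cc_target = continuation_counts(target_only_filter(bigrams(tokens), vocab))
cc_full = continuation_counts(full_filter(bigrams(tokens), vocab))
print(cc_target.get("on", 0))   # ("sat", "on") still contributes here
print(cc_full.get("on", 0))     # ("sat", "on") was removed up front
```

In this toy case the word "on" is seen after one history under target-only
filtering but after none under full filtering, so any estimate that depends on
such counts will differ between the two approaches.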

To Do
-----
- Need to reintroduce test cases with a small data set to make sure nothing is
broken.  The code base at this point is not sufficiently tested.

Original comment by bojune...@gmail.com on 15 Dec 2008 at 10:43

Changed state: Fixed
