Closed GoogleCodeExporter closed 8 years ago
Yes, I am aware of this general issue. When -v is specified, I need to set the
vocabulary to that set and only that set. Neither the expansion nor filtering
has been fully implemented or tested. The issue becomes trickier when
specifying vocabularies for joint smoothing/interpolation optimization
(supported, but not exposed as executables yet).
Please let me know when this becomes a blocking issue for your use of MITLM.
I'll try to make the necessary changes before then. In the meantime, I will be
working on fixing the implementation of count merging.
If you don't mind, could you please update to the latest version, which
contains various performance optimizations and binary LM improvements (smaller
file, faster save time, but slower load time)? If you had to pick between
(10s load time, 838MB file) and (1.5s load time, 1710MB file), which would you
prefer? Do you feel the toolkit should support both options?
Thanks.
Paul
Original comment by bojune...@gmail.com
on 12 Dec 2008 at 3:38
It is kind of a blocking issue for me now, but I'm trying to think of some
workarounds.
Also, I would need the equivalent of the ngram-count -unk option, but this is
probably even trickier for you to implement.
As for the second question, I'm not sure which one to prefer. Probably smaller
files, but it's not very important.
Original comment by alu...@gmail.com
on 12 Dec 2008 at 3:55
An easy temporary workaround for -unk is to use SRILM to build the counts file
and build the LM using MITLM. I'll try to add both options over the weekend.
Original comment by bojune...@gmail.com
on 12 Dec 2008 at 4:01
Features
--------
- Added support for --use-unknown to map all n-grams containing OOV words to
  <unk>.
- Unigrams back off to a uniform distribution across the whole vocabulary
  (including <unk>).
- Verified --read-vocab filters the text/count input to only those n-grams
  containing non-OOV words.
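
The difference between the two behaviours above can be sketched in a few lines
of Python. This is only an illustration, not MITLM code: the function names
are hypothetical, and it assumes that "mapping to <unk>" means replacing each
OOV word with <unk> before counting, and that the uniform unigram backoff is
1 / (vocabulary size including <unk>).

```python
from collections import Counter

UNK = "<unk>"

def ngrams(tokens, order):
    """Yield every n-gram of length 1..order as a tuple of words."""
    for n in range(1, order + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def count_with_unknown(tokens, vocab, order):
    """Assumed --use-unknown behaviour: replace OOV words by <unk>, then count."""
    mapped = [w if w in vocab else UNK for w in tokens]
    return Counter(ngrams(mapped, order))

def count_filtered(tokens, vocab, order):
    """Assumed --read-vocab behaviour: keep only n-grams with no OOV word."""
    return Counter(g for g in ngrams(tokens, order)
                   if all(w in vocab for w in g))

def uniform_unigram_backoff(vocab):
    """Assumed uniform backoff mass per word, counting <unk> as a word."""
    return 1.0 / len(set(vocab) | {UNK})

vocab = {"the", "cat", "sat"}
tokens = "the cat sat on the mat".split()

with_unk = count_with_unknown(tokens, vocab, 2)
filtered = count_filtered(tokens, vocab, 2)
backoff = uniform_unigram_backoff(vocab)
```

With the first function, "on" and "mat" both contribute to the counts of
<unk>; with the second, every n-gram touching them is simply dropped.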
Cleanup
-------
- Replaced BeginningOfSentence with EndOfSentence to reduce special cases.
Notes
-----
- The --read-vocab filter behaves differently from SRILM. SRILM appears to
  filter out n-grams with OOV as target word initially and later removes ones
  with OOV in the n-gram history when writing out the LM. Thus, the count
  statistics used to estimate the probabilities are slightly different.
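
To see why the statistics can differ, here is a rough Python sketch of the two
filtering strategies as described in the note. Both functions are hypothetical
and the SRILM side only mirrors what "appears to" happen according to the note
above; it is not taken from SRILM's source.

```python
def filter_all_at_once(counts, vocab):
    """Drop any n-gram containing an OOV word before estimation
    (the --read-vocab behaviour described above)."""
    return {g: c for g, c in counts.items()
            if all(w in vocab for w in g)}

def filter_target_only(counts, vocab):
    """Drop only n-grams whose last (target) word is OOV. N-grams with an
    OOV word in the history survive until the LM is written out, so they
    still contribute to the count statistics (assumed SRILM-like behaviour)."""
    return {g: c for g, c in counts.items() if g[-1] in vocab}

# Toy counts, made up for illustration only.
counts = {
    ("the", "cat"): 3,
    ("on", "the"): 2,   # OOV word in the history
    ("sat", "on"): 1,   # OOV word as the target
    ("the",): 5,
    ("on",): 2,
}
vocab = {"the", "cat", "sat"}

strict = filter_all_at_once(counts, vocab)
lenient = filter_target_only(counts, vocab)

# N-grams that only the target-only strategy keeps during estimation.
extra = set(lenient) - set(strict)
```

Here ("on", "the") is the only n-gram kept by the second strategy but not the
first, which is the kind of entry that would shift the estimation statistics.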
To Do
-----
- Need to reintroduce test cases with a small data set to make sure nothing is
broken. The code base at this point is not sufficiently tested.
Original comment by bojune...@gmail.com
on 15 Dec 2008 at 10:43
Original issue reported on code.google.com by alu...@gmail.com
on 12 Dec 2008 at 3:22