eddieantonio / mitlm

Automatically exported from code.google.com/p/mitlm
http://code.google.com/p/mitlm
BSD 3-Clause "New" or "Revised" License

interpolate-ngram: -unk with -lm is not implemented yet. #8

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
I have a background unigram model (bg.arpa), some additional training data (train.txt), and some dev text (dev.txt). I want to create an interpolated unigram model that optimizes the perplexity of dev.txt. I also need an open-vocabulary LM (-unk).

I execute:
$ interpolate-ngram -l bg.arpa -t train.txt -op dev.txt -o 1 -wf
"entropy:train.txt" -unk 1 -v etc/vocab

I get:
...
Loading component LM bg.arpa...
-unk with -lm is not implemented yet.
-- RefCounter----------
map[0x2aaaab5a5010] = 0
map[0x2aaaab0dc010] = 1
map[0x5a9c60] = 0
map[0x5a94f0] = 0
map[0x2aaaab1dd010] = 1
map[0x2aaaab018010] = 1
map[0x5a9f00] = 0
map[0x5a9cf0] = 0
-----------------------

Without -unk it seems to work fine.
OK, I understand it's not implemented, but maybe it's just a simple fix...
Otherwise, I think I know a workaround. Thanks.

Original issue reported on code.google.com by alu...@gmail.com on 26 Feb 2009 at 4:42

GoogleCodeExporter commented 8 years ago
The main issue here is that when we read in an n-gram containing an OOV word, we need to change it to <unk> and accumulate (not replace) the probabilities. The backoff weights for the n-grams that contain <unk> will also need to be recomputed.
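
For illustration, a minimal Python sketch of that accumulation for the unigram case, assuming a plain word-to-log10-probability mapping (this is not mitlm code; the vocabulary handling is simplified and the backoff recomputation is left out):

-----------------------------
import math
from collections import defaultdict

def merge_oov_unigrams(unigram_logprobs, vocab):
    # unigram_logprobs: dict word -> log10 probability (ARPA convention)
    # vocab:            set of in-vocabulary words
    linear = defaultdict(float)
    for word, logprob in unigram_logprobs.items():
        target = word if word in vocab else "<unk>"
        linear[target] += 10.0 ** logprob          # accumulate in probability space
    return {w: math.log10(p) for w, p in linear.items()}

# Two OOV words fold into a single <unk> entry:
probs = {"the": -1.0, "zyzzyva": -4.0, "qwertyuiop": -4.3}
print(merge_oov_unigrams(probs, vocab={"the", "<s>", "</s>"}))
-----------------------------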

If you can't find a workaround, let me know. Otherwise, it might take me a while to get to this as I am transitioning to my new job after grad school.

Original comment by bojune...@gmail.com on 26 Feb 2009 at 4:52

GoogleCodeExporter commented 8 years ago
I also came across the same problem of interpolating open-vocabulary language models. At first, I created closed-vocabulary models and interpolated them using mitlm, so mitlm optimized the weights for the individual language model components:

~/projects/mitlm.0.4/interpolate-ngram \
  -lm "corp1.lm, corp2.lm, corp3.lm, corp4.lm, corp5.lm" \
  -vocab vocab.txt \
  -interpolation LI \
  -op corp_dev.txt \
  -wl interpolated12345.lm

The output of this command contains:
...
OptParams     = [ -3.712318 -5.325820 -1.963826 -1.999009 ]
...

Then I created open-vocabulary models with the same parameters, with only "-unk" added. These models cannot be interpolated using mitlm, but SRILM can do that, so you only need to convert the mitlm weights to SRILM lambdas. Of course, you have to pass the models to mitlm and SRILM in the same order.

-----------------------------
#!/bin/bash

mitlm_interpolation_weights="-3.712318 -5.325820 -1.963826 -1.999009"

all_weights=`octave -q --eval "a=[ $mitlm_interpolation_weights ]; disp(1-sum(exp(a))); disp(exp(a(2:end)));" | tr "\n" " "`
lambda1=`echo $all_weights | awk '{print $1}'`
lambda2=`echo $all_weights | awk '{print $2}'`
lambda3=`echo $all_weights | awk '{print $3}'`
lambda4=`echo $all_weights | awk '{print $4}'`

echo $lambda1 $lambda2 $lambda3 $lambda4

/usr/local/share/Srilm/bin/ngram \
  -lm corp1.unk.lm \
  -mix-lm corp2.unk.lm \
  -mix-lm2 corp3.unk.lm \
  -mix-lm3 corp4.unk.lm \
  -mix-lm4 corp5.unk.lm \
  -lambda $lambda1 \
  -mix-lambda2 $lambda2 \
  -mix-lambda3 $lambda3 \
  -mix-lambda4 $lambda4 \
  -unk \
  -write-lm interpolated12345.unk.lm
-----------------------------

Please correct me if I am doing something wrong, but for me it seems to work well.
Miso

Original comment by michal.f...@gmail.com on 3 Mar 2010 at 1:21

GoogleCodeExporter commented 8 years ago
Hmmm, it seems that my previous solution is not correct. Now I do the interpolation of open-vocabulary models in this way (a consolidated sketch of the four steps follows the list):

1. Build open-vocab models with -unk
2. Change <unk> to fake_unk: sed 's/<unk>/fake_unk/g' < lm > lm.fake_unk
3. Interpolate the models as closed-vocab, without -unk
4. Change fake_unk back to <unk>: sed 's/fake_unk/<unk>/g' < interp.lm.fake_unk > interp.lm
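
A minimal end-to-end sketch of these four steps, assuming two component models and reusing the interpolate-ngram flags shown earlier (the file names and the swap_token helper are illustrative, not part of mitlm):

-----------------------------
#!/usr/bin/env python3
import subprocess
from pathlib import Path

def swap_token(src, dst, old, new):
    # Same effect as the sed step: swap one token everywhere in an ARPA file.
    Path(dst).write_text(Path(src).read_text().replace(old, new))

lms = ["corp1.unk.lm", "corp2.unk.lm"]          # step 1: models built with -unk
masked = [lm + ".fake_unk" for lm in lms]

for lm, out in zip(lms, masked):                # step 2: hide <unk>
    swap_token(lm, out, "<unk>", "fake_unk")

subprocess.run(["interpolate-ngram",            # step 3: interpolate as closed vocab
                "-lm", ", ".join(masked),
                "-interpolation", "LI",
                "-op", "corp_dev.txt",
                "-wl", "interp.lm.fake_unk"], check=True)

swap_token("interp.lm.fake_unk", "interp.lm",   # step 4: restore <unk>
           "fake_unk", "<unk>")
-----------------------------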

A few notes about the interpolation parameters:

* ARPA file probabilities are in log10.
* N-1 parameters are trained during interpolation (N is the number of LMs).
* The weights for the individual LMs are: weights = [ 1 exp(param1) exp(param2) ... exp(paramN-1) ]
* The interpolated probability for a word W is:
  ( prob_LM1(W) * weights(1) + prob_LM2(W) * weights(2) + ... + prob_LMN(W) * weights(N) ) / sum(weights)
  ...where prob_LM1(W) is the probability (not in log anymore) of word W in LM number 1.

% Matlab code for interpolation:
params = [ 0.798976 ]               % parameters from mitlm interpolation
weights = [ 1 exp(params) ]         % weights of individual LMs
probs_LM = [ -1.853477 -3.204265 ]  % log10 probabilities of some word W in the individual LMs
probs_LM = 10.^probs_LM             % convert to normal probabilities

probs_LM*weights' / sum(weights)    % interpolated probability
log10(ans)                          % convert to log10 -> the number in the ARPA file
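
The same computation for an arbitrary number of component LMs, written as a small Python sketch of the formula in the notes above (the helper name is illustrative and not part of mitlm or SRILM):

-----------------------------
import math

def interpolate_logprob(mitlm_params, log10_probs):
    # mitlm_params: the N-1 OptParams reported by interpolate-ngram
    # log10_probs:  log10 probability of one word in each of the N component LMs
    weights = [1.0] + [math.exp(p) for p in mitlm_params]
    prob = sum(w * 10.0 ** lp for w, lp in zip(weights, log10_probs)) / sum(weights)
    return math.log10(prob)

# Reproduces the two-LM Matlab example above:
print(interpolate_logprob([0.798976], [-1.853477, -3.204265]))
-----------------------------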

Original comment by michal.f...@gmail.com on 5 May 2010 at 5:25