danpovey / pocolm

Small language toolkit for creation, interpolation and pruning of ARPA language models

About interpolation approach #98

Open pehonnet opened 5 years ago

pehonnet commented 5 years ago

Hi,

I understand from the motivation doc that the idea is to interpolate at the n-gram count level, with weights estimated on a dev set. So, when you want to build a new LM from, say, 3 sources (trainA, trainB, trainC), you would collect their counts, find the optimal weights on the dev set, and then build the LM (lm_1). What would you suggest as the best way to create a new LM when an additional source arrives, so that you now have the initial 3 sources plus a new one (trainA, trainB, trainC and trainD)? Should we simply reuse the counts from the previous training together with the counts from the new set, and re-estimate the interpolation weights on the (probably new) dev set? If I understood correctly, that is the only way to do the interpolation with this tool, i.e. we can't take the previously built LM (lm_1) and somehow interpolate it with the new data?
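To make the current setup concrete, here is roughly what the 3-source build looks like on my side. This is only a sketch: the layout follows the convention that the text dir contains dev.txt plus one .txt file per source, but the paths, vocabulary size and other option values are placeholders, and the flags are recalled from the egs/ example scripts, so they may need adjusting.

```sh
# Sketch only: paths and option values are placeholders; flags are recalled
# from the egs/ example scripts and may not match the current version exactly.

# One .txt file per training source, plus dev.txt, which is used to
# estimate the interpolation (metaparameter) weights.
mkdir -p data/text
cp trainA.txt trainB.txt trainC.txt data/text/
cp dev.txt data/text/

# Build a trigram LM; the weights over the sources are optimized on dev.txt.
train_lm.py --num-words=20000 --num-splits=5 --warm-start-ratio=10 \
  data/text 3 data/lm/lm_1

# Export to ARPA format if needed.
format_arpa_lm.py data/lm/lm_1 | gzip -c > data/arpa/lm_1.arpa.gz
```

For the 4-source case, my current understanding is that I would simply add trainD.txt (and possibly a new dev.txt) to the same directory and rerun train_lm.py from scratch over all four sources, rather than reusing lm_1 in any way.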

Thanks

PS: there is a typo in motivation.md (search "estmiate")