Min-counts - Githubissues

danpovey commented 8 years ago

I'm adding a note here, although this is not really an 'issue' in the normal sense.

I just checked in code that supports enforcing min-counts. This should make building and pruning LMs about twice as fast without affecting perplexity results much. @chris920820 and @keli78, can you please test this? It's done at the get_counts stage: you should now use get_counts.py, which supports the --min-counts option, instead of get_counts.sh. Here are some experiments to do, e.g. on the Switchboard+Fisher setup. Try two settings: (a) min-counts=2, (b) min-count for fisher=2, swbd=1 [these min-counts will be applied for orders 3 and higher].

See how much faster the LM estimation is than before. And check that the process of getting the counts does not become too slow (increase the --num-jobs to get_counts.py if it does).
See how the min-counts affect the curve of LM size versus perplexity on dev data as you prune with various thresholds.
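To make the two min-count settings above concrete, here is a minimal Python sketch of a per-source count threshold applied only to n-grams of order 3 and higher. It is not pocolm's implementation: the function names (`ngram_counts`, `apply_min_count`, `merge_sources`) are hypothetical, and the sketch simply drops low-count n-grams, which may differ from how the toolkit treats the removed counts.

```python
from collections import Counter


def ngram_counts(sentences, order):
    """Count all n-grams of orders 1..order in whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


def apply_min_count(counts, min_count, min_order=3):
    """Keep every n-gram below min_order; keep higher orders only if count >= min_count."""
    return Counter({
        ngram: c
        for ngram, c in counts.items()
        if len(ngram) < min_order or c >= min_count
    })


def merge_sources(sources, min_counts, order):
    """Apply each source's own min-count, then add the surviving counts together.

    `sources` maps a source name to a list of sentences and `min_counts` maps
    the same names to thresholds (both are illustrative structures only).
    """
    merged = Counter()
    for name, sentences in sources.items():
        merged.update(apply_min_count(ngram_counts(sentences, order), min_counts[name]))
    return merged


# Setting (b) from the comment above: fisher=2, swbd=1 (toy data, not real corpora).
toy_sources = {"fisher": ["a b c", "a b c", "a b d"], "swbd": ["x y z"]}
toy_counts = merge_sources(toy_sources, {"fisher": 2, "swbd": 1}, order=3)
```

With these toy inputs, the trigram `a b d` seen once in the fisher data is dropped, while the singleton trigrams from swbd survive because that source's threshold is 1. Unigrams and bigrams are never filtered.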

Don't bother testing decoding using these LMs versus no-mincount ones for different pruning thresholds, as the differences will likely be too small to measure. But you could do an experiment where you do rescoring with the full no-mincount vs with-mincount LMs, and see if the WER is affected [which is unlikely].

You may discover some bugs as you do this. @chris920820, you could perhaps make a pull request where you replace instances of get_counts.sh with get_counts.py. I know you already did this, but that pull request is now out of date. Let's wait a bit before removing the old script get_counts.sh.

Dan

vince62s commented 8 years ago

Dan, I am sure you applied the min-counts to orders 3 and above to replicate the SRILM behavior, but I really think that also pruning the lower orders, i.e. unigrams and bigrams, could be helpful. It does not make sense to keep typos and the like in the LM. For your info, Ken Heafield made the change and KenLM now supports unigram pruning.

danpovey commented 8 years ago

If you want to prune unigrams, that's something that can be done while preparing the word-list. The reason why I disallowed pruning bigram counts is that it would have required changes elsewhere in the toolkit [and anyway they can be removed as part of the entropy-pruning operation later on].
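As a rough illustration of pruning unigrams at word-list preparation time, the sketch below keeps only words seen at least a given number of times in the training text. The function name `build_word_list` and its parameters are hypothetical and not part of the toolkit; what happens to words left out of the list (for example, mapping them to an unknown-word symbol) depends on the rest of the pipeline and is not shown.

```python
from collections import Counter


def build_word_list(lines, min_unigram_count):
    """Return the sorted list of words whose total count is >= min_unigram_count."""
    counts = Counter(word for line in lines for word in line.split())
    return sorted(word for word, c in counts.items() if c >= min_unigram_count)


# Toy example: the one-off typo "catt" falls below the threshold and is left out.
toy_word_list = build_word_list(["the cat", "the catt", "the cat"], min_unigram_count=2)
```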

Dan

On Tue, Jun 28, 2016 at 7:32 AM, vince62s notifications@github.com wrote:

Dan, I am sure you applied the min-counts to order 3 and above to replicate the SRILM behavior, but I really think pruning also lower order ie unigram and bi-gram could be helpful. It does not make sense to keep typos or such in the LM. For your info, Ken Heafield made the change and KenLM supports now unigram pruning.

vince62s commented 8 years ago

Well, my comment was about unigrams and bigrams... anyway, this can be done differently.

vince62s commented 8 years ago

After last night's fix it's running fine now. I am running some perplexity and LM-size comparisons right now across various configurations.

danpovey / pocolm

Min-counts #20