danpovey / pocolm

Small language toolkit for creation, interpolation and pruning of ARPA language models

Training text on a single line. #77

Open francisr opened 7 years ago

francisr commented 7 years ago

I've tried putting my training text on a single line to simulate SRI's continuous n-gram counting. It worked fine for creating an ARPA LM (though I then can't split the counts), but after pruning, <s> no longer appears in the ARPA LM (</s> is still there), which is annoying in some applications.
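For context, a minimal sketch of the "training text on a single line" setup being described; the file names here are made up for illustration:

```sh
# Collapse the corpus to a single line so that n-grams span the original
# line breaks, mimicking SRILM's continuous n-gram counting.
tr '\n' ' ' < corpus.txt > corpus_oneline.txt
```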

danpovey commented 7 years ago

Is there a compelling reason why you need that feature? And would it be feasible to merge into medium-length lines, like 100 words?

francisr commented 7 years ago

It's so the LM can be used with http://www.speech.sri.com/projects/srilm/manpages/hidden-ngram.1.html to add end-of-sentence markers.
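A hedged sketch of that use case, assuming the single-line file from above: hidden-ngram inserts hidden tokens (here </s>) at likely positions given the LM. File names are hypothetical:

```sh
# Treat </s> as a hidden event and let hidden-ngram place it at likely
# sentence boundaries; tagged text goes to stdout.
echo "</s>" > hidden.vocab
hidden-ngram -text corpus_oneline.txt -lm lm.arpa \
    -hidden-vocab hidden.vocab > segmented.txt
```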

danpovey commented 7 years ago

Hm. My feeling is that if it's part of a pipeline involving other SRILM tools, it might be better to stay within the SRILM universe. Our more recent experiments have actually failed to show a super-compelling improvement of pocolm over SRILM. The place where there was originally a compelling improvement was highly-pruned models, but it turns out that if you use Good-Turing estimation in SRILM, the highly-pruned SRILM models are almost as good as the pocolm ones. Now, Good-Turing doesn't work as well for un-pruned or lightly-pruned models, but in that case you can just use SRILM's Kneser-Ney. There is a region in the middle [moderately pruned models] where pocolm is a fair bit better than SRILM's Kneser-Ney or Good-Turing, but that may not be enough to justify the added hassle.

Dan
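For concreteness, a hedged sketch of the SRILM Good-Turing baseline described above (Good-Turing is ngram-count's default discounting); the file names and pruning threshold are made up:

```sh
# Good-Turing trigram (SRILM's default discounting), then entropy pruning.
ngram-count -order 3 -text corpus.txt -lm gt.arpa
ngram -order 3 -lm gt.arpa -prune 1e-7 -write-lm gt_pruned.arpa
```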

francisr commented 7 years ago

In my current experiments pocolm still seems to be worth it. Do you think the gain can depend on the size of the training set?

There is also the licensing point of view: I use SRILM just for reference; I have another set of tools that do the same things.

On 13 October 2016 at 17:53, Daniel Povey notifications@github.com wrote:

Hm. My feeling is that if it's part of a pipeline involving other SRILM tools, it might be better to stay within the SRILM universe. Our more recent experiments have actually failed to show a super-compelling improvement of pocolm versus SRILM. The place there was originally a compelling improvement was in highly-pruned models, but it turns out that if you use Good-Turing estimation in SRILM, then the highly-pruned SRILM models are almost as good as the pocolm ones. Now, Good-Turing doesn't work as well with un-pruned or lightly-pruned models, but in that case you can use SRILM. There is a region in the middle [moderately pruned models] where pocolm is a fair bit better than SRILM's Kneser-Ney or Good-Turing, but that may not be enough to justify the added hassle.

Dan

On Thu, Oct 13, 2016 at 12:45 PM, Rémi Francis notifications@github.com wrote:

It's so this LM can be used with http://www.speech.sri.com/projects/srilm/manpages/hidden-ngram.1.html to add end of sentences.

On 13 October 2016 at 17:34, Daniel Povey notifications@github.com wrote:

Is there a compelling reason why you need that feature? And would it be feasible to merge into medium-length lines, like 100 words?

On Thu, Oct 13, 2016 at 9:24 AM, Rémi Francis < notifications@github.com> wrote:

I've tried putting my training text on a single line to simulate SRI's continuous-ngram-count, and it worked fine to create an ARPA lm (though I can't split the counts then), however after pruning doesn't appear in the arpa LM ( is still there though), which is annoying in some applications.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/danpovey/pocolm/issues/77, or mute the thread https://github.com/notifications/unsubscribe- auth/ADJVu99dBkm9pKlYT-IFr_9P38iveQKDks5qzjD9gaJpZM4KV4dV .

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/danpovey/pocolm/issues/77#issuecomment-253566937, or mute the thread https://github.com/notifications/unsubscribe-auth/AB- 8ZJiMA3zArHIUCgNzM7Wk35BBG5stks5qzl2cgaJpZM4KV4dV .

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/danpovey/pocolm/issues/77#issuecomment-253569709, or mute the thread https://github.com/notifications/unsubscribe-auth/ADJVu-VBSIzNRlJWoIMH- l2iLgUR0YY_ks5qzmAQgaJpZM4KV4dV

.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/danpovey/pocolm/issues/77#issuecomment-253571902, or mute the thread https://github.com/notifications/unsubscribe-auth/AB-8ZGOEESKaVbTaHwfhoCABCdYSVOltks5qzmH6gaJpZM4KV4dV .

danpovey commented 7 years ago

What kind of perplexity improvements are you seeing versus SRILM, and in what kind of scenario (e.g. how many training sets, how much data, what level of pruning)?

vince62s commented 7 years ago

Btw, until recently I didn't know about this: http://www.speech.sri.com/pipermail/srilm-user/2010q3/000928.html, but it works fine. E.g. pocolm and SRILM are in line in a scenario with out-of-domain = Cantab text and in-domain = TED corpus.

danpovey commented 7 years ago

In our experiments we did not see that --prune-history-lm was that helpful; we found it best to just use Good-Turing LMs throughout. But it could be we did something wrong.

Dan
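For reference, a hedged sketch of the trick from that mailing-list post: prune a Kneser-Ney model while taking the history marginals from a separate (e.g. Good-Turing) model. File names and the threshold are hypothetical; SRILM spells the option with a single dash:

```sh
# Kneser-Ney trigram, plus a Good-Turing model used only to guide pruning.
ngram-count -order 3 -text corpus.txt -kndiscount -interpolate -lm kn.arpa
ngram-count -order 3 -text corpus.txt -lm gt.arpa
ngram -order 3 -lm kn.arpa -prune 1e-7 -prune-history-lm gt.arpa \
    -write-lm kn_pruned.arpa
```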

francisr commented 7 years ago

I have trained a trigram on one training set with 1.5G words, and I prune it to about 1M n-grams. On the test sets I get:

- pocolm: 153 ppl with 1,310,647 n-grams
- pocolm: 159 ppl with 1,034,962 n-grams
- SRILM: 160 ppl with 1,263,941 n-grams

I haven't yet tried using multiple training sets.
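For reference, perplexities like these are typically measured with SRILM's ngram tool; a minimal sketch with made-up file names:

```sh
# Report the perplexity of a pruned model on held-out text.
ngram -order 3 -lm gt_pruned.arpa -ppl test.txt
```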

@vince62s I did some tests with that at some point; IIRC it didn't bring much improvement at the level of pruning I used.

francisr commented 7 years ago

Btw, when it doesn't print <s> in the ARPA file, the `ngram 1=` count in the header still counts it as if it were there.
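To illustrate, a made-up, annotated excerpt of the ARPA format showing the mismatch: the header's unigram count includes <s>, but no <s> entry follows.

```
\data\
ngram 1=3        <- counts <s> among the unigrams
ngram 2=42

\1-grams:
-1.2341	</s>
-2.5678	hello	-0.3012
(no <s> entry, so only two unigrams are actually listed)
```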

danpovey commented 7 years ago

Rémi, under what circumstances does it not print the <s> in the unigram section of the ARPA file?

And were those SRILM results with Good-Turing smoothing or Kneser-Ney?

francisr commented 7 years ago

It's when I have the whole training text on one line and then prune the LM.
The SRILM results are with Good-Turing.

danpovey commented 7 years ago

Can you please see if that PR fixes the issue? It will only be necessary to re-run format_arpa_lm.py (or whatever it's called) after compiling.