danpovey / pocolm

Small language toolkit for creation, interpolation and pruning of ARPA language models

Scope of this project #5

Closed vince62s closed 8 years ago

vince62s commented 8 years ago

Hi Dan,

I have a few questions on the scope of this project. I understand this is merely an LM creation tool, with bells and whistles to optimize perplexity and such.

I have two major points that have a big impact on the quality of the LM: preprocessing and vocab size.

For the latter: in your previous kaldi_lm toolkit, one had to produce a wordlist.txt file beforehand. I am not sure whether others would be interested, but I think producing a dictionary from the training text (with either a count threshold or a vocab-size limit) would make the process easier.
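For illustration, something as simple as the following would cover both variants (this is only a sketch, not a pocolm script; the file name and the numbers are made up):

```python
# Sketch: build a wordlist from training text, keeping either the words above
# a count threshold or the top-N most frequent words.
from collections import Counter

def build_wordlist(train_path, min_count=None, max_words=None):
    counts = Counter()
    with open(train_path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    words = [w for w, c in counts.most_common()
             if min_count is None or c >= min_count]
    return words if max_words is None else words[:max_words]

if __name__ == "__main__":
    for w in build_wordlist("train.txt", min_count=2, max_words=150000):
        print(w)
```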

The former point is more difficult: preprocessing. Many papers refer to a "standard ASR pipeline for preprocessing" (lowercasing or uppercasing, removing punctuation, expanding numbers into words, ...), but there is no actual implementation of this, even though it has a big impact on LM preparation.
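For illustration only, a very rough sketch of that kind of normalization (number expansion is deliberately left out, since doing it properly needs a real text-normalization component):

```python
# Rough, illustrative normalization: lowercase, strip punctuation, collapse
# whitespace.  Expanding numbers into words is not shown here; it needs a
# proper text-normalization tool.
import re

_PUNCT = re.compile(r"[^\w\s']")

def normalize(line):
    line = line.lower()
    line = _PUNCT.sub(" ", line)      # drop punctuation
    return " ".join(line.split())     # collapse whitespace

print(normalize("Hello, World -- it's 2016!"))   # -> "hello world it's 2016"
```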

If this is pointless, just close the issue. Cheers. Vincent.

nshmyrev commented 8 years ago

There is https://github.com/google/sparrowhawk which is promising but not complete.

danpovey commented 8 years ago

Yes, regarding preprocessing: the toolkit does not really do that. Regarding preparing the wordlist, we actually do make that easier. In https://github.com/danpovey/pocolm/blob/master/egs/swbd_fisher/run.sh see the lines:

# decide on the vocabulary.
counts_to_vocab.py --num-words=40000 data/word_counts > data/vocab_40k.txt

This uses a weighted combination of the different provided datasets, with weights estimated to optimize the dev-data perplexity of a unigram LM. So it will do something reasonable, taking into account how close the different datasets are to your dev data.
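(To illustrate the idea, here is the textbook EM procedure for mixture weights over per-corpus unigram distributions, maximizing dev-set likelihood. This is only a sketch of the concept, not pocolm's implementation, and the corpus file names are made up.)

```python
# Toy sketch (not pocolm's code): EM for interpolation weights over
# per-corpus unigram distributions, maximizing dev-data likelihood.
from collections import Counter

def read_words(path):
    with open(path, encoding="utf-8") as f:
        return [w for line in f for w in line.split()]

def unigram(path):
    counts = Counter(read_words(path))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def estimate_weights(corpus_paths, dev_path, iters=50, floor=1e-10):
    models = [unigram(p) for p in corpus_paths]
    dev = read_words(dev_path)
    lam = [1.0 / len(models)] * len(models)
    for _ in range(iters):
        acc = [0.0] * len(models)
        for w in dev:
            post = [l * m.get(w, floor) for l, m in zip(lam, models)]
            z = sum(post)
            for i, p in enumerate(post):
                acc[i] += p / z          # E-step: posterior over corpora
        lam = [a / len(dev) for a in acc]  # M-step: re-estimate weights
    return lam

# e.g. estimate_weights(["swbd.txt", "fisher.txt"], "dev.txt")
```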

There is also a way to use a user-provided wordlist (wordlist_to_vocab.py).

Dan


vince62s commented 8 years ago

Hi Dan, do you have time to write the three lines on how to get going with compilation and usage? I would like to test with the Cantab corpus and build the LM to see if it improves the decoding in the tedlium project. Thanks. Edit: actually, I think I'll figure it out myself.

danpovey commented 8 years ago

That would be cool. I just pushed a change to the README.md. It has not been compiled on many platforms so we may need to resolve problems that arise.


vince62s commented 8 years ago

Ah... I'll have an issue. What is your suggestion in terms of pruning? The new script is not done yet. When I used the kaldi_lm tools and the pruning script, I had to use a huge threshold (I think 20 or so) to get a pruned order-3 LM whose gz size was around 30MB. Any insight on this?

danpovey commented 8 years ago

I have not finished the pruning part of pocolm, although I expect the pruning, when done, to be very good (i.e. to give you smaller LMs, at the same perplexity, than something like SRILM, and even than kaldi_lm, which was better than SRILM). For now, it's probably best to test it via rescoring with const-arpa-lm.

I will eventually set up the pruning in pocolm so that you can specify the desired size of the LM. The fact that you had to use a large threshold with kaldi_lm is not unexpected: the threshold is defined differently than it is in SRILM.
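(For context: the standard Stolcke-style entropy pruning that SRILM implements drops an n-gram with history $h$ when removing its explicit probability, and renormalizing the backoff weight of $h$, changes the model by less than a threshold $\theta$. Stated loosely, with $p$ the original model and $p'$ the model after the removal:

$$ D \;=\; p(h)\sum_{w} p(w \mid h)\,\bigl[\log p(w \mid h) - \log p'(w \mid h)\bigr], \qquad \text{prune if } e^{D}-1 < \theta . $$

A toolkit whose threshold lives on a different scale, as kaldi_lm's apparently does, will naturally need very different numerical values; this is not a description of kaldi_lm's or pocolm's actual criterion.)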


vince62s commented 8 years ago

OK. First, it took really long to build the LMs with the script; it generated 122GB of data (for 0.9GB of text). To make it comparable I used a 150k vocab size. I had to remove the <s> and </s> symbols from the cantab-TEDLIUM.txt file. The unpruned order-3 and order-4 LMs are about the same size as the Cantab one; the numbers of 2-grams and 3-grams are also in the same range.

Then I rescored the already-decoded lattices from my tedlium run. Below is a comparison (first line of each pair = Cantab LM4, second line = pocolm-built LM4). Bottom line: no improvement, but it tends to be close.

Do you know how they built their LM at Cantab?

Cantab LM4: %WER 11.1 | 1155 27512 | 90.4 6.8 2.8 1.5 11.1 76.4 | -0.147 | exp/nnet2_online/nnet_ms_sp/decode_test.rescore/score_11_0.0/ctm.filt.filt.sys
pocolm LM4: %WER 11.7 | 1155 27512 | 89.8 7.4 2.8 1.6 11.7 77.9 | -0.397 | exp/nnet2_online/nnet_ms_sp/decode_test.rescore/score_11_0.0/ctm.filt.filt.sys

Cantab LM4: %WER 11.0 | 1155 27512 | 90.5 6.7 2.7 1.6 11.0 76.3 | -0.268 | exp/nnet2_online/nnet_ms_sp_online/decode_test.rescore/score_11_0.0/ctm.filt.filt.sys
pocolm LM4: %WER 11.1 | 1155 27512 | 90.3 6.8 2.9 1.5 11.1 77.1 | -0.365 | exp/nnet2_online/nnet_ms_sp_online/decode_test.rescore/score_12_0.0/ctm.filt.filt.sys

Cantab LM4: %WER 10.5 | 1155 27512 | 91.0 6.2 2.7 1.5 10.5 74.7 | -0.260 | exp/nnet2_online/nnet_ms_sp_smbr_0.000005/decode_epoch4_test.rescore/score_10_0.0/ctm.filt.filt.sys
pocolm LM4: %WER 10.6 | 1155 27512 | 91.0 6.3 2.7 1.5 10.6 74.7 | -0.386 | exp/nnet2_online/nnet_ms_sp_smbr_0.000005/decode_epoch4_test.rescore/score_10_0.0/ctm.filt.filt.sys

danpovey commented 8 years ago

Hm. It might be better to use the exact vocabulary, by extracting a wordlist from words.txt in lang_test and using wordlist_to_vocab.py. The vocab that they used (and that we used in the dictionary) might not be the same as the top 150k words.

What dev data did you use? If it was a subset of the tedlium data (as I imagine it was), then you'll probably want to use the option --fold-dev-into=train to make_lm_dir.py, to avoid losing that data.

Normally I'd say that they used SRILM, but I notice the 1-gram prob of BOS is not -99, so it must be a different toolkit.

Pocolm will tend to take a while to build an LM the first time, but I'm going to add a way to short-circuit the metaparameter estimation so that after someone has built it initially, re-building it is fast.

Dan
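(If it helps, here is a tiny sketch of that extraction step. It assumes the usual Kaldi words.txt format of "word integer-id" per line; the set of special symbols to drop, and the file paths, are just examples and may differ in your setup.)

```python
# Sketch: turn a Kaldi-style words.txt ("word  integer-id" per line) into a
# plain wordlist for wordlist_to_vocab.py, dropping Kaldi-internal symbols.
SPECIAL = {"<eps>", "<s>", "</s>"}          # example set; adjust to your setup

with open("data/lang_test/words.txt", encoding="utf-8") as f, \
     open("wordlist.txt", "w", encoding="utf-8") as out:
    for line in f:
        word = line.split()[0]
        if word in SPECIAL or word.startswith("#"):   # "#0" etc. are disambig symbols
            continue
        print(word, file=out)
```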


danpovey commented 8 years ago

Also, regarding the dev data: if you used a subset of cantab-TEDLIUM.txt, make sure you excluded it from the training data.


danpovey commented 8 years ago

Actually, they could have used IRSTLM to build the Cantab LM. I have a feeling IRSTLM requires the EOS symbols in the text, which would explain why they appear. If you could compare the dev-data perplexities with IRSTLM versus pocolm, that would be constructive. Of course, doing the same with SRILM would be good too. Dan


vince62s commented 8 years ago

OK, I am a bit confused by the first two posts. Yes, I took your script as-is, which means it takes the first 10000 lines to make the dev set and removes them from the training data; so I guess that answers the second post. Regarding the first one, I understand the vocab thing, but about the option --fold-dev-into=train: does this re-inject the dev set into the training data?

danpovey commented 8 years ago


Yes. Actually this probably won't make much of a difference because the amount of dev data is relatively small. But check whether the vocabulary you get from the top 150k words is the same as the vocab from words.txt in the lang directory. It's possible they are not the same.

Dan
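(A quick way to do that vocabulary check, for what it's worth; the file names below are placeholders.)

```python
# Sketch: report the differences between two vocabularies, e.g. the top-150k
# pocolm vocab versus the wordlist taken from words.txt in the lang directory.
def load_words(path):
    with open(path, encoding="utf-8") as f:
        return {line.split()[0] for line in f if line.strip()}

pocolm_vocab = load_words("data/vocab_150k.txt")           # placeholder path
lang_vocab = load_words("wordlist_from_words_txt.txt")     # placeholder path

print("only in pocolm vocab:", len(pocolm_vocab - lang_vocab))
print("only in lang vocab:  ", len(lang_vocab - pocolm_vocab))
for w in sorted(lang_vocab - pocolm_vocab)[:20]:           # peek at a few
    print(w)
```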


vince62s commented 8 years ago

Yes, the vocabs are a little different (non-alpha items like # and $, and also numbers). I could adapt your scripts to take only the Cantab dict and get the counts and everything, but instead what I did is slightly increase the pocolm vocab size (160k) to make sure I cover more of the words that could be in the Cantab vocab. It did decrease the WER to 10.5 (so it's a match) on the epoch4 one. I'll run some other tests.

danpovey commented 8 years ago

It would be great if you could help with comparing perplexities against IRSTLM/SRILM language models, using the exact same vocabulary. If you can help with this kind of thing, I can add you to the paper when we publish it.

In order to use a pre-existing wordlist, you'd replace the two lines:

get_word_counts.py data/text data/word_counts

counts_to_vocab.py --num-words=20000 data/word_counts > data/vocab_20k.txt

with something like:

wordlist_to_vocab.py existing_wordlist.txt > data/vocab.txt

where the existing wordlist has just a word on each line.

Dan


vince62s commented 8 years ago

I'll try to do this, but I think they did not use only IRSTLM; this link might be what they did, unless I am mistaken: https://arxiv.org/pdf/1312.3005.pdf. I was limiting my runs to orders 3 and 4 for time reasons, but I'll go up to order 5 to compare with that paper. However, I am wondering whether I should stick to a dev set of 10000 sentences, given a corpus of 7.8 million sentences; they seem to have held out 1%.

danpovey commented 8 years ago

I very much doubt they used any methods from the paper. IRSTLM would be the first thing I would try; if that fails, maybe KenLM. You could also try using the actual Tedlium training data as dev data. Dan


vince62s commented 8 years ago

I am stupid, it's in the readme file: https://arxiv.org/pdf/1502.00512v1.pdf. I ran perplexity comparisons against SRILM:

order 3: pocolm=122.637 srilm=126.48
order 4: pocolm=105.229 srilm=112.42
order 5: pocolm=100.084 srilm=109.615

At this point I only did it on the top 160k words for both, with the dev set being the first 10k sentences. Of course I did not fold dev into train for pocolm.
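(For reference, the numbers above are per-word perplexities of the usual form

$$ \mathrm{PPL} = \exp\!\Bigl(-\frac{1}{N}\sum_{i=1}^{N}\log p(w_i \mid w_1,\dots,w_{i-1})\Bigr), $$

so differences in how each toolkit counts end-of-sentence tokens and OOVs can shift the values slightly; that is one reason to compute the final comparison with a single tool, as is suggested further down the thread.)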

I will try KenLM first, because I also tried the Cantab LM with Moses and surprisingly it gave worse results than a KenLM-built model of the same order (which, however, was built on the news2014-shuffle corpus, which is unfiltered); I still expected the Cantab one to be a little bit better.

danpovey commented 8 years ago


Be careful when measuring perplexities using SRILM. If the order of the LM is greater than 3, you need to specify an option like '-order 4' or '-order 5' to 'ngram'. The perplexity change for order 3 is possibly believable, but not for the greater orders.

Dan


danpovey commented 8 years ago

Also, make sure you specify the '-unk' and related flags to SRILM where appropriate; it's best to take all the options from the srilm_baseline script in the Switchboard example in pocolm.

Dan


vince62s commented 8 years ago

That's what I did, taking exactly what's in the srilm_baseline script. KenLM order 3 just gave me 124.35. For KenLM, the only thing I am not sure about is whether to use --interpolate_unigrams 0 to match the SRILM method, or the default value (1, I think).

danpovey commented 8 years ago

Hm, OK. Try measuring the perplexity with the Cantab LM you downloaded, also. If you could run all these experiments with the exact vocab from the Cantab LM (e.g. the wordlist extracted from its unigram section), that would be helpful too. And at some point we'll need WERs based on rescoring (but don't do that just yet, as it might be slow).

Dan
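(In case it's useful for that, a small sketch of pulling the wordlist out of an ARPA file's \1-grams: section. The file name is hypothetical, and you may want to drop <s>, </s> and <unk> from the result.)

```python
# Sketch: extract the vocabulary from the \1-grams: section of an ARPA LM.
import gzip

def arpa_vocab(path):
    words = []
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as f:
        in_unigrams = False
        for line in f:
            line = line.strip()
            if line == "\\1-grams:":
                in_unigrams = True
                continue
            if in_unigrams:
                if line.startswith("\\"):            # reached \2-grams: or \end\
                    break
                if line:
                    words.append(line.split()[1])    # fields: logprob word [backoff]
    return words

vocab = arpa_vocab("cantab-TEDLIUM.lm4.gz")   # hypothetical file name
print(len(vocab))
```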


vince62s commented 8 years ago

Well, I need clarification here. When I run these experiments, it obviously takes the whole corpus, takes out 10k sentences for the dev set, and then uses the remainder to build the LM (pocolm, SRILM, KenLM). If I take the same dev set and run the ppl on the existing Cantab LM, those 10k sentences won't have been held out of that LM, so I would need to compare against another run that folds the dev set into the train data. Am I right?

danpovey commented 8 years ago

You're right. You could try using the test set for TEDLIUM (i.e. data/test/text) to check the perplexities of all of them (while still using the 10k sentences for pocolm for metaparameter estimation). The TEDLIUM test set is guaranteed to not be in the LM training data. You can measure them all using SRILM (don't forget the -order flag, and probably the -unk flag, and -map-unk, whatever I use in the example script) from the ARPA-format outputs. That will ensure that the perplexities are computed in exactly the same way. Don't forget to remove the utterance-ids.

Dan
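(For the utterance-id stripping, a minimal sketch, assuming the usual Kaldi data/test/text format where the first field of each line is the utterance id; the output file name is a placeholder.)

```python
# Sketch: drop the leading utterance-id from each line of a Kaldi text file.
with open("data/test/text", encoding="utf-8") as f, \
     open("dev_for_ppl.txt", "w", encoding="utf-8") as out:
    for line in f:
        fields = line.split()
        print(" ".join(fields[1:]), file=out)
```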


vince62s commented 8 years ago

order 3: pocolm=122.637 kenlm=124.35 srilm=126.48
order 4: pocolm=105.229 kenlm=107.558 srilm=112.42
order 5: pocolm=100.084 kenlm=102.783 srilm=109.615

I will continue tomorrow and retry KenLM without --interpolate_unigrams 0, but so far I can confirm they are all comparable.

danpovey commented 8 years ago

I'm surprised kenlm and srilm differ so much; if they are supposed to be running the same algorithm (presumably modified Kneser-Ney with interpolation), they should be giving the same result. When you have time, show me the command line used to estimate the SRILM LM, and to evaluate the probability. [i.e. the resultant command line, free of bash variables.]


vince62s commented 8 years ago

I need to re-run the tests, and you may need to adjust your script too. It seems that SRILM's default settings discard low-count trigrams and higher-order n-grams, so we need to turn that off and use -gt3min 1 (and similarly for the higher orders). Also, how does pocolm allocate p(unk), compared to what KenLM does? The impact of --interpolate_unigrams 0 is to increase p(unk): https://kheafield.com/code/kenlm/estimation/ Re-running and will keep you posted.

francisr commented 8 years ago

I can't find the exact scripts used for the cantab LMs, but here is some info:

The Cantab trigram was most likely built with HTK (with explicit <s> and </s> in the text); I don't think a cutoff was used.

The 4-gram was probably built with KenLM, git version f2889133b463255e7e4769b4ce2eb8058a524331, and with --interpolate_unigrams. That version of KenLM didn't allow a predefined vocab, so all the OOVs were mapped to a special symbol different from <unk> (like UNKUNKUNK), and after training the LM it would get merged with <unk>.

francisr commented 8 years ago

I have a question about pruning in pocolm: the goal seems to be to train n-grams with something close to modified KN and then prune them based on a target set. However, my understanding is that KN n-grams react badly to pruning; for example, in the billion-word-corpus paper, the small Katz n-gram model is better than the small KN one. Is this something you will deal with in your toolkit?

vince62s commented 8 years ago

With the -gtNmin 1 values for SRILM, it makes much more sense now. I am attaching the scripts.

order 3: pocolm=122.637 kenlm=124.35 srilm=123.764
order 4: pocolm=105.229 kenlm=107.558 srilm=107.029
order 5: pocolm=100.084 kenlm=102.783 srilm=102.269

lm-benchmark.txt

vince62s commented 8 years ago

@francisr where did you get your info? The readme file of the Cantab tarball explicitly refers to this paper: https://arxiv.org/pdf/1502.00512v1.pdf

vince62s commented 8 years ago

I recalculated your baseline for swbd too (notes on SRILM baselines, from local/srilm_baseline.sh):

3-gram: ppl = 84.0165 (was 84.6115)
4-gram: ppl = 82.5541 (was 82.9717)

francisr commented 8 years ago

@vince62s I work there, so I've asked around for some information. The person who trained the ngrams left a while ago, so I wasn't able to get the exact details, but this is what I was able to gather.

vince62s commented 8 years ago

The next experiment is based on your recommendation of last night. I am using the text of the Cantab-TEDLIUM test set as the basis for the ppl calculation (text from which I removed the utterance-ids). There are "only" 1155 sentences / 27522 words.

Running this on the Cantab LMs: order 3: 227.358 / order 4: 185.946

Then I ran the PPL against the already-built LMs from pocolm and KenLM (the only difference being the 10k sentences taken out in my previous run):
order 3: pocolm 189.237 / kenlm 195.347
order 4: pocolm 177.179 / kenlm 186.596
order 5: pocolm 175.351 / kenlm 184.871

Then I asked myself whether the 10k held-out sentences could make a difference; since it takes 11 hours for pocolm to run, I decided to rebuild only the KenLM models on the full corpus. Slightly: order 3 kenlm 194.929, order 4 kenlm 186.18, order 5 kenlm 184.459.

That could validate that Cantab LM4 was built on KenLM.

danpovey commented 8 years ago

I'll respond to the other things later, but regarding the LM pruning and whether it reacts well to KN vs Katz... this won't be an issue as we're not using the normal pruning method-- I'm using a more advanced pruning method that re-estimates the LM parameters as it prunes.

danpovey commented 8 years ago

OK- thanks, it looks like there has been good progress. Can I remind people that it's quite possible to submit pull requests to a personal repository? It would be very helpful to get pull requests with some of these fixes and experiments.

@vince62s: you say it took 11 hours... I'm surprised it took so long. Are you on a multi-core machine? Which stage takes a long time? There are two calls to optimize_metaparameters.py in the switchboard example script; which of those two calls takes a long time? You could increase ratio=10 to ratio=20 which would double the speed of the first of those two calls.

Dan


danpovey commented 8 years ago

Also @vince62s, regarding the speed: is it possible you're on a slow network drive and I/O is limiting the speed? Run 'top' and see if you're maxing out the CPU: if your processes are getting 100% CPU each, or the sum total of their %CPU equals the number of virtual cores in the machine (the number of 'processor' lines in /proc/cpuinfo), then it's not I/O-limited.
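(A quick sanity check along those lines, if helpful:)

```python
# Sketch: the number of virtual cores, i.e. the 'processor' lines in /proc/cpuinfo.
import os

print("virtual cores:", os.cpu_count())
# While the build runs, the pocolm processes shown in `top` should sum to
# roughly this many x 100% CPU if the job is not I/O-bound.
```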

vince62s commented 8 years ago

I'll try to make a PR; again, I'm not very used to this. I have 20 cores (Xeon 2.93 GHz) on this machine, and the drive was local for this task. Based on the timestamps of the folders, it's the second call that takes most of the time, about 3 hours per order. When I rerun it I'll check, but if I recall correctly, not all cores were being used.

danpovey commented 8 years ago

You could also increase num-splits to 10. I'll have to look into why this is slow. It's pretty fast for Switchboard (<10 minutes) but you have much more data I guess. Dan


danpovey commented 8 years ago

Oh, if you did a "git pull" and updated the scripts without recompiling the code, there was a change at one point that would have made the "splits" ineffective, so it wouldn't have been using 5 processes but just one. Dan


danpovey commented 8 years ago

Scratch this comment; I made a mistake, that wouldn't have happened. It should be using as many processes as there are splits (5 in the switchboard script):

"Oh, if you did a 'git pull' and updated the scripts without recompiling the code, there was a change at one point that would have made the 'splits' ineffective, so it wouldn't have been using 5 processes but just one."

francisr commented 8 years ago

Oh I see, that seems quite interesting. Do you plan to compare your way of adapting the LM to entropy filtering? I'd also like to give pocolm a go; is it in good shape now, or do you recommend waiting a bit?

danpovey commented 8 years ago

Oh I see, it seems quite interesting. Do you plan to compare your way of adapting the LM to entropy filtering?

Yes we do plan to make that comparison.

I'd also like to give Pocolm a go, is it in good shape now or do you recommend waiting a bit?

The pruning part of pocolm is not finished yet (wait a few days). However, if you have a situation where you have multiple source databases and want to optimize the weights as well as estimate the LM, pocolm should give you an advantage over SRILM, and I'd be interested to hear how it works. Dan

francisr commented 8 years ago

I do have many different sources, with very different amounts of data and quality (in terms of how it matches the use case, here ASR). I've seen that you can put multiple sources in data/text, but you need a single dev.txt. How should I choose it? For the training data do I have to have only one text file for each source? What does pocolm do with multiple sources that would be better than SRILM?

Also I was wondering if there will be options to have behaviour similar to SRILM's continuous-ngram-count, and -no-eos -no-sos?

danpovey commented 8 years ago

I do have many different sources, with very different amounts of data and quality (in terms of how it matches the use case, here ASR). I've seen that you can put multiple sources in data/text, but you need a single dev.txt. How should I choose it?

dev.txt should be some data that matches the domain you are interested in decoding.

For the training data do I have to have only one text file for each source?

Yes-- if you have separate text files, it would assign them separate weights, and it would get slow if you have too many separate text files due to too many weights to estimate.

What does pocolm do with multiple sources that would be better than SRILM?

It has a method of discounting where it combines all the data before discounting, instead of discounting first and then combining. This should be more optimal. Even for a single data source, the perplexities are better.

Also I was wondering if there will be options to have behaviour similar to SRILM's continuous-ngram-count, and -no-eos -no-sos?

Can I ask why you want this? It could perhaps be added in future, but I want to know the use case.

francisr commented 8 years ago

I'm experimenting with utterances that are not complete sentences, or that span multiple sentences, and with keeping punctuation in the LM, i.e. with dedicated word tokens for the punctuation marks.

vince62s commented 8 years ago

Ah... now we see the real user. As a matter of fact, I also use "hidden-ngram" for punctuation restoration and "disambig" for case restoration; both are a real need, for sure.

danpovey commented 8 years ago

Hm. This will become a question of how big the scope of the project should be. We'll see.
