danpovey / pocolm

Small language toolkit for creation, interpolation and pruning of ARPA language models

Using train_lm.py in Kaldi #47

Open danpovey opened 8 years ago

danpovey commented 8 years ago

In Kaldi, in egs/tedlium/s5_r2/local/train_ted_lm.sh [or something like that], we have a script that trains the pocolm LM. Because this was set up before we had train_lm.py, it doesn't use train_lm.py. I'd like it to be modified to use train_lm.py, and to have the option that bypasses metaparameter optimization set by default, to make the default build fast [but make it easy to comment out].

Ke, perhaps you could do this. This is really an issue in Kaldi, but mentioning it here.

keli78 commented 8 years ago

Sure, I'll have a try. Ke

keli78 commented 8 years ago

Hi Dan, while working on this task I found two small errors in train_lm.py, in the step that generates the vocab from a wordlist (ted_train_lm.sh uses a wordlist to generate the vocab): 1) an index error, and 2) the wrong input being passed to wordlist_to_vocab.py.

I fixed both errors as shown below and tested the change.

diff --git a/scripts/train_lm.py b/scripts/train_lm.py
index bbcfe05..e0fcea5 100755
--- a/scripts/train_lm.py
+++ b/scripts/train_lm.py
@@ -234,7 +234,7 @@ else:
     LogMessage("Skip generating vocab")
 else:
     LogMessage("Generating vocab with wordlist[{0}]...".format(args.wordlist))
-    command = "wordlist_to_vocab.py {1} > {2}".format(word_counts_dir, vocab)
+    command = "wordlist_to_vocab.py {0} > {1}".format(args.wordlist, vocab)
     log_file = os.path.join(log_dir, 'wordlist_to_vocab.log')
     RunCommand(command, log_file, args.verbose == 'true')
     TouchFile(done_file)

Do I need to do a PR for this?

Ke

Ke Li Dept. of Electrical and Computer Engineering Johns Hopkins University Email: kli26@jhu.edu

danpovey commented 8 years ago

Yes, do a PR.
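For reference, the index error in the diff above can be reproduced outside train_lm.py. The following is only a minimal sketch: the path values are hypothetical placeholders, not the ones the script actually computes, and the two helper functions exist purely for illustration. It shows why a format string that refers to {1} and {2} fails when only two positional arguments are supplied, and how the corrected string uses {0} and {1} with the wordlist as input.

```python
# Hypothetical stand-ins for values that train_lm.py derives from its arguments.
word_counts_dir = "work/word_counts"
wordlist = "data/wordlist.txt"
vocab = "work/vocab.txt"


def build_buggy_command():
    # Two positional arguments occupy indices 0 and 1, so {2} has nothing
    # to refer to and str.format raises IndexError.
    return "wordlist_to_vocab.py {1} > {2}".format(word_counts_dir, vocab)


def build_fixed_command():
    # Indices start at 0, and the input is the wordlist, not the counts dir.
    return "wordlist_to_vocab.py {0} > {1}".format(wordlist, vocab)


try:
    build_buggy_command()
    buggy_error = None
except IndexError as err:
    buggy_error = err

fixed_command = build_fixed_command()

print("buggy format string raised:", type(buggy_error).__name__)
print("fixed command:", fixed_command)
```

Note that fixing only the indices would not have been enough: "{0} > {1}" applied to (word_counts_dir, vocab) would run without an exception but would still pass the wrong input to wordlist_to_vocab.py, which is the second error described in the thread.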