danpovey / pocolm

Small language toolkit for creation, interpolation and pruning of ARPA language models

no data processed #101

Open gonese opened 3 years ago

gonese commented 3 years ago

get_objf_and_derivs_split.py: command discount-counts 0.8 0.4 0.2 0.1 data/local/local_lm/data/work/optimize_wordl_wordist_4_train-2_ted-1_subset20/work/split10/1/merged.4 data/local/local_lm/data/work/optimize_wordlist_4_train-2_ted-1_subset20/work/split10/1/float.4 data/local/local_lm/data/work/optimize_wordlist_4_train-2_ted-1_subset20/work/split10/1/discount.3 exited with status 1

Output error log: discount-counts: processed no data. I found that merged.4, float.4, and discount.3 are empty, but I do find non-empty .txt files and a .gz file under the input directory for train_lm.py.
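A quick way to confirm which of the intermediate files came out empty is to check their sizes directly. This is just a debugging sketch; the file names match the ones in the error above, and the work-directory path is whatever your run produced:

```python
import os

def report_sizes(work_dir, names=("merged.4", "float.4", "discount.3")):
    """Return {name: size in bytes, or None if the file is missing}
    for the intermediate count files in the given work directory."""
    sizes = {}
    for name in names:
        path = os.path.join(work_dir, name)
        sizes[name] = os.path.getsize(path) if os.path.exists(path) else None
    return sizes
```

Calling `report_sizes(".../work/split10/1")` and seeing `0` for all three entries would confirm the "processed no data" failure point.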

danpovey commented 3 years ago

Would need much more context, e.g. the whole output from when you were training.


gonese commented 3 years ago


I'm using the Tedlium example in Kaldi. The major change I made is turning every word into characters, e.g. word -> w o r d @, and making a dictionary with the 26 letters of the alphabet plus some special symbols. Here is the full log from training:

local/ted_train_lm.sh: training the unpruned LM
/home/cc4651/kaldi-trunk/egs/tedlium/s5_r3/../../../tools/pocolm/scripts/train_lm.py --wordlist=data/local/local_lm/data/wordlist --num-splits=5 --warm-start-ratio=10 --limit-unk-history=true --fold-dev-into=ted "--min-counts=train=2 ted=1" data/local/local_lm/data/text 4 data/local/local_lm/data/work data/local/local_lm/data/wordlist_4_train-2_ted-1.pocolm
train_lm.py: Skip getting word counts
train_lm.py: Skip getting unigram weights
train_lm.py: Skip generating vocab
train_lm.py: Preparing int data... log in data/local/local_lm/data/work/log/wordlist/prepare_int_data.log
train_lm.py: Getting ngram counts... log in data/local/local_lm/data/work/log/wordlist_4_train-2_ted-1/get_counts.log
get_counts.py: extending min-counts from 2.0,2.0 to 2.0,2.0 since ngram order is 4
get_counts.py: extending min-counts from 1.0,1.0 to 1.0,1.0 since ngram order is 4
validate_vocab.py: validated file data/local/local_lm/data/work/int_wordlist/words.txt with 31 entries.
train_lm.py: Subsetting counts dir... log in data/local/local_lm/data/work/log/wordlist_4_train-2_ted-1/subset_count_dir.log
train_lm.py: Optimizing metaparameters for warm-start... log in data/local/local_lm/data/work/log/wordlist_4_train-2_ted-1/optimize_metaparameters_warm_start.log
train_lm.py: command optimize_metaparameters.py --cleanup=true --progress-tolerance=1.0e-05 --num-splits=5 data/local/local_lm/data/work/counts_wordlist_4_train-2_ted-1_subset10 data/local/local_lm/data/work/optimize_wordlist_4_train-2_ted-1_subset10 exited with status 1, output is in data/local/local_lm/data/work/log/wordlist_4_train-2_ted-1/optimize_metaparameters_warm_start.log
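The word-to-character conversion described at the top of this comment can be sketched as follows. This is a minimal illustration of the preprocessing being described, not code from the recipe; the `@` word-boundary marker and the function names are my own:

```python
def word_to_chars(word, boundary="@"):
    """Split one word into space-separated characters followed by a
    boundary marker, e.g. 'word' -> 'w o r d @'."""
    return " ".join(list(word) + [boundary])

def line_to_chars(line, boundary="@"):
    """Apply the per-word conversion to every word in a text line."""
    return " ".join(word_to_chars(w, boundary) for w in line.split())
```

Running the whole corpus through `line_to_chars` produces the character-level training text, and the resulting vocabulary is just the letters plus the special symbols.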

optimize_metaparameters_warm_start.log:

optimize_metaparameters.py --cleanup=true --progress-tolerance=1.0e-05 --num-splits=5 data/local/local_lm/data/work/counts_wordlist_4_train-2_ted-1_subset10 data/local/local_lm/data/work/optimize_wordlist_4_train-2_ted-1_subset10

running at Sat Nov 28 17:57:40 2020

/home/cc4651/kaldi-trunk/egs/tedlium/s5_r3/../../../tools/pocolm/scripts/optimize_metaparameters.py --cleanup=true --progress-tolerance=1.0e-05 --num-splits=5 data/local/local_lm/data/work/counts_wordlist_4_train-2_ted-1_subset10 data/local/local_lm/data/work/optimize_wordlist_4_train-2_ted-1_subset10
validate_vocab.py: validated file data/local/local_lm/data/work/counts_wordlist_4_train-2_ted-1_subset10/words.txt with 31 entries.
validate_count_dir.py: validated counts directory data/local/local_lm/data/work/counts_wordlist_4_train-2_ted-1_subset10
validate_vocab.py: validated file data/local/local_lm/data/work/counts_wordlist_4_train-2_ted-1_subset10/words.txt with 31 entries.
validate_count_dir.py: validated counts directory data/local/local_lm/data/work/counts_wordlist_4_train-2_ted-1_subset10
/home/cc4651/kaldi-trunk/egs/tedlium/s5_r3/../../../tools/pocolm/scripts/split_count_dir.sh: creating split counts in data/local/local_lm/data/work/counts_wordlist_4_train-2_ted-1_subset10/split5
split-int-counts: processed 948 LM states, with the counts for each output respectively as: 0 2185 1560 446 0
split-int-counts: processed 0 LM states, with the counts for each output respectively as: 0 0 0 0 0
split-int-counts: processed 3 LM states, with the counts for each output respectively as: 0 12 13 4 0
split-int-counts: processed 945 LM states, with the counts for each output respectively as: 0 2173 1547 442 0
split-int-counts: processed 0 LM states, with the counts for each output respectively as: 0 0 0 0 0
split-int-counts: processed 4 LM states, with the counts for each output respectively as: 0 27 24 7 0
split-int-counts: processed 1454 LM states, with the counts for each output respectively as: 0 4546 4080 1392 0
split-int-counts: processed 3 LM states, with the counts for each output respectively as: 0 9 13 25 0
split-int-counts: processed 90 LM states, with the counts for each output respectively as: 0 732 704 547 0
split-int-counts: processed 2057 LM states, with the counts for each output respectively as: 0 10580 9942 4894 0
validate_vocab.py: validated file data/local/local_lm/data/work/counts_wordlist_4_train-2_ted-1_subset10/split5/1/words.txt with 31 entries.
validate_count_dir.py: validated counts directory data/local/local_lm/data/work/counts_wordlist_4_train-2_ted-1_subset10/split5/1
/home/cc4651/kaldi-trunk/egs/tedlium/s5_r3/../../../tools/pocolm/scripts/split_count_dir.sh: Success
optimize_metaparameters.py: running command 'get_objf_and_derivs_split.py --num-splits=5 --cleanup=true --derivs-out=data/local/local_lm/data/work/optimize_wordlist_4_train-2_ted-1_subset10/0.derivs data/local/local_lm/data/work/counts_wordlist_4_train-2_ted-1_subset10 data/local/local_lm/data/work/optimize_wordlist_4_train-2_ted-1_subset10/0.metaparams data/local/local_lm/data/work/optimize_wordlist_4_train-2_ted-1_subset10/0.objf data/local/local_lm/data/work/optimize_wordlist_4_train-2_ted-1_subset10/work', log in data/local/local_lm/data/work/optimize_wordlist_4_train-2_ted-1_subset10/0.log
optimize_metaparameters.py: command get_objf_and_derivs_split.py --num-splits=5 --cleanup=true --derivs-out=data/local/local_lm/data/work/optimize_wordlist_4_train-2_ted-1_subset10/0.derivs data/local/local_lm/data/work/counts_wordlist_4_train-2_ted-1_subset10 data/local/local_lm/data/work/optimize_wordlist_4_train-2_ted-1_subset10/0.metaparams data/local/local_lm/data/work/optimize_wordlist_4_train-2_ted-1_subset10/0.objf data/local/local_lm/data/work/optimize_wordlist_4_train-2_ted-1_subset10/work exited with status 1, output is in data/local/local_lm/data/work/optimize_wordlist_4_train-2_ted-1_subset10/0.log

exited with return code 1 after 2.8 seconds

0.log :

get_objf_and_derivs_split.py --num-splits=5 --cleanup=true --derivs-out=data/local/local_lm/data/work/optimize_wordlist_4_train-2_ted-1_subset10/0.derivs data/local/local_lm/data/work/counts_wordlist_4_train-2_ted-1_subset10 data/local/local_lm/data/work/optimize_wordlist_4_train-2_ted-1_subset10/0.metaparams data/local/local_lm/data/work/optimize_wordlist_4_train-2_ted-1_subset10/0.objf data/local/local_lm/data/work/optimize_wordlist_4_train-2_ted-1_subset10/work

running at Sat Nov 28 17:57:42 2020

validate_vocab.py: validated file data/local/local_lm/data/work/counts_wordlist_4_train-2_ted-1_subset10/words.txt with 31 entries.
validate_count_dir.py: validated counts directory data/local/local_lm/data/work/counts_wordlist_4_train-2_ted-1_subset10
validate_vocab.py: validated file data/local/local_lm/data/work/counts_wordlist_4_train-2_ted-1_subset10/split5/1/words.txt with 31 entries.
validate_count_dir.py: validated counts directory data/local/local_lm/data/work/counts_wordlist_4_train-2_ted-1_subset10/split5/1
get_objf_and_derivs_split.py: command discount-counts 0.8 0.4 0.2 0.1 data/local/local_lm/data/work/optimize_wordlist_4_train-2_ted-1_subset10/work/split5/1/merged.4 data/local/local_lm/data/work/optimize_wordlist_4_train-2_ted-1_subset10/work/split5/1/float.4 data/local/local_lm/data/work/optimize_wordlist_4_train-2_ted-1_subset10/work/split5/1/discount.3 exited with status 1, output is in data/local/local_lm/data/work/optimize_wordlist_4_train-2_ted-1_subset10/work/log/discount_counts.1.4.log

exited with return code 1 after 0.2 seconds

discount_counts.1.4.log:

discount-counts 0.8 0.4 0.2 0.1 data/local/local_lm/data/work/optimize_wordlist_4_train-2_ted-1_subset10/work/split5/1/merged.4 data/local/local_lm/data/work/optimize_wordlist_4_train-2_ted-1_subset10/work/split5/1/float.4 data/local/local_lm/data/work/optimize_wordlist_4_train-2_ted-1_subset10/work/split5/1/discount.3

running at Sat Nov 28 17:57:43 2020

discount-counts: processed no data

exited with return code 1 after 0.0 seconds

What I find weird is that under data/local/local_lm/data/work/work_counts, I found non-empty files, e.g. dev.counts:

dev.counts: 13 $ 4695 ' 1 & 1 + 34 1 58 0 18 3 35 2 15 5 6 4 11 7 3 6 12 9 9 8 3 = 31 < 31 > 181063 @ 15 [ 16 ] 1 ^ 62291 a 20230 c 11400 b 91399 e 27488 d 16547 g 14811 f 55215 i 40121 h 6391 k 1292 j 17746 m 30384 l 58995 o 52408 n 602 q 13542 p 46512 s 40603 r 22666 u 75748 t 17919 w 7797 v 16723 y 1405 x 731 z 8636
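One possible reading of the dump above is a flat sequence of alternating count/symbol pairs. Under that assumption (which is only a guess about how the file was flattened when pasted here), it can be turned back into a per-symbol table like this:

```python
def parse_counts(tokens):
    """Parse a flat token sequence assumed to alternate
    'count symbol count symbol ...' into a {symbol: count} dict.
    The pair ordering is an assumption about the pasted dump,
    not a documented pocolm format."""
    it = iter(tokens)
    return {sym: int(cnt) for cnt, sym in zip(it, it)}
```

For example, `parse_counts("13 $ 4695 '".split())` would give `{"$": 13, "'": 4695}`, which at least shows the counts file itself contains real data.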

danpovey commented 3 years ago

I recommend using a different tool, e.g. SRILM or kaldi_lm. pocolm wasn't designed for this type of data, and I'm afraid it would be too much work right now for me to get it working in this case.
