Error while running "compile-graph.sh"

Fahrgast commented 1 year ago

Hey,

when I run "compile-graph.sh" I run into the following error:

ngram-count -wbdiscount -order 4 -text db/extra.txt -lm data/extra.lm.gz
ngram -order 4 -lm db/en-230k-0.5.lm.gz -mix-lm data/extra.lm.gz -lambda 0.95 -write-lm data/en-mix.lm.gz
db/en-230k-0.5.lm.gz: line 186020193: ngram line has 4 fields (6 expected)
format error in lm file db/en-230k-0.5.lm.gz
ngram -order 4 -lm data/en-mix.lm.gz -prune 3e-8 -write-lm data/en-mixp.lm.gz
data/en-mix.lm.gz: No such file or directory
ngram -lm data/en-mixp.lm.gz -write-lm data/en-mix-small.lm.gz
data/en-mixp.lm.gz: No such file or directory
utils/prepare_lang.sh data/dict '[unk]' data/lang_local data/lang
utils/prepare_lang.sh data/dict [unk] data/lang_local data/lang

I'm using the Kaldi Docker image. I also tried it with an empty "extra.txt" and "extra.dic" and still ran into the same error. The rest of the script runs fine until it needs en-mix.lm again.

Any suggestions on how I can fix this?

nshmyrev commented 1 year ago

The gzip file is probably broken somehow. What do you see on line 186020193 of the unpacked db/en-230k-0.5.lm?

Fahrgast commented 1 year ago

How do I open it? VS Code crashes because it's too big :D

I also re-downloaded vosk-model-en-us-0.22-compile and ran compile-graph again with the fresh en-230k-0.5.lm from it. Now I get this error:

db/en-230k-0.5.lm.gz: line 182906226: ngram line has 3 fields (6 expected)
format error in lm file db/en-230k-0.5.lm.gz

Similar, but on a different line.

nshmyrev commented 1 year ago

You can unpack the file and open it with vi.

It probably fails because you are running out of memory or disk space.
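
For a file this large, the line named in the error message can also be printed straight from the compressed file, without unpacking it or opening an editor. The helper below is a minimal sketch; the function name show_lines is made up for illustration, and it assumes the line number in the error counts from the start of the file.

```shell
# Print lines START..END (END defaults to START) of a gzip-compressed text
# file without unpacking it to disk or loading it into an editor.
show_lines() {
    local file="$1" start="$2" end="${3:-$2}"
    # sed prints the requested range and quits right after END,
    # so the rest of the file is never decompressed.
    gzip -dc "$file" | sed -n "${start},${end}p;${end}q"
}

# Example: the line from the error message plus one line on either side.
# show_lines db/en-230k-0.5.lm.gz 186020192 186020194
```

Printing a neighbouring line or two makes it easier to compare the reported line with lines that parsed without complaint.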

Fahrgast commented 1 year ago

This is line 186020193: " -1.040099 through and they'd taken" and this is line 182906226: "-3.110893 diseases of the veins"

They don't look any different from the other lines to me.

nshmyrev commented 1 year ago

How much memory do you have in your system?
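
Both figures can be read from the shell inside the container. This is a rough sketch that assumes a Linux system exposing MemAvailable in /proc/meminfo; the function name report_resources is hypothetical.

```shell
# Report available RAM and free disk space for a directory, in GiB.
report_resources() {
    local dir="${1:-.}"
    # The kernel reports MemAvailable in kB; 1048576 kB = 1 GiB.
    awk '/^MemAvailable:/ {printf "RAM available: %.1f GiB\n", $2 / 1048576}' /proc/meminfo
    # df -Pk prints one POSIX-format line per filesystem in 1 KiB blocks;
    # column 4 is the space still available.
    df -Pk "$dir" | awk -v dir="$dir" 'NR == 2 {printf "Disk free in %s: %.1f GiB\n", dir, $4 / 1048576}'
}

# Example: check the directory the graph is compiled in.
# report_resources data
```

When running under Docker, the numbers seen inside the container may be lower than the host's if a memory limit is configured.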

Fahrgast commented 1 year ago

16 GB of RAM and about 330 GB free on the drive.

nshmyrev commented 1 year ago

16 GB is not much. You can probably prune the big model with ngram -prune before mixing.

Fahrgast commented 1 year ago

Could you please elaborate on that? Should I edit a file like dict.py and add ngram -prune somewhere, or run it from the command line?

nshmyrev commented 1 year ago

ngram -order 4 -lm en-230k-0.5.lm.gz -prune 1e-9 -write-lm en-230k-0.5.smaller.lm.gz
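
One possible way to use the pruned file without editing any script is to swap it into the place of the original. This is only a sketch: it assumes compile-graph.sh reads db/en-230k-0.5.lm.gz, as the log at the top of the thread suggests, and the function name prune_big_lm and the ".full" backup name are made up for illustration.

```shell
# Sketch: prune the large LM, then put the pruned file where the build
# script looks for it. Paths and the ".full" backup name are assumptions.
prune_big_lm() {
    local db="${1:-db}" threshold="${2:-1e-9}"
    local big="$db/en-230k-0.5.lm.gz"
    local small="$db/en-230k-0.5.smaller.lm.gz"

    # Same SRILM call as suggested above; stop if it fails so the
    # original model is never replaced by a partial file.
    ngram -order 4 -lm "$big" -prune "$threshold" -write-lm "$small" || return 1

    # Keep the original under a different name, then swap in the pruned model.
    mv "$big" "$db/en-230k-0.5.full.lm.gz" && mv "$small" "$big"
}

# Example, run from the model directory:
# prune_big_lm db 1e-9
```

Keeping the full model under a separate name makes it easy to go back if the pruned model turns out to be too small.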

Fahrgast commented 1 year ago

Running

ngram -order 4 -lm en-230k-0.5.lm.gz -prune 1e-9 -write-lm en-230k-0.5.smaller.lm.gz

sadly produces this error:

en-230k-0.5.lm.gz: line 182906226: ngram line has 3 fields (6 expected)
format error in lm file en-230k-0.5.lm.gz

nshmyrev commented 1 year ago

Hm, maybe the file is broken. Maybe you can download it again.

nshmyrev commented 1 year ago

Also check the last line of the file; that offset 182906226 might be relative. And here is the md5sum:

md5sum en-230k-0.5.lm.gz 
60f3292075dbb3407ed4bd1df8d5cf28  en-230k-0.5.lm.gz
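
The comparison can be scripted so that a truncated or corrupted download is caught in one step. A minimal sketch, assuming GNU md5sum is available; the function name verify_lm is hypothetical.

```shell
# Check a downloaded model file: gzip integrity first, then the MD5 checksum.
verify_lm() {
    local file="$1" expected="$2"
    # gzip -t decompresses to nowhere and fails on a truncated or corrupt archive.
    gzip -t "$file" || { echo "gzip test failed: $file" >&2; return 1; }
    # md5sum -c reads "<hash>  <file>" lines and compares them with the file on disk.
    echo "$expected  $file" | md5sum -c -
}

# Example, using the checksum posted above:
# verify_lm en-230k-0.5.lm.gz 60f3292075dbb3407ed4bd1df8d5cf28
```

If both checks pass, the file on disk matches the published one, which points away from a bad download and towards the machine itself.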

Fahrgast commented 1 year ago

The md5sum is identical.

By coincidence I had to buy two new RAM modules because one broke, so I now have 24 GB of RAM. I just ran compile-graph again and this time it got all the way to line 43302737 before hitting the same error as before. So I guess it really is a memory issue.

I then tried again after pruning the model first, as you suggested. That ran fine until I got this error and it stopped:

utils/map_arpa_lm.pl: Processing "\data\"
utils/map_arpa_lm.pl: Processing "\1-grams:\"
utils/map_arpa_lm.pl: Warning: OOV line -5.564116       'a      -0.003554899
utils/map_arpa_lm.pl: Warning: OOV line -6.14543        'all    -0.1315524
utils/map_arpa_lm.pl: Warning: OOV line -7.045512       'am     -0.2532192
utils/map_arpa_lm.pl: Warning: OOV line -8.18103        'amour  -0.02055213
utils/map_arpa_lm.pl: Warning: OOV line -8.207732       'angelo -0.008941075
utils/map_arpa_lm.pl: Warning: OOV line -8.016088       'apercois
utils/map_arpa_lm.pl: Warning: OOV line -8.249146       'aquila
utils/map_arpa_lm.pl: Warning: OOV line -8.25028        'arche  -0.0179535
utils/map_arpa_lm.pl: Warning: OOV line -7.926421       'brian  -0.00683306
utils/map_arpa_lm.pl: Warning: OOV line -7.930167       'cuse   -0.02823315
utils/map_arpa_lm.pl: Warning: OOV line -8.30919        'dour   -0.05702549
utils/map_arpa_lm.pl: Warning: OOV line -3.860634       'em     -0.2778654
utils/map_arpa_lm.pl: Warning: OOV line -8.264342       'espace -0.08082926
utils/map_arpa_lm.pl: Warning: OOV line -8.071111       'est    -0.02308908
utils/map_arpa_lm.pl: Warning: OOV line -8.303288       'grady  -0.07864463
utils/map_arpa_lm.pl: Warning: OOV line -5.932466       'in     -0.1896701
utils/map_arpa_lm.pl: Warning: OOV line -8.024973       'ites
utils/map_arpa_lm.pl: Warning: OOV line -8.258896       'ivoire -0.06746501
utils/map_arpa_lm.pl: Warning: OOV line -8.236024       'lin    -0.06277915
LOG (arpa-to-const-arpa[5.5.0~1-ae8c]:Read():arpa-file-parser.cc:94) Reading \data\ section.
LOG (arpa-to-const-arpa[5.5.0~1-ae8c]:Read():arpa-file-parser.cc:149) Reading \1-grams: section.
utils/map_arpa_lm.pl: Processing "\2-grams:\"
LOG (arpa-to-const-arpa[5.5.0~1-ae8c]:Read():arpa-file-parser.cc:149) Reading \2-grams: section.
utils/map_arpa_lm.pl: Processing "\3-grams:\"
LOG (arpa-to-const-arpa[5.5.0~1-ae8c]:Read():arpa-file-parser.cc:149) Reading \3-grams: section.
utils/map_arpa_lm.pl: Processing "\4-grams:\"
LOG (arpa-to-const-arpa[5.5.0~1-ae8c]:Read():arpa-file-parser.cc:149) Reading \4-grams: section.
utils/map_arpa_lm.pl: 1645653 lines of the Arpa file contained OOVs and were not printed.
+ rnnlm/change_vocab.sh data/lang/words.txt exp/rnnlm exp/rnnlm_out
rnnlm/change_vocab.sh: Copying config directory.
rnnlm/change_vocab.sh: Re-generating words.txt, unigram_probs.txt, word_feats.txt and word_embedding.final.mat.
rnnlm/get_word_features.py: made features for 312358 words.
rnnlm-get-word-embedding: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory
+ utils/mkgraph_lookahead.sh --self-loop-scale 1.0 data/lang exp/chain/tdnn data/en-mix-small.lm.gz exp/chain/tdnn/lgraph
utils/mkgraph_lookahead.sh : compiling grammar data/en-mix-small.lm.gz
tree-info exp/chain/tdnn/tree
tree-info exp/chain/tdnn/tree
fstdeterminizestar data/lang/L_disambig.fst
fstcomposecontext --context-size=2 --central-position=1 --read-disambig-syms=data/lang/phones/disambig.int --write-disambig-syms=exp/chain/tdnn/lgraph/disambig_ilabels_2_1.int exp/chain/tdnn/lgraph/ilabels_2_1.342 exp/chain/tdnn/lgraph/L_disambig_det.fst
make-h-transducer --disambig-syms-out=exp/chain/tdnn/lgraph/disambig_tid.int --transition-scale=1.0 exp/chain/tdnn/lgraph/ilabels_2_1 exp/chain/tdnn/tree exp/chain/tdnn/final.mdl
fstdeterminizestar
add-self-loops --disambig-syms=exp/chain/tdnn/lgraph/disambig_tid.int --self-loop-scale=1.0 --reorder=true exp/chain/tdnn/final.mdl
apply_map.pl: warning! missing key 0 in exp/chain/tdnn/lgraph/relabel
apply_map.pl: warning! missing key 312356 in exp/chain/tdnn/lgraph/relabel

Maybe I could try removing a couple of thousand lines from en-230k-0.5.lm.gz instead of pruning? Or would that destroy the accuracy?

nshmyrev commented 1 year ago

The last log doesn't contain any errors, and the warnings are expected. You can use the new model.

nshmyrev commented 1 year ago

Except for the rnnlm part, where you have an issue with the CUDA library path.
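
The libcuda.so.1 message comes from the dynamic loader, which usually means the binary was linked against CUDA and the NVIDIA driver library is not visible inside the container. A small sketch for confirming which libraries are unresolved, assuming ldd is available; the function name missing_libs is hypothetical.

```shell
# List shared libraries that a binary needs but the dynamic loader cannot find.
missing_libs() {
    local bin
    # Resolve the command name to a full path; give up if it is not on PATH.
    bin=$(command -v "$1") || { echo "not on PATH: $1" >&2; return 2; }
    # ldd marks unresolved dependencies with "not found"; print their names only.
    ldd "$bin" | awk '/not found/ {print $1}'
}

# Example: in the situation from the log this would be expected to list libcuda.so.1.
# missing_libs rnnlm-get-word-embedding
```

An empty result means every dependency was resolved, so any remaining failure has a different cause.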

Fahrgast commented 1 year ago

I finally got it to work by changing the order from 4 to 2, because for some reason the process got stuck while "reading 3-grams". The model might not be perfect, but it is good enough for my purpose.

So to summarize: I think the problem was RAM-related. I would guess at least 20 GB is needed.

Thank you so much for your help nshmyrev!

alphacep / vosk-api

Error while running "compile-graph.sh" #1267