alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

Updating the language model - error #1145

Closed makdatascientist closed 2 years ago

makdatascientist commented 2 years ago
root@MAK:~/kaldi/tools/model# farcompilestrings --fst_type=compact --symbols=words.txt --keep_symbols text.txt | ngramcount | ngrammake | fstconvert --fst_type=ngram > Gr.new.fst
ERROR: ConvertSymbolToLabel: Symbol "117835" is not mapped to any integer label, symbol table = words.txt
FATAL: FarCompileStrings: Compiling string number 1 in file text.txt failed with token_type = symbol and entry_type = line
FATAL: STListReader: Error reading file: stdin
ERROR: FstHeader::Read: Bad FST header: standard input
ERROR: FstHeader::Read: Bad FST header: standard input
makdatascientist commented 2 years ago

what should be the format of text.txt file, please reply

nshmyrev commented 2 years ago

> what should be the format of text.txt file, please reply

just a text file

makdatascientist commented 2 years ago

Is it possible to add an abbreviation like aace to text.txt if it is not present in words.txt?

nshmyrev commented 2 years ago

Yes, see https://alphacephei.com/vosk/lm

nshmyrev commented 2 years ago

Please edit your post to format it properly.

nshmyrev commented 2 years ago

please format your post properly with code tags (angle brackets)

makdatascientist commented 2 years ago

> please format your post properly with code tags (angle brackets)

To give some background: I'm using the model vosk-small-en-in-0.4. Since it is a small model, I'm updating only the graph, using the steps below:

1. export KALDI_ROOT=`pwd`/kaldi
2. git clone https://github.com/kaldi-asr/kaldi
3. cd kaldi/tools
4. make
5. extras/install_opengrm.sh
6. export PATH=$KALDI_ROOT/tools/openfst-1.7.2/bin:$PATH
7. export LD_LIBRARY_PATH=$KALDI_ROOT/tools/openfst-1.7.2/lib/fst
8. cd model
9. fstsymbols --save_osymbols=words.txt Gr.fst > /dev/null
10. farcompilestrings --fst_type=compact --symbols=words.txt --keep_symbols text.txt |
    ngramcount | ngrammake |
    fstconvert --fst_type=ngram > Gr.new.fst
11. mv Gr.new.fst Gr.fst

But when I add a new word to text.txt, for example an abbreviation (either "aabe" or "aabe [unk]"), I get the error below:


ERROR: ConvertSymbolToLabel: Symbol "aabe" is not mapped to any integer label, symbol table = words.txt
FATAL: FarCompileStrings: Compiling string number 1 in file text.txt failed with token_type = symbol and entry_type = line
FATAL: STListReader: Error reading file: stdin
ERROR: FstHeader::Read: Bad FST header: standard input
ERROR: FstHeader::Read: Bad FST header: standard input

kindly suggest.

makdatascientist commented 2 years ago

Hi Nickolay V. Shmyrev, kindly reply to my previous query.

Thanks

makdatascientist commented 2 years ago

> please format your post properly with code tags (angle brackets)

I have done the formatting, kindly check and reply.

nshmyrev commented 2 years ago

The approach you are trying can only modify the probabilities of existing words; you cannot add new words this way. To add new words you need a model compilation package (as described on the lm page linked above).
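Before running farcompilestrings it can help to list the words in text.txt that are missing from words.txt, since each of them would trigger the ConvertSymbolToLabel error shown above. The following is only a sketch: the function name find_oov is made up for illustration, and it assumes that words.txt has one symbol per line followed by its integer id (the layout written by fstsymbols --save_osymbols) and that text.txt contains whitespace-separated words.

```shell
# Hypothetical helper (not part of Kaldi, OpenFst or Vosk).
# Prints each distinct word of the corpus that has no entry in the symbol table.
# Usage: find_oov words.txt text.txt
find_oov() {
    awk '
        # First file: remember every symbol (first column of the symbol table).
        FILENAME == ARGV[1] { vocab[$1] = 1; next }
        # Second file: report each unknown word once, in order of first appearance.
        {
            for (i = 1; i <= NF; i++)
                if (!($i in vocab) && !seen[$i]++)
                    print $i
        }
    ' "$1" "$2"
}
```

If find_oov words.txt text.txt prints nothing, every word is already in the graph vocabulary and the Gr.fst update can proceed; any word it prints needs the model compilation package instead.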

makdatascientist commented 2 years ago

Thanks a lot for your reply, really appreciated. To use the model compilation package, are the steps below enough? Kindly reply.

Graph compilation

For performance, all the models are compiled into more compact structures - FST graphs. If you want to modify them - add new words or adapt to a domain - you run several steps of graph compilation.

Not every Vosk model allows vocabulary modification of the graph. Some like US English, big Russian or German include all necessary files (“tree” file from the model which contains information about phoneme context dependency). Some don’t have required files, you need to contact Alphacephei to get access to them.

Hardware

Compilation is not very slow, but still requires significant hardware - a Linux server with at least 32 GB of RAM and 100 GB of disk space. It is unlikely you can compile a big model in a virtual machine. Small models require less data.

Software

The following software must be pre-installed on a server:

- Kaldi
- SRILM
- Phonetisaurus (with pip3 install phonetisaurus)

In the future we might provide a docker for model compilation, for now you have to compile it yourself.

Update process

Download the update package, for example:

Russian - https://alphacephei.com/vosk/models/vosk-model-ru-0.22-compile.zip

US English - https://alphacephei.com/vosk/models/vosk-model-en-us-0.22-compile.zip

German - https://alphacephei.com/vosk/models/vosk-model-de-0.21-compile.zip

French - https://alphacephei.com/vosk/models/vosk-model-fr-0.6-linto-2.2.0-compile.zip

Other language packs are available on request. Please contact us at contact@alphacephei.com

1. Unpack and properly point to KALDI_ROOT in the path.sh script.
2. Add your extra texts into db/extra.txt.
3. Optionally add manual words phones into db/extra.dic.
4. Run compile-graph.sh. Update takes about 15 minutes. Watch errors in the process.
5. Run decode.sh to test decoding works successfully. Watch the WER in the decoding folder.
6. Optionally, check that the g2p properly predicted the phonemes in the end of data/dict/lexicon.txt. If needed, update g2p model with new words.
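The update steps above could be scripted roughly as follows. This is a sketch under assumptions, not the package's own tooling: the function name update_graph is invented, it assumes path.sh has already been edited by hand to point to KALDI_ROOT, that new sentences should be appended to db/extra.txt, and that the two scripts can be run with bash from the package directory.

```shell
# Hypothetical wrapper around the compile package scripts named in the steps above.
# $1 = unpacked compile package directory, $2 = file with extra sentences
update_graph() {
    pkg=$1
    extra=$2
    # Refuse to run if the scripts named in the instructions are not there.
    for f in path.sh compile-graph.sh decode.sh; do
        [ -f "$pkg/$f" ] || { echo "missing $pkg/$f" >&2; return 1; }
    done
    # Add the extra texts (appending is an assumption; the page only says "add").
    mkdir -p "$pkg/db"
    cat "$extra" >> "$pkg/db/extra.txt"
    # Rebuild the graph, then test decoding; stop at the first failure.
    (cd "$pkg" && bash compile-graph.sh && bash decode.sh)
}
```

It would be called as update_graph path/to/unpacked-package my-sentences.txt. Checking the output for errors and looking at the WER in the decoding folder still has to be done by hand afterwards.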

Outputs

Depending on your needs you might pick some result files from the compilation folder. Remember that if you changed the graph you also need to change the rescoring/rnnlm part, otherwise they will go out of sync and accuracy will be low.

For large model pick the following parts:

- exp/chain/tdnn/graph
- data/lang_test_rescore/G.fst and data/lang_test_rescore/G.carpa into rescore folder
- exp/rnnlm_out into rnnlm folder, you can delete some unnecessary files from rnnlm too. If you don’t want to use RNNLM, delete rnnlm folder from the model.

If you don’t want to use rescoring, delete the rescore folder from the model, that will save you some runtime memory, but accuracy will be lower.

For small model, just pick the required files from exp/chain/tdnn/lgraph.