alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

About recognized text based on HCLr.fst, Gr.fst #1661

Open donaldos opened 6 days ago

donaldos commented 6 days ago

Dear Nickolay V. Shmyrev

I have been using the vosk API to generate and recognise dynamic grammars successfully. In particular, I have been testing extensively with a customised engine configuration, generating HCLr.fst and Gr.fst with compile-graph.sh. The model I used was based on the final.mdl file of vosk-model-small-en-us-0.15. To upgrade the acoustic model, I used the final.mdl of vosk-model-en-us-0.22 to generate new HCLr.fst and Gr.fst files.

```
INFO (2024-11-20 09:25:06,800:main): [EnumaeduSREngineFile.py:57] - proc_sr() - ['sit down please', '[unk]']
LOG (VoskAPI:UpdateGrammarFst():recognizer.cc:287) ['sit down please', '[unk]']
LOG (VoskAPI:Estimate():language_model.cc:142) Estimating language model with ngram-order=2, discount=0.5
LOG (VoskAPI:OutputToFst():language_model.cc:209) Created language model with 5 states and 10 arcs.
```

As the log shows, when I utter the phrase 'sit down please', the engine recognises only `[{'start': 0.570133, 'end': 1.409742, 'word': 'down', 'conf': 1.0}]`. However, if I use the full vosk-model-en-us-0.22-lgraph resource from the examples, I do not encounter any problems.
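For context, the grammar string seen in the log is just a JSON list of phrases. A minimal sketch of how it is built and how the word-timing result is read back (the model path, sample rate, and `rec` variable are placeholders; the actual vosk calls are shown only in comments):

```python
import json

# Phrase list from the log above; "[unk]" gives the recognizer an
# out-of-grammar escape so unknown speech is not forced onto a phrase.
phrases = ["sit down please", "[unk]"]
grammar = json.dumps(phrases)

# With vosk installed, this string is passed at construction time:
#     rec = KaldiRecognizer(Model("model-dir"), 16000, grammar)
# or swapped at runtime with rec.SetGrammar(grammar), which is what
# produces the UpdateGrammarFst line in the log.

# The recognizer returns JSON; a result shaped like the one reported:
result = json.loads(
    '{"result": [{"start": 0.570133, "end": 1.409742,'
    ' "word": "down", "conf": 1.0}], "text": "down"}'
)
words = [w["word"] for w in result["result"]]
print(words)  # only "down" was recognised out of "sit down please"
```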

What could be the cause of this, and is there a methodology to validate it?

Also, where do I need to copy the resources from: "exp>tdnn>lgraph" or "exp>tdnn>lgraph_orig"?
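As one possible sanity check (a sketch, not an official vosk tool): every word in the runtime grammar must appear in the graph's words.txt, since the custom HCLr.fst/Gr.fst can only emit symbols from that table, and a missing word could explain dropped words like "sit" and "please". A hypothetical helper:

```python
def load_words(words_txt_path):
    """Parse a Kaldi words.txt symbol table: one '<word> <id>' pair per line."""
    vocab = set()
    with open(words_txt_path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if parts:
                vocab.add(parts[0])
    return vocab


def oov_words(phrases, vocab):
    """Return grammar words missing from the graph vocabulary."""
    missing = set()
    for phrase in phrases:
        if phrase == "[unk]":
            continue  # handled specially by the recognizer
        for word in phrase.split():
            if word not in vocab:
                missing.add(word)
    return missing
```

Running `oov_words(["sit down please", "[unk]"], load_words("graph/words.txt"))` against the custom graph would show whether any phrase word is outside the compiled vocabulary.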

donaldos commented 6 days ago

Directory structure

```
.
├── am
│   ├── final.mdl
│   └── tree
├── conf
│   ├── mfcc.conf
│   └── model.conf
├── graph
│   ├── Gr.fst
│   ├── HCLr.fst
│   ├── disambig_tid.int
│   ├── phones
│   │   ├── align_lexicon.int
│   │   ├── align_lexicon.txt
│   │   ├── disambig.int
│   │   ├── disambig.txt
│   │   ├── optional_silence.csl
│   │   ├── optional_silence.int
│   │   ├── optional_silence.txt
│   │   ├── silence.csl
│   │   ├── word_boundary.int
│   │   └── word_boundary.txt
│   ├── phones.txt
│   └── words.txt
└── ivector
    ├── final.dubm
    ├── final.ie
    ├── final.mat
    ├── global_cmvn.stats
    ├── online_cmvn.conf
    └── splice.conf
```

nshmyrev commented 5 days ago

Probably a pronunciation issue. Please provide an audio sample.

donaldos commented 4 days ago

Can I forward speech data and models via email? nshmyrev@gmail.com

nshmyrev commented 4 days ago

Sure