There is some mistakes in align_lexicon.txt for french model 0.22

alphacep / vosk-server

WebSocket, gRPC and WebRTC speech recognition server based on Vosk and Kaldi libraries

Apache License 2.0


There is some mistakes in align_lexicon.txt for french model 0.22 #198

Open warichet opened 1 year ago

warichet commented 1 year ago

Hello, we are testing the Vosk ASR platform in French. We chose the model vosk-model-fr-0.22 because of its Apache 2.0 license. The G2P used to generate align_lexicon.txt has some bugs. One of them concerns the handling of acronyms. In French, the normal way to pronounce c.... is "K", except in acronyms, where it is "S E", because in an acronym we spell out each letter.
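To illustrate the acronym rule described above, here is a minimal sketch that builds a pronunciation by concatenating letter-name pronunciations. Only the entry for "C" ("S E") comes from this report. The other table entries, the phoneme symbols, and the names `LETTER_NAMES` and `spell_acronym` are assumptions made for the example and would have to be replaced with the model's actual phone set.

```python
# Hypothetical letter-name table. Only "C" -> "S E" is taken from the report;
# the remaining entries and symbols are placeholders, not the model's phone set.
LETTER_NAMES = {
    "B": ["B", "E"],
    "C": ["S", "E"],
    "D": ["D", "E"],
}


def spell_acronym(word, letter_names=LETTER_NAMES):
    """Return the phonemes of an acronym by spelling it letter by letter."""
    phonemes = []
    for letter in word.upper():
        if letter not in letter_names:
            # Fail loudly rather than guess a pronunciation.
            raise KeyError(f"no letter-name pronunciation for {letter!r}")
        phonemes.extend(letter_names[letter])
    return phonemes


if __name__ == "__main__":
    print(" ".join(spell_acronym("CD")))
```

A corrected lexicon entry for an acronym could then be generated from such a table instead of from the regular G2P output.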

To continue our tests, we need to fine-tune the LM with the corrected lexicon. Is it possible to have the ARPA file to test our corrections?

In return, of course, we will provide you with the new lexicon.

Best regards, Seb

nshmyrev commented 1 year ago

Is it possible to have the ARPA file to test our corrections?

Sure, you have to write an email to contact@alphacephei.com and describe your project to get a link to the French model update package.

warichet commented 1 year ago

Hello,

We will start checking and correcting the model. To make this easier, we chose to use the French dataset from Lingua Libre, a very complete annotated dataset. We will generate a 1-gram graph (aka a model without graph) so that only the acoustic model and the lexicon are checked. We will then pass the audio files to Vosk and look at the words whose transcription differs from the annotation. With this result we can find words with bad phonemes, and possibly phonemes the AM has difficulty with. We will do this in a notebook. If you are interested, we can share the notebook.
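The comparison step could look roughly like the sketch below. It assumes the transcriptions have already been produced by Vosk and collected as (annotation, transcription) pairs. The function names `normalize` and `find_mismatches` and the sample data are hypothetical, not part of Vosk or Lingua Libre.

```python
from collections import Counter


def normalize(text):
    """Lowercase the text and split it into words."""
    return text.lower().strip().split()


def find_mismatches(pairs):
    """Count annotated words that are missing from the transcription.

    pairs: iterable of (annotation, transcription) strings, one per audio file.
    """
    mismatched = Counter()
    for annotation, transcription in pairs:
        recognized = set(normalize(transcription))
        for word in normalize(annotation):
            if word not in recognized:
                mismatched[word] += 1
    return mismatched


if __name__ == "__main__":
    # Made-up sample pairs, for illustration only.
    sample = [("cd", "kd"), ("bonjour", "bonjour"), ("CD", "cd")]
    for word, count in find_mismatches(sample).most_common():
        print(word, count)
```

Words that show up often in the result are candidates for a lexicon check. Words that fail only occasionally may instead point to phonemes the acoustic model has trouble with.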

Best regards, Sébastien

nshmyrev commented 1 year ago

Sure, that would be nice.