alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

Vosk vs Kaldi #448

Closed: EsakkiSundar closed this issue 3 years ago

EsakkiSundar commented 3 years ago

I would like to understand the difference between Vosk and Kaldi. When should we use Kaldi over Vosk, and vice versa? If someone could share their thoughts or point me to an article, it would be a great help.

sskorol commented 3 years ago

Vosk uses Kaldi internally. You can look at the sources to see how: https://github.com/alphacep/vosk-api/blob/master/src/kaldi_recognizer.cc

Tortoise17 commented 3 years ago

As far as I understand, the VAD here comes mainly from Kaldi, whereas with DeepSpeech it comes from Google's WebRTC VAD.

EsakkiSundar commented 3 years ago

Thanks a lot for your quick responses.

EsakkiSundar commented 3 years ago

One more question: I'm curious about the main reasons (the USP) why Vosk was created. What benefits does Vosk plan to provide that Kaldi fails to deliver?

nshmyrev commented 3 years ago

> One more question: I'm curious about the main reasons (the USP) why Vosk was created. What benefits does Vosk plan to provide that Kaldi fails to deliver?

The answer is on the front page. Kaldi provides none of these:

Vosk is an offline open source speech recognition toolkit. It enables speech recognition models for 17 languages and dialects - English, Indian English, German, French, Spanish, Portuguese, Chinese, Russian, Turkish, Vietnamese, Italian, Dutch, Catalan, Arabic, Greek, Farsi, Filipino.

Vosk models are small (50 Mb) but provide continuous large vocabulary transcription, zero-latency response with streaming API, reconfigurable vocabulary and speaker identification.

Speech recognition bindings implemented for various programming languages like Python, Java, Node.JS, C#, C++ and others.

Vosk supplies speech recognition for chatbots, smart home appliances, virtual assistants. It can also create subtitles for movies, transcription for lectures and interviews.

Vosk scales from small devices like Raspberry Pi or Android smartphone to big clusters.
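
To make the "streaming API" and "reconfigurable vocabulary" points above more concrete, here is a minimal sketch using the Python bindings. It is an illustration and not taken from the thread. It assumes the `vosk` package is installed, that `model_dir` points to an unpacked Vosk model, and that the input is a mono 16-bit PCM WAV file. The helper names (`iter_frames`, `collect_text`, `transcribe`) and the `phrases` parameter are made up for this example.

```python
import json
import wave


def iter_frames(wf, frames_per_chunk=4000):
    """Yield raw PCM chunks from an open wave.Wave_read until it is exhausted."""
    while True:
        data = wf.readframes(frames_per_chunk)
        if not data:
            break
        yield data


def collect_text(result_jsons):
    """Join the "text" fields of Vosk result JSON strings, skipping empty ones."""
    texts = (json.loads(r).get("text", "") for r in result_jsons)
    return " ".join(t for t in texts if t)


def transcribe(wav_path, model_dir, phrases=None):
    """Feed a WAV file to Vosk chunk by chunk and return the recognized text.

    `phrases` is an optional list of allowed phrases (hypothetical parameter
    name); it is passed to the recognizer as a JSON grammar string.
    """
    # Imported lazily so the helpers above work without vosk installed.
    from vosk import Model, KaldiRecognizer

    model = Model(model_dir)
    with wave.open(wav_path, "rb") as wf:
        if phrases:
            # Restrict the vocabulary; "[unk]" catches everything else.
            grammar = json.dumps(list(phrases) + ["[unk]"])
            rec = KaldiRecognizer(model, wf.getframerate(), grammar)
        else:
            rec = KaldiRecognizer(model, wf.getframerate())

        results = []
        for chunk in iter_frames(wf):
            # AcceptWaveform returns true when an utterance has ended.
            if rec.AcceptWaveform(chunk):
                results.append(rec.Result())
        # Flush whatever is still buffered in the recognizer.
        results.append(rec.FinalResult())
    return collect_text(results)
```

A call might look like `transcribe("test.wav", "model", phrases=["yes", "no"])`, where both paths are placeholders. For live captions, `rec.PartialResult()` can be polled between chunks to get the in-progress hypothesis.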