KoljaB / RealtimeSTT

A robust, efficient, low-latency speech-to-text library with advanced voice activity detection, wake word activation and instant transcription.

new fast models #18

Closed francqz31 closed 10 months ago

francqz31 commented 10 months ago

There are new fast STT models from NVIDIA that claim to be better than Whisper v3 on the Open ASR Leaderboard here: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard. Some of them, the Parakeet family, are really, really fast with almost the same accuracy as the large Whisper models. Even this one, https://huggingface.co/spaces/nvidia/parakeet-rnnt-1.1b, is way faster than Whisper v3!
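
For anyone who wants to try it quickly, here is a minimal sketch of transcribing a file with Parakeet through NVIDIA's NeMo toolkit (the install step and the `audio.wav` path are my assumptions, not something verified in this thread):

```python
# Minimal sketch: offline transcription with Parakeet via NVIDIA NeMo.
# Assumes: pip install "nemo_toolkit[asr]" and a 16 kHz mono WAV file (assumed path).
import nemo.collections.asr as nemo_asr

# Downloads the checkpoint from Hugging Face on first use.
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-rnnt-1.1b")

# transcribe() takes a list of audio file paths; the exact return shape
# differs between NeMo versions, so just print whatever comes back.
result = asr_model.transcribe(["audio.wav"])
print(result)
```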

francqz31 commented 10 months ago

I also know that Coqui shut down. There is a really new TTS model here, https://github.com/PolyAI-LDN/pheme, that claims to be really fast too. If both Pheme and Parakeet got integrated into https://github.com/KoljaB/LocalAIVoiceChat, I believe it would be a super boost: better performance with faster speed. You can also apply some tricks to them to make them even faster!!

KoljaB commented 10 months ago

The NVIDIA STT looks very promising. The word error rate is better than Whisper's, and if it's even faster it's for sure a great candidate. I hope it handles all languages well, not only English. I think it currently doesn't scale down to low-VRAM systems, though; Whisper offers a tiny model...

Pheme looks good, but tbh so do a lot of engines currently. For pure speed, StyleTTS2 for example is a really great engine: 6-7x faster than XTTS.

francqz31 commented 10 months ago

Ok, got it 👍 I just wanted to notify you. There is also a really new MIT-licensed model that claims to be better than Mistral 7B, so it should mostly be compatible with Zephyr! It is only 2.7B parameters, so I bet it will be really fast: https://huggingface.co/microsoft/phi-2. You might want to integrate it into LocalAIVoiceChat for better speed while holding the same accuracy!
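
If anyone wants to benchmark it, here is a minimal sketch of running phi-2 with Hugging Face transformers (the prompt and generation settings are placeholders I made up):

```python
# Minimal sketch: text generation with microsoft/phi-2 via transformers.
# Assumes: pip install transformers torch, and a CUDA GPU (drop device_map for CPU).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    torch_dtype=torch.float16,  # half precision keeps the 2.7B model small in VRAM
    device_map="auto",
)

inputs = tokenizer("Explain voice activity detection in one sentence.",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```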

francqz31 commented 10 months ago

I will close the issue now.

tictproject commented 2 months ago

@KoljaB Are you planning to try any of these? It would be awesome to get faster results, since even with CUDA I receive the fullSentence event only after 3-4 seconds for three sentences of text, which is not ideal.
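
In the meantime, here is a minimal sketch of how I'm measuring that latency while trying a smaller faster-whisper model via the recorder's `model` parameter (the `tiny.en` choice and the timing approach are my own assumptions, not a confirmed fix):

```python
# Minimal sketch: rough latency measurement with a smaller Whisper model.
# Assumes RealtimeSTT's documented AudioToTextRecorder API; "tiny.en" trades
# accuracy for speed compared to the larger checkpoints.
import time
from RealtimeSTT import AudioToTextRecorder

def on_text(text):
    # The elapsed time includes waiting for speech to start and end,
    # so treat it as a rough upper bound on transcription latency.
    print(f"[{time.time() - start:.2f}s] {text}")

if __name__ == "__main__":
    recorder = AudioToTextRecorder(model="tiny.en", language="en")
    while True:
        start = time.time()
        recorder.text(on_text)
```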