alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

How to use Vosk Punctuation Model in C# #1302

Open securigy opened 1 year ago

securigy commented 1 year ago

I've been googling and browsing all day long but cannot find how to use Vosk Punctuation models, especially in C#. Is it supported at all? If yes, any example?

I am also looking for an answer to the following question: is it possible to use speaker models without training, that is, based on differences in voice pitch and other audio characteristics?

nshmyrev commented 1 year ago

Not yet. We are working on universal punctuation that can be used from other languages, but it takes time.

For speaker models, yes, you can use a pretrained model. It detects pitch differences and maps them to an x-vector.

securigy commented 1 year ago

Punctuation - got it.

Speaker models - that's a shame, because I do not have a pretrained model. I was hoping that there is a generic model that can detect differences in voice pitch... Making my own is beyond my knowledge and capability at this time...

nshmyrev commented 1 year ago

Making my own is beyond my knowledge and capability at this time...

It is in downloads, see

https://alphacephei.com/vosk/models/vosk-model-spk-0.4.zip

nshmyrev commented 1 year ago

For usage see https://github.com/alphacep/vosk-api/issues/405
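
To make the comparison step concrete, here is a minimal sketch in Python of how two speaker vectors might be compared. It assumes you have already obtained one x-vector per utterance (with a speaker model attached, the recognizer result is expected to carry it as a list of numbers in a `spk` field). The function names `cosine_distance` and `closest_speaker`, the enrolled-speaker dictionary, and the `max_distance=0.5` threshold are all hypothetical choices for illustration, not part of the Vosk API; a usable threshold has to be tuned on your own recordings.

```python
import math


def cosine_distance(x, y):
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("x-vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("x-vectors must not be all zeros")
    return 1.0 - dot / (norm_x * norm_y)


def closest_speaker(xvector, enrolled, max_distance=0.5):
    """Find the enrolled speaker nearest to xvector.

    enrolled maps a speaker name to a reference x-vector (hypothetical
    structure). max_distance is an assumed threshold, not a Vosk default.
    Returns (name, distance); name is None if nobody is close enough.
    """
    best_name, best_dist = None, float("inf")
    for name, reference in enrolled.items():
        dist = cosine_distance(xvector, reference)
        if dist < best_dist:
            best_name, best_dist = name, dist
    if best_dist > max_distance:
        return None, best_dist
    return best_name, best_dist


if __name__ == "__main__":
    # Toy 3-dimensional vectors stand in for real x-vectors.
    enrolled = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
    print(closest_speaker([0.9, 0.1, 0.0], enrolled))
```

A smaller distance means the two voices are more alike. In practice you would store one reference vector per known speaker (for example, averaged over a few utterances) and compare each new utterance against them. The same arithmetic ports directly to C#.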

securigy commented 1 year ago

Well, the model is there, but it is not at all clear how to tell one speaker from another... There is some Python code, but I still have no idea what all the numbers mean or which comparisons need to be made to achieve that. So I have to drop it for now...

BTW, is there any way to delegate work to the GPU? Do I need to detect in code first that I have an adequate GPU, and if so, how?

rehberim360 commented 4 months ago

2 days were wasted. Vosk is really good at transcribing voice to text. But I think speaker recognition is not ready yet. There is neither a proper source nor an example. Everyone has written something from every angle, but it is all empty. I think it is necessary to prepare a detailed document for speaker recognition.