alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

Using vosk xvectors for UIS-RNN #275

Open epinay1 opened 3 years ago

epinay1 commented 3 years ago

Hi, I'm trying to use vosk to get embeddings for Google's UIS-RNN. Could you please tell me if the x-Vector generated by vosk is the same as a D-Vector?

If not, is there some way to get a D-Vector from vosk or otherwise?

Thank you.

nshmyrev commented 3 years ago

> Could you please tell me if the x-Vector generated by vosk is the same as a D-Vector?

No, it is different.

> If not, is there some way to get a D-Vector from vosk or otherwise?

You can use xvectors with the uis-rnn algorithm, see https://arxiv.org/pdf/1911.01266.pdf

The problem is that vosk extracts one vector per utterance; you probably need finer granularity.
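
To illustrate what per-utterance granularity means in practice, here is a minimal Python sketch that collects one x-vector per finished utterance. It assumes a speaker model has been attached to the recognizer (the Vosk speaker sample does this with `SpkModel`), in which case each result JSON carries `spk` and `spk_frames` fields. `collect_xvectors` is a hypothetical helper name, and the JSON strings below are made-up placeholders, not real model output.

```python
import json


def collect_xvectors(result_jsons):
    """Return (text, xvector, n_frames) for every result that carries a speaker vector."""
    out = []
    for raw in result_jsons:
        res = json.loads(raw)
        # "spk" is assumed to appear only when a speaker model produced a vector.
        if "spk" in res:
            out.append((res.get("text", ""), res["spk"], res.get("spk_frames")))
    return out


# Placeholder strings standing in for what rec.Result() would return.
results = [
    '{"text": "hello there", "spk": [0.1, -0.2, 0.3], "spk_frames": 120}',
    '{"text": ""}',
]
vectors = collect_xvectors(results)
```

Each collected vector covers a whole utterance, so a sequence model such as UIS-RNN would see one observation per utterance rather than one per short window.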

epinay1 commented 3 years ago

Thank you for your reply, it was very helpful.

I have another question: is there an optimal or maximum length of data that can be put into vosk at once for speech to text, in order to get the best results? I ask because the audio is fed in parts through stdout rather than as a whole, and I have noticed that the final text changes when the size of the parts changes. If so, is there a way to calculate this length?

nshmyrev commented 3 years ago

> is there an optimal or maximum length of data that can be put into vosk at once for speech to text, in order to get the best results?

It is better to follow the samples in the code and feed about 0.2 seconds at once.
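
As a rough illustration of this advice, the Python sketch below reads a WAV file and feeds a recognizer about 0.2 seconds of audio per call. The helper names (`frames_per_chunk`, `iter_chunks`, `transcribe`) and the path arguments are hypothetical; `Model`, `KaldiRecognizer`, `AcceptWaveform`, `Result` and `FinalResult` are the calls used in the Vosk Python samples. It assumes 16-bit mono PCM input.

```python
import json
import wave


def frames_per_chunk(sample_rate, seconds=0.2):
    """Number of audio frames that cover `seconds` of audio."""
    return round(sample_rate * seconds)


def iter_chunks(wav, seconds=0.2):
    """Yield successive raw PCM chunks of about `seconds` from an open wave reader."""
    n = frames_per_chunk(wav.getframerate(), seconds)
    while True:
        data = wav.readframes(n)
        if not data:
            break
        yield data


def transcribe(wav_path, model_path, seconds=0.2):
    """Feed a WAV file to Vosk in small chunks and return the recognized utterances."""
    # Imported here so the helpers above can be used without Vosk installed.
    from vosk import Model, KaldiRecognizer

    texts = []
    with wave.open(wav_path, "rb") as wav:
        rec = KaldiRecognizer(Model(model_path), wav.getframerate())
        for data in iter_chunks(wav, seconds):
            # True means an utterance ended; collect its result before continuing.
            if rec.AcceptWaveform(data):
                texts.append(json.loads(rec.Result())["text"])
        texts.append(json.loads(rec.FinalResult())["text"])
    return [t for t in texts if t]
```

At 16 kHz, 0.2 seconds corresponds to 3200 frames, or 6400 bytes of 16-bit mono audio per call.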

In the future we will make the results independent of the chunk size, but that will require an API change.

epinay1 commented 3 years ago

I have noticed that I get better results when I put longer segments through; with shorter segments, some words (mostly short ones) get omitted. That is why I was asking whether there is a maximum length that can be put in and, if so, how I can calculate it.

nshmyrev commented 3 years ago

You can try different sizes and see. You need a couple of gigabytes to process 1 hour at once, I suppose.
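
One way to run that experiment is sketched below in Python: the same audio is transcribed several times with different chunk durations so the outputs can be compared side by side. `transcribe_pcm` and `compare_chunk_sizes` are hypothetical helper names, and `make_recognizer` is an assumed factory that returns a fresh recognizer for each run (for example `lambda: KaldiRecognizer(model, 16000)`). The sketch assumes raw 16-bit mono PCM.

```python
import json


def transcribe_pcm(rec, pcm, chunk_bytes):
    """Feed raw PCM to a recognizer in fixed-size chunks and join the recognized text."""
    parts = []
    for start in range(0, len(pcm), chunk_bytes):
        # Collect the utterance result whenever the recognizer reports an endpoint.
        if rec.AcceptWaveform(pcm[start:start + chunk_bytes]):
            parts.append(json.loads(rec.Result())["text"])
    parts.append(json.loads(rec.FinalResult())["text"])
    return " ".join(p for p in parts if p)


def compare_chunk_sizes(make_recognizer, pcm, sample_rate, durations):
    """Transcribe the same audio once per chunk duration (in seconds)."""
    results = {}
    for seconds in durations:
        # 2 bytes per sample for 16-bit mono audio; never go below one sample.
        chunk_bytes = max(2, 2 * round(sample_rate * seconds))
        results[seconds] = transcribe_pcm(make_recognizer(), pcm, chunk_bytes)
    return results
```

Calling `compare_chunk_sizes(make_recognizer, pcm, 16000, [0.2, 1.0, 5.0])` would return a dictionary mapping each duration to its transcript, which makes omitted words easy to spot.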