alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0
7.7k stars 1.08k forks

speaker recognition #327

Closed Tortoise17 closed 3 years ago

Tortoise17 commented 3 years ago

This is one of the greatest libraries I have seen. Thank you for such a great effort.

I have a few questions.

I am using CentOS.

1) How can I get speaker recognition together with the transcription?
2) Can one transcribe live audio? I am interested in live audio transcription as well.
3) Is it possible to customize and train the engine on my own? Could you give a short documentation link on how to do that?
4) Are this library and the provided models under the Apache 2.0 license? Can they be used freely accordingly?

Please guide and help.

I wish you a nice Christmas and a happy New Year. Stay healthy and safe from Corona.

nshmyrev commented 3 years ago

How can I get speaker recognition together with the transcription?

You can find an example here: https://github.com/alphacep/vosk-api/blob/master/python/example/test_speaker.py

Can one transcribe live audio? I am interested in live audio transcription as well.

Yes
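For context, live transcription is just a loop that feeds small PCM chunks to the recognizer as they arrive (the repository also ships a microphone example alongside test_speaker.py). Below is a minimal sketch of that loop, written against any recognizer-like object; in real use `rec` would be a `vosk.KaldiRecognizer` and the chunks would come from a capture library such as `sounddevice` — those parts are assumptions, not shown:

```python
import json

def transcribe_stream(rec, chunks):
    """Feed successive byte chunks to a Vosk-style recognizer and
    collect the text of each finalized utterance."""
    texts = []
    for chunk in chunks:
        # AcceptWaveform returns True when an utterance boundary is reached
        if rec.AcceptWaveform(chunk):
            texts.append(json.loads(rec.Result()).get("text", ""))
    # flush whatever audio is still buffered
    texts.append(json.loads(rec.FinalResult()).get("text", ""))
    return [t for t in texts if t]
```

In real use the chunks would be 16-bit mono PCM at the model's sample rate, read from a microphone callback queue.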

Is it possible to customize and train the engine on my own? Could you give a short documentation link on how to do that?

Yes, you can train the model and adapt it to your needs. See:

https://alphacephei.com/vosk/adaptation
https://alphacephei.com/vosk/models#training-your-own-model

Are this library and the provided models under the Apache 2.0 license? Can they be used freely accordingly?

The library is Apache 2.0. The models have different licenses, mostly Apache 2.0 too; details are on each model's download page. Yes, you can use them freely.

Tortoise17 commented 3 years ago

Your library is the best thing I learned in 2020, I must admit before this year ends. Thank you so much again. Stay healthy!

nshmyrev commented 3 years ago

You are welcome, let me know if you have further questions.

Tortoise17 commented 3 years ago

I found out that your training uses a Kaldi-based trainer with nnet3, not TensorFlow? I must admit your work is several steps ahead of Google, Facebook, YouTube, Twitter, and many other researchers. I would like to join your community too. Awesome.

Tortoise17 commented 3 years ago

I have tried the speaker recognition. The output I get is some cosine-based distances and the text, while I was expecting the speaker number with timing (start, end, speaker number, text), or at least the speaker number with the text, which I could export to JSON. I just tried your example with the simple command: python test_speaker.py audio/input.wav. Maybe I am approaching it the wrong way. Can you guide me a bit on whether something like this is possible, or on the proper way to use this?

nshmyrev commented 3 years ago

You record samples for each speaker and get vectors from them by simply running the recognition.

Then you record a sample for identification, get its vector from the recognizer, and compute distances to the target speakers; the one with the closest distance is the result.
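That enrollment-and-scoring step can be sketched in a few lines (the vectors below are toy 3-dimensional values and the speaker names are placeholders; real `spk` vectors from the recognizer have 128 dimensions):

```python
import numpy as np

def cosine_dist(x, y):
    """Cosine distance between two x-vectors (0 = identical direction)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def identify(enrolled, probe):
    """Return the enrolled speaker whose x-vector is closest to the probe."""
    return min(enrolled, key=lambda name: cosine_dist(enrolled[name], probe))

# Toy enrollment: one stored x-vector per known speaker
enrolled = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
print(identify(enrolled, [0.9, 0.1, 0.0]))  # closest to alice's vector
```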

Tortoise17 commented 3 years ago

Thank you so much. That I understand completely. Is there any way to get this information as pretty JSON output, so I can use these values? Also, I am facing a problem with lines 49, 50, 56 and 57:

#        print ("X-vector:", res['spk'])
#        print ("Speaker distance:", cosine_dist(spk_sig, res['spk']), "based on", res['spk_frames'], "frames")
#print ("X-vector:", res['spk'])
#print ("Speaker distance:", cosine_dist(spk_sig, res['spk']), "based on", res['spk_frames'], "frames")

At the moment I am commenting them out, but maybe they carry actual info which I am not getting as output. Could you guide me on what to do with this? I am using the current version of Vosk and the currently available models.

nshmyrev commented 3 years ago

Is there any way to get this information as pretty JSON output, so I can use these values?

Results are in JSON format.
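Concretely, each call to `rec.Result()` returns a JSON string that the standard library can parse (the sample string below is illustrative, with a truncated x-vector; a real `spk` vector has 128 values):

```python
import json

# Illustrative result string; real output comes from rec.Result()
# and its "spk" vector has 128 entries, not 3.
raw = '{"text": "hello world", "spk": [0.1, -0.2, 0.3], "spk_frames": 462}'

res = json.loads(raw)
if "spk" in res:  # the vector may be absent on short audio
    print("Text:", res["text"])
    print("X-vector length:", len(res["spk"]))
    print("Frames:", res["spk_frames"])
```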

Also, I am facing a problem with lines 49, 50, 56 and 57

What problem exactly?

Tortoise17 commented 3 years ago

When I run the command from your example: $ python test_speaker.py out.json

Speaker distance: 0.8933986160396366 based on 462 frames
Text: 
Traceback (most recent call last):
  File "test_speaker.py", line 49, in <module>
    print ("X-vector:", res['spk'])
KeyError: 'spk'

These are the last lines, and it crashes there. Maybe I am using the command wrongly? vosk=0.3.15, spk-model=0.4

nshmyrev commented 3 years ago

If the audio is short, the speaker vector might be missing; you can add a check:

if 'spk' in res:
    print ("X-vector:", res['spk'])

Tortoise17 commented 3 years ago

It would be kind if you could suggest its limitations, because I don't know how short the vocals can be for it to still identify the speaker. I guess it should be at least 4 seconds? I am new to this framework.

nshmyrev commented 3 years ago

The longer the better. The minimum is below 1 second; 4 seconds is reasonable.

Tortoise17 commented 3 years ago

It would be kind if you could suggest the lines to add, so I could test. Then at least I can run the trial until a successful JSON export; later I would readjust.

nshmyrev commented 3 years ago

It would be kind if you could suggest the lines to add, so I could test. Then at least I can run the trial until a successful JSON export; later I would readjust.

Sorry, what do you need?

Tortoise17 commented 3 years ago

Actually, I want to get rid of this error and get the JSON export with speaker identification. Could you tell me which lines need to be added to the example for this?

nshmyrev commented 3 years ago

See the updated example here:

https://github.com/alphacep/vosk-api/blob/master/python/example/test_speaker.py

Tortoise17 commented 3 years ago

Again, a big thank you. The current problem is resolved, thank you! But the JSON is not exported. Is this the correct way to use it? $ python test_speaker.py audio/input.wav result_out.json

sskorol commented 3 years ago

@Tortoise17 the current example doesn't export anything; it just prints results to the console. But saving res to a JSON file is one of the most common programming tasks and is not related to speaker recognition. I believe you can easily implement it yourself in 5 minutes.
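As a sketch of that export (each dict would come from json.loads(rec.Result()) inside the recognition loop; the filename and the toy result are placeholders):

```python
import json

def export_results(results, path):
    """Write a list of result dicts (parsed from rec.Result()) to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)

# Toy example; in real use the list is built up during recognition.
export_results([{"text": "hello", "spk_frames": 462}], "result_out.json")
```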

Tortoise17 commented 3 years ago

Thank you, and bundles of thanks for such great work. I wish you nice holiday celebrations and a happy New Year; may your new year bring lots of happiness to you. Stay healthy.

Tortoise17 commented 3 years ago

Is it possible to get the start time with the speaker distance value, and/or the duration? I am unable to understand what other information this model provides besides the distances and text. I tried to see what info is in rec.Result() but failed.

nshmyrev commented 3 years ago

Is it possible to get the start time with the speaker distance value, and/or the duration? I am unable to understand what other information this model provides besides the distances and text.

The result in JSON format contains timing together with the speaker vector. What problem exactly do you have parsing it?
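For reference, with word timing enabled via rec.SetWords(True), the final result contains a "result" list with per-word start/end times alongside the "spk" vector. A sketch of pulling a segment's start and end from such a result (the JSON string below is illustrative, with a truncated x-vector):

```python
import json

# Illustrative final-result string; in real use enable word timing with
# rec.SetWords(True) and read this from rec.Result().
raw = '''{"result": [{"word": "hello", "start": 0.33, "end": 0.72, "conf": 1.0},
                     {"word": "world", "start": 0.72, "end": 1.10, "conf": 0.98}],
          "text": "hello world",
          "spk": [0.1, -0.2, 0.3]}'''

res = json.loads(raw)
# Segment boundaries: first word's start, last word's end
segment = (res["result"][0]["start"], res["result"][-1]["end"])
print("Segment:", segment, "text:", res["text"])
```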

milind-soni commented 2 years ago

@Tortoise17 Were you able to figure out how to generate a transcript with speaker labels?

russele7 commented 2 years ago

@nshmyrev Dear Nickolay, I am trying your script test_speaker.py on an audio file with several voices, and I found that X-vectors are calculated for segments which include different voices. Is it possible to calculate an X-vector for each recognized word? If yes, could you please provide some example or guide? (I am using your Russian model vosk-model-small-ru-0.15.)

Thank you in advance.

nshmyrev commented 2 years ago

Is it possible to calculate an X-vector for each recognized word?

No; the recommended length for reliable x-vector extraction is 10 seconds.

russele7 commented 2 years ago

@nshmyrev Nickolay, thank you for the answer. Could you please clarify: what if we take a short recording with one word (duration, for example, 1 second) and repeat it to a duration of 10 seconds (so we play the original recording 10 times)? Is it correct to do that? Will the accuracy of the X-vector calculation be preserved?

nshmyrev commented 2 years ago

Is it correct to do that?

No

balaji7363 commented 2 years ago

How can I identify who the speaker is?

Please help me. Thanks in advance.

brvier commented 1 year ago

So is there any way to identify the speaker when there are multiple people speaking (an interview, for example)?

russele7 commented 1 year ago

@brvier, @balaji7363, hello. The Vosk recognizer has an option to calculate, together with speech recognition, 128 additional coefficients for each phrase of audio. These coefficients are properties of the voice that speaks the phrase, so each voice in the audio should have a specific combination of these 128 coefficients. You can apply a machine learning clustering algorithm to separate the phrases of each voice. If you know the number of speakers, you can try k-means clustering; otherwise you can try DBSCAN. These are the most popular and easiest clustering algorithms, and you can look for more complex and more accurate methods.
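A minimal sketch of the k-means route using plain NumPy (toy 2-dimensional points stand in for the 128-dimensional `spk` vectors; in practice `sklearn.cluster.KMeans` or `DBSCAN` would be the usual choice, and the farthest-point initialization here is only to keep the sketch deterministic):

```python
import numpy as np

def kmeans(vectors, k, iters=50):
    """Cluster x-vectors into k speakers; returns one label per vector."""
    X = np.asarray(vectors, float)
    # Farthest-point initialization: start at the first vector, then
    # repeatedly pick the vector farthest from all chosen centers.
    centers = [X[0]]
    for _ in range(1, k):
        d = ((X[:, None] - np.array(centers)) ** 2).sum(-1).min(axis=1)
        centers.append(X[int(np.argmax(d))])
    centers = np.array(centers)
    for _ in range(iters):
        # Assign each vector to its nearest center...
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        # ...then move each center to the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Two well-separated toy "speakers":
vecs = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
labels = kmeans(vecs, 2)
print(labels)  # first two points share one label, last two the other
```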

omar5477 commented 2 months ago

2. transcribe with live audio?

Hello, I am trying to implement real-time transcription with Vosk speaker identification, which determines who is speaker 1 or 2. My application is running, but the speaker identification is not accurate. Can you please help me?