alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

Using cosine distance for speaker identification #99

Closed: sravanco closed this issue 4 years ago

sravanco commented 4 years ago

I'm just getting started with ML, so please excuse me if this is a noob question.

I checked the Python example test_speaker.py, where one speaker's signature is taken (generated randomly?) and the cosine distance between that signature and the one returned with the recognized text is calculated. But I don't understand how this cosine distance can be used to tell whether the next utterance comes from the same speaker or from a new person.

In my case the persons/actors are new per audio file.
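
For readers following along, here is a minimal sketch of the cosine distance the question refers to, written in plain Python. The function names `cosine_distance` and `is_same_speaker` and the `threshold` parameter are illustrative assumptions, not part of the Vosk API, and no threshold value is suggested here because the thread does not establish one.

```python
import math


def cosine_distance(x, y):
    """Return 1 - cos(angle between x and y).

    0 means the vectors point the same way, 1 means they are orthogonal,
    2 means they point in opposite directions.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_x * norm_y)


def is_same_speaker(reference, candidate, threshold):
    """Hypothetical decision rule: accept when the distance is below a
    threshold that has to be tuned on your own recordings."""
    return cosine_distance(reference, candidate) < threshold
```

A smaller distance means the two signatures are more alike, so the decision comes down to comparing each new utterance's vector against a stored reference vector for a known speaker.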

nshmyrev commented 4 years ago

> In my case the persons/actors are new per audio file.

Our code expects that the speaker is already known. You can get a speaker's signature by recognizing a sample recording of that speaker.
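
The enrollment step described above could be sketched roughly as follows. This assumes a recent Vosk Python package in which `KaldiRecognizer` has a `SetSpkModel` method and results carry an `"spk"` vector, as in the test_speaker.py example; older releases passed the speaker model to the constructor instead. The helper names, the path arguments, and the choice to average the per-utterance vectors are assumptions made for illustration, not something the thread prescribes.

```python
import json
import wave


def average_signature(vectors):
    """Element-wise mean of equally sized speaker vectors (an assumed way
    to combine several utterances into one reference signature)."""
    if not vectors:
        raise ValueError("no speaker vectors were produced")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("speaker vectors must all have the same length")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def enroll_speaker(wav_path, model_path, spk_model_path):
    """Recognize a sample recording of one known speaker and return a
    reference signature. All three paths are placeholders."""
    # Imported lazily so average_signature works without vosk installed.
    from vosk import Model, KaldiRecognizer, SpkModel

    model = Model(model_path)
    spk_model = SpkModel(spk_model_path)
    vectors = []
    with wave.open(wav_path, "rb") as wf:
        rec = KaldiRecognizer(model, wf.getframerate())
        rec.SetSpkModel(spk_model)
        while True:
            data = wf.readframes(4000)
            if len(data) == 0:
                break
            if rec.AcceptWaveform(data):
                res = json.loads(rec.Result())
                if "spk" in res:
                    vectors.append(res["spk"])
        # The last utterance is only returned by FinalResult().
        res = json.loads(rec.FinalResult())
        if "spk" in res:
            vectors.append(res["spk"])
    return average_signature(vectors)
```

The returned list can then be stored and compared, by cosine distance, against the `"spk"` vectors produced for later recordings.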

dpny518 commented 4 years ago

Do you have the Kaldi recipe for training the speaker model? Is it xvector or ivector?

nshmyrev commented 4 years ago

> Do you have the Kaldi recipe for training the speaker model? Is it xvector or ivector?

You can take the sid xvector recipe.

nshmyrev commented 4 years ago

Same as https://github.com/alphacep/vosk-api/issues/135, this should work.

peterkronenberg commented 3 years ago

@nshmyrev I have been playing some more with the speaker identification and I think I'm starting to understand it now. What values of cosine distance should be used to assume it is the same speaker? If I have a bunch of speaker signatures in a database, for example, I'd have to go through each one, compute the dot product and the norms, and then compare the cosine distances. I hope this operation isn't too expensive. Is there any shortcut for comparing a bunch of signatures? It looks like the norm can be computed in advance and saved in the database, but that's about it.
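
One way to realize the precomputation mentioned above is to store each signature already scaled to unit length, so that every comparison reduces to a single dot product. The sketch below uses plain Python and an in-memory dict standing in for the database; `normalize`, `build_index`, and `closest_speaker` are hypothetical names chosen for this example.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


def build_index(signatures):
    """Map each speaker name to its unit-length signature.

    This is done once, when signatures are saved."""
    return {name: normalize(vec) for name, vec in signatures.items()}


def closest_speaker(index, query):
    """Return (name, cosine distance) of the stored signature nearest to query."""
    q = normalize(query)
    best_name, best_dist = None, float("inf")
    for name, unit in index.items():
        # Both vectors are unit length, so the dot product is the cosine.
        dist = 1.0 - sum(a * b for a, b in zip(unit, q))
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name, best_dist
```

The cost is still linear in the number of stored signatures, but each comparison is one dot product. With NumPy, the unit vectors can be stacked into a matrix so that all distances come from a single matrix-vector product.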

milind-soni commented 3 years ago

@peterkronenberg can you please tell me how you calculated the speaker signatures?

peterkronenberg commented 3 years ago

I never really figured out how to get it to work reliably. You can find code online for calculating the cosine distance. The problem is that no matter what threshold I used, I couldn't reliably tell if it was a different speaker or not. I didn't get any responses to the other issues I opened. I'm sure I'll have to get back to this some day.

milind-soni commented 3 years ago

> I never really figured out how to get it to work reliably. You can find code online for calculating the cosine distance. The problem is that no matter what threshold I used, I couldn't reliably tell if it was a different speaker or not. I didn't get any responses to the other issues I opened. I'm sure I'll have to get back to this some day.

Oh well, I found Resemblyzer to be useful, but I can't seem to get it running on longer files.

peterkronenberg commented 3 years ago

Does that package calculate a more accurate vector? If you do make any progress, please post here. Would love to hear it

milind-soni commented 3 years ago

Yeah, Resemblyzer calculates vectors and even labels them automatically, but it crashes on lengthy files. This is my attempt at batching long files and diarizing every segment. I'm working on combining it with the Vosk API to generate speaker-separated transcriptions. Any help improving this would be appreciated.