In my case the persons/actors are new per audio file.
Our code expects that the speaker is already known. You can get a speaker signature by recognizing a sample recording of that speaker.
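A minimal sketch of what that looks like with the Python bindings, along the lines of test_speaker.py; the model paths and the sample file name are placeholders:

```python
import json
import wave

from vosk import Model, SpkModel, KaldiRecognizer

model = Model("model")              # path to the acoustic model (placeholder)
spk_model = SpkModel("model-spk")   # path to the speaker model (placeholder)

wf = wave.open("speaker_sample.wav", "rb")  # mono 16 kHz PCM recording of the speaker
rec = KaldiRecognizer(model, wf.getframerate(), spk_model)

signature = None
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        res = json.loads(rec.Result())
        signature = res.get("spk", signature)   # "spk" is the speaker x-vector

res = json.loads(rec.FinalResult())
signature = res.get("spk", signature)
print(signature)  # list of floats to store as this speaker's signature
```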
Do you have the Kaldi recipe for training the speaker model? Is it xvector or ivector?
You can take the sid xvector recipe.
Same as https://github.com/alphacep/vosk-api/issues/135, this should work.
@nshmyrev Have been playing some more with the speaker identification and I think I'm starting to understand it now. What value of cosine distance should be used to assume it is the same speaker? If I have a bunch of speaker signatures in a database, for example, I'd have to go through each one, compute the dot product and norm for each, and then compare the cosine distances. I hope this operation isn't too expensive. Is there any shortcut for comparing a bunch of signatures? Looks like the norm value can be computed in advance and saved in the database, but that's about it.
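For concreteness, this is the kind of comparison I have in mind, with the stored signatures stacked into one matrix so a single matrix-vector product gives all the distances at once; the random vectors are just placeholders for signatures loaded from the database:

```python
import numpy as np

# Placeholder 128-dim signatures standing in for vectors loaded from the database.
rng = np.random.default_rng(0)
names = ["alice", "bob", "carol"]
signatures = rng.normal(size=(len(names), 128))

# Norms can be computed once when a signature is saved and stored alongside it.
norms = np.linalg.norm(signatures, axis=1)

def cosine_distances(query):
    """Cosine distance from one query vector to every stored signature."""
    query = np.asarray(query)
    sims = signatures @ query / (norms * np.linalg.norm(query))
    return 1.0 - sims

query = rng.normal(size=128)
dists = cosine_distances(query)
print(names[int(np.argmin(dists))], dists.round(3))
```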
@peterkronenberg can you please tell me how you calculated the speaker signatures?
I never really figured out how to get it to work reliably. You can find code online for calculating the cosine distance. The problem is that no matter what threshold I used, I couldn't reliably tell if it was a different speaker or not. I didn't get any responses to the other issues I opened. I'm sure I'll have to get back to this some day.
Oh well, I found Resemblyzer to be useful, but I can't seem to get it running for longer files.
Does that package calculate a more accurate vector? If you do make any progress, please post here. Would love to hear about it.
Yeah, Resemblyzer calculates the vectors and even labels them automatically, but it crashes for lengthy files. This is my attempt at batching long files and diarizing every segment. I'm working on combining it with the vosk API to generate speaker-separated transcriptions. Any help to improve this will be appreciated.
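In case it's useful, this is roughly what I mean by batching: chop the preprocessed wav into fixed-length chunks and embed each one separately. The 30-second chunk length is an arbitrary choice and the file path is a placeholder:

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

SAMPLE_RATE = 16000      # preprocess_wav resamples everything to 16 kHz
CHUNK_SECONDS = 30       # arbitrary chunk length to keep memory bounded

encoder = VoiceEncoder()
wav = preprocess_wav("long_recording.wav")   # placeholder path

chunk = SAMPLE_RATE * CHUNK_SECONDS
embeddings = []
for start in range(0, len(wav), chunk):
    piece = wav[start:start + chunk]
    if len(piece) < SAMPLE_RATE:             # skip fragments shorter than a second
        continue
    embeddings.append(encoder.embed_utterance(piece))

embeddings = np.stack(embeddings)
print(embeddings.shape)                      # one embedding per chunk
```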
I'm just getting started with ML, so please excuse me if this is a noob question.
I checked the python example test_speaker.py, where one speaker's signature is taken (generated randomly?) and the cosine distance between that signature and the one from the recognition result is calculated. But I did not get how this cosine distance can be used to tell whether the next utterance is from the same speaker or a new person.
In my case the persons/actors are new per audio file.
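The best I can come up with for that case is to keep a running list of signatures and enroll a new speaker whenever the smallest cosine distance is above some threshold; the 0.55 below is only a guess that would need tuning:

```python
import numpy as np

THRESHOLD = 0.55   # guessed value, needs tuning on real data

speakers = []      # (label, signature) pairs discovered so far in this file

def cosine_distance(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def assign_speaker(signature):
    """Return an existing label if the signature is close enough, otherwise enroll a new one."""
    if speakers:
        dists = [cosine_distance(signature, s) for _, s in speakers]
        best = int(np.argmin(dists))
        if dists[best] < THRESHOLD:
            return speakers[best][0]
    label = "speaker_{}".format(len(speakers) + 1)
    speakers.append((label, signature))
    return label
```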