Documentation of speaker identification/diarization capabilities

peterkronenberg commented 3 years ago

Is there any additional documentation or description of the Python test files? Some are pretty obvious. Some are not.

Transcript_scp.py is in a separate directory, python/test. What does that one do? There doesn't seem to be any sample input file it can use.

And then there are test_srt and test_speaker. Not sure what test_srt is doing. Test_speaker looks like it’s a way to identify who is talking. Is there any additional documentation on that? Is it simply estimating relative distances of the speaker? There’s a big hard-coded array in there. Not sure if that's something I would need.

nshmyrev commented 3 years ago

Not sure what test_srt is doing

It creates an SRT file (a common file format for subtitles).
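For readers unfamiliar with the format, here is a minimal sketch of what writing SRT output involves. It is not the code from test_srt; the function names `srt_timestamp` and `to_srt` and the `(start, end, text)` cue layout are assumptions made for illustration. Each SRT cue is a running index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, and the subtitle text, with a blank line between cues.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(cues):
    """Render cues as SRT text.

    cues is a list of (start_seconds, end_seconds, text) tuples;
    this layout is hypothetical, chosen only for the example.
    """
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        time_range = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
        blocks.append(f"{index}\n{time_range}\n{text}\n")
    # Cues are separated by a blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([(0.0, 1.25, "hello"), (1.5, 3.0, "world")]))
```

In practice the start and end times would come from the word timings returned by the recognizer, grouped into subtitle-sized chunks.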

Test_speaker looks like it’s a way to identify who is talking. Is there any additional documentation on that?

There are a few closed issues that explain the details.

In general we might have some more detailed docs a bit later.

peterkronenberg commented 3 years ago

I looked at all the closed issues, but there doesn't seem to be any straightforward guide on how to get started and what everything does. Anything you can provide here, until additional documentation is available, would be appreciated. Some comments in the code would also help.

peterkronenberg commented 3 years ago

For example, what does this string represent? spk_sig = [-1.110417,0.09703002,1.35658,0.7798632,-0.305457,-0.339204,0.6186931,-0.4521213,0.3982236,-0.004530723,0.7651616,0.6500852,-0.6664245,0.1361499,0.1358056,-0.2887807,-0.1280468,-0.8208137,-1.620276,-0.4628615,0.7870904,-0.105754,0.9739769,-0.3258137,-0.7322628,-0.6212429,-0.5531687,-0.7796484,0.7035915,1.056094,-0.4941756,-0.6521456,-0.2238328,-0.003737517,0.2165709,1.200186,-0.7737719,0.492015,1.16058,0.6135428,-0.7183084,0.3153541,0.3458071,-1.418189,-0.9624157,0.4168292,-1.627305,0.2742135,-0.6166027,0.1962581,-0.6406527,0.4372789,-0.4296024,0.4898657,-0.9531326,-0.2945702,0.7879696,-1.517101,-0.9344181,-0.5049928,-0.005040941,-0.4637912,0.8223695,-1.079849,0.8871287,-0.9732434,-0.5548235,1.879138,-1.452064,-0.1975368,1.55047,0.5941782,-0.52897,1.368219,0.6782904,1.202505,-0.9256122,-0.9718158,-0.9570228,-0.5563112,-1.19049,-1.167985,2.606804,-2.261825,0.01340385,0.2526799,-1.125458,-1.575991,-0.363153,0.3270262,1.485984,-1.769565,1.541829,0.7293826,0.1743717,-0.4759418,1.523451,-2.487134,-1.824067,-0.626367,0.7448186,-1.425648,0.3524166,-0.9903384,3.339342,0.4563958,-0.2876643,1.521635,0.9508078,-0.1398541,0.3867955,-0.7550205,0.6568405,0.09419366,-1.583935,1.306094,-0.3501927,0.1794427,-0.3768163,0.9683866,-0.2442541,-1.696921,-1.8056,-0.6803037,-1.842043,0.3069353,0.9070363,-0.486526]

I don't necessarily want all the mathematics behind it, but is this just a hard-coded thing that we need? Or is this the signature for a specific speaker that we're looking for?

What do the numbers in the resultant x-vector represent? Does the speaker distance refer to a confidence factor? How close should 2 distances be before we assume it's the same speaker?
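One way to picture these questions is the sketch below. It assumes that an array like spk_sig is a reference x-vector for one enrolled speaker and that x-vectors are compared by cosine distance; the thread does not confirm either point. The names `cosine_distance` and `identify`, the toy two-dimensional vectors, and the threshold value are all hypothetical. Under these assumptions the distance is not a confidence score: 0 means the two vectors point the same way, and larger values mean the voices are less alike. A usable threshold would have to be tuned on real recordings.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity: 0 for same direction, up to 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def identify(xvector, enrolled, threshold):
    """Return the name of the closest enrolled speaker, or None.

    enrolled maps a speaker name to a reference x-vector (hypothetical layout).
    threshold is an assumed cutoff; it is not a value taken from the library.
    """
    best_name, best_dist = None, None
    for name, signature in enrolled.items():
        dist = cosine_distance(xvector, signature)
        if best_dist is None or dist < best_dist:
            best_name, best_dist = name, dist
    if best_dist is not None and best_dist <= threshold:
        return best_name
    return None


if __name__ == "__main__":
    # Toy 2-dimensional vectors; real x-vectors are much longer.
    enrolled = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
    print(identify([0.9, 0.1], enrolled, threshold=0.3))
```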

peterkronenberg commented 3 years ago

Please see https://github.com/alphacep/vosk-api/issues/428

itzsimpl commented 3 years ago

I'm playing with the added support for speaker recognition and I have a couple of questions.

Can one assume the provided model is language independent?
Could you please explain a little what spk_frames means and how it should be interpreted (e.g. translated to time in seconds)? Is this the detected voice activity, based on which the x-vectors are computed?
Any hints or ideas on how one could use the speaker x-vectors to do just speaker change detection, i.e. not having a pre-populated database with known x-vectors? Would doing a cosine distance to the previous utterance's x-vector suffice (i.e. speaker changed if distance is above a threshold, otherwise assume the same speaker and add the x-vector to the average)?

Thanks for the good work.
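The speaker change detection idea raised in the last question above can be sketched as follows. This is only an illustration of the proposal as worded in the question, not a method confirmed by the maintainers: compare each new utterance's x-vector with a running average for the current speaker, report a change if the cosine distance exceeds a threshold, and otherwise fold the vector into the average. The class name `SpeakerChangeDetector` and the threshold value are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


class SpeakerChangeDetector:
    """Flags a speaker change when an utterance drifts from the running mean."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed cutoff; needs tuning on real data
        self.mean = None            # running mean x-vector of the current speaker
        self.count = 0              # utterances folded into the mean

    def update(self, xvector):
        """Return True if xvector is judged to start a new speaker turn."""
        if self.mean is None:
            # First utterance: nothing to compare against yet.
            self.mean, self.count = list(xvector), 1
            return False
        if cosine_distance(self.mean, xvector) > self.threshold:
            # Too far from the current speaker: start a new running mean.
            self.mean, self.count = list(xvector), 1
            return True
        # Same speaker: incremental update of the mean.
        self.count += 1
        self.mean = [m + (x - m) / self.count for m, x in zip(self.mean, xvector)]
        return False


if __name__ == "__main__":
    detector = SpeakerChangeDetector(threshold=0.5)
    for vec in ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 1.0]):
        print(detector.update(vec))
```

A detector like this only marks turn boundaries. It does not recognize a speaker who returns later; that would require keeping the earlier averages and comparing against all of them.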

mlcatinit commented 1 year ago

@peterkronenberg Have you figured out how to interpret the output?

nshmyrev commented 1 year ago

Android demo here:

https://github.com/virex-84/VoskIdentification

alphacep / vosk-api

Documentation of speaker identification/diarization capabilities #405