alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

Documentation of speaker identification/diarization capabilities #405

Open peterkronenberg opened 3 years ago

peterkronenberg commented 3 years ago

Is there any additional documentation or description of the Python test files? Some are pretty obvious. Some are not.

Transcript_scp.py is in a separate directory, python/test. What does that one do? There doesn't seem to be a sample input file it can use.

And then there is test_srt and test_speaker. Not sure what test_srt is doing. Test_speaker looks like it’s a way to identify who is talking. Is there any additional documentation on that? Is it simply estimating relative distances between speakers? There’s a big hard-coded array in there. Not sure if that's something I would need.

nshmyrev commented 3 years ago

Not sure what test_srt is doing

It creates an SRT file (a common file format for subtitles).
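For readers unfamiliar with the format, here is a minimal sketch of what writing SRT cues looks like. It is not the code from test_srt; the helper names `srt_timestamp` and `to_srt` and the sample cues are hypothetical, and the cue timings are assumed to be in seconds.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues):
    """Build SRT text from (start_seconds, end_seconds, text) tuples.

    Each cue is a 1-based index, a "start --> end" line, and the text;
    cues are separated by a blank line.
    """
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)


# Hypothetical cues standing in for recognizer output.
sample_cues = [(0.0, 1.5, "hello world"), (1.5, 3.2, "this is a test")]
print(to_srt(sample_cues))
```
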

Test_speaker looks like it’s a way to identify who is talking. Is there any additional documentation on that?

There are a few closed issues that explain the details.

In general, we might have some more detailed docs a bit later.

peterkronenberg commented 3 years ago

I looked at all the closed issues, but there doesn't seem to be any straightforward guide on how to get started and what everything does. Anything you can provide here, until additional documentation is available, would be appreciated. Some comments in the code would also help.

peterkronenberg commented 3 years ago

For example, what does this string represent? spk_sig = [-1.110417,0.09703002,1.35658,0.7798632,-0.305457,-0.339204,0.6186931,-0.4521213,0.3982236,-0.004530723,0.7651616,0.6500852,-0.6664245,0.1361499,0.1358056,-0.2887807,-0.1280468,-0.8208137,-1.620276,-0.4628615,0.7870904,-0.105754,0.9739769,-0.3258137,-0.7322628,-0.6212429,-0.5531687,-0.7796484,0.7035915,1.056094,-0.4941756,-0.6521456,-0.2238328,-0.003737517,0.2165709,1.200186,-0.7737719,0.492015,1.16058,0.6135428,-0.7183084,0.3153541,0.3458071,-1.418189,-0.9624157,0.4168292,-1.627305,0.2742135,-0.6166027,0.1962581,-0.6406527,0.4372789,-0.4296024,0.4898657,-0.9531326,-0.2945702,0.7879696,-1.517101,-0.9344181,-0.5049928,-0.005040941,-0.4637912,0.8223695,-1.079849,0.8871287,-0.9732434,-0.5548235,1.879138,-1.452064,-0.1975368,1.55047,0.5941782,-0.52897,1.368219,0.6782904,1.202505,-0.9256122,-0.9718158,-0.9570228,-0.5563112,-1.19049,-1.167985,2.606804,-2.261825,0.01340385,0.2526799,-1.125458,-1.575991,-0.363153,0.3270262,1.485984,-1.769565,1.541829,0.7293826,0.1743717,-0.4759418,1.523451,-2.487134,-1.824067,-0.626367,0.7448186,-1.425648,0.3524166,-0.9903384,3.339342,0.4563958,-0.2876643,1.521635,0.9508078,-0.1398541,0.3867955,-0.7550205,0.6568405,0.09419366,-1.583935,1.306094,-0.3501927,0.1794427,-0.3768163,0.9683866,-0.2442541,-1.696921,-1.8056,-0.6803037,-1.842043,0.3069353,0.9070363,-0.486526]

I don't necessarily want all the mathematics behind it, but is this just a hard-coded thing that we need? Or is this the signature for a specific speaker that we're looking for?

What do the numbers in the resultant x-vector represent? Does the speaker distance refer to a confidence factor? How close should two distances be before we assume it's the same speaker?
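To make the question concrete, here is a minimal sketch of how two speaker vectors could be compared. It assumes that spk_sig is a reference vector for one known speaker and that the "speaker distance" is a cosine distance between that reference and the vector returned for new audio; neither assumption is confirmed in this thread. The function name, the short 4-element vectors, and the threshold value are hypothetical and only illustrate the idea.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b).

    0 means the vectors point the same way, 1 means they are orthogonal,
    2 means they point in opposite directions.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical short vectors standing in for real speaker vectors.
reference = [0.5, -1.2, 0.3, 0.9]      # assumed: enrolled speaker
same_voice = [0.6, -1.1, 0.2, 1.0]     # assumed: same speaker, new audio
other_voice = [-0.7, 0.4, 1.5, -0.2]   # assumed: different speaker

THRESHOLD = 0.5  # hypothetical cut-off; it would need tuning on real recordings

for name, vec in [("same_voice", same_voice), ("other_voice", other_voice)]:
    distance = cosine_distance(reference, vec)
    verdict = "match" if distance < THRESHOLD else "no match"
    print(name, round(distance, 3), verdict)
```

Under these assumptions the distance is not a confidence score: it only says how far apart two vectors are, and a smaller value means the voices are more similar. Any cut-off for "same speaker" would have to be chosen empirically.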

peterkronenberg commented 3 years ago

Please see https://github.com/alphacep/vosk-api/issues/428

itzsimpl commented 3 years ago

I'm playing with the added support for speaker recognition and I have a couple of questions.

Thanks for the good work.

mlcatinit commented 1 year ago

@peterkronenberg Have you figured out how to interpret the output?

nshmyrev commented 1 year ago

Android demo here:

https://github.com/virex-84/VoskIdentification