Open peterkronenberg opened 3 years ago
Not sure what test_srt is doing
It creates an SRT file (a common file format for subtitles).
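For anyone unfamiliar with the format: an SRT file is a sequence of numbered cues, each with a `start --> end` time line followed by the subtitle text and a blank line. Below is a minimal sketch of how one cue is laid out. The helper names `srt_timestamp` and `srt_cue` are made up for illustration and are not taken from test_srt.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def srt_cue(index, start, end, text):
    """Build one SRT cue: index line, time range line, then the text."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"


if __name__ == "__main__":
    # Cues are separated by a blank line in the final file.
    cues = [srt_cue(1, 0.0, 2.5, "hello"), srt_cue(2, 2.5, 4.0, "world")]
    print("\n".join(cues))
```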
Test_speaker looks like it’s a way to identify who is talking. Is there any additional documentation on that?
There are a few closed issues that explain the details.
In general we might have some more detailed docs a bit later.
I looked at all the closed issues, but there doesn't seem to be any straightforward guide to how to get started and what everything does. Anything you can provide here, until additional documentation is available, would be appreciated. Some comments in the code would also help.
For example, what does this string represent? spk_sig = [-1.110417,0.09703002,1.35658,0.7798632,-0.305457,-0.339204,0.6186931,-0.4521213,0.3982236,-0.004530723,0.7651616,0.6500852,-0.6664245,0.1361499,0.1358056,-0.2887807,-0.1280468,-0.8208137,-1.620276,-0.4628615,0.7870904,-0.105754,0.9739769,-0.3258137,-0.7322628,-0.6212429,-0.5531687,-0.7796484,0.7035915,1.056094,-0.4941756,-0.6521456,-0.2238328,-0.003737517,0.2165709,1.200186,-0.7737719,0.492015,1.16058,0.6135428,-0.7183084,0.3153541,0.3458071,-1.418189,-0.9624157,0.4168292,-1.627305,0.2742135,-0.6166027,0.1962581,-0.6406527,0.4372789,-0.4296024,0.4898657,-0.9531326,-0.2945702,0.7879696,-1.517101,-0.9344181,-0.5049928,-0.005040941,-0.4637912,0.8223695,-1.079849,0.8871287,-0.9732434,-0.5548235,1.879138,-1.452064,-0.1975368,1.55047,0.5941782,-0.52897,1.368219,0.6782904,1.202505,-0.9256122,-0.9718158,-0.9570228,-0.5563112,-1.19049,-1.167985,2.606804,-2.261825,0.01340385,0.2526799,-1.125458,-1.575991,-0.363153,0.3270262,1.485984,-1.769565,1.541829,0.7293826,0.1743717,-0.4759418,1.523451,-2.487134,-1.824067,-0.626367,0.7448186,-1.425648,0.3524166,-0.9903384,3.339342,0.4563958,-0.2876643,1.521635,0.9508078,-0.1398541,0.3867955,-0.7550205,0.6568405,0.09419366,-1.583935,1.306094,-0.3501927,0.1794427,-0.3768163,0.9683866,-0.2442541,-1.696921,-1.8056,-0.6803037,-1.842043,0.3069353,0.9070363,-0.486526]
I don't necessarily want all the mathematics behind it, but is this just a hard-coded thing that we need? Or is this the signature for a specific speaker that we're looking for?
What do the numbers in the resultant x-vector represent? Does the speaker distance refer to a confidence factor? How close should 2 distances be before we assume it's the same speaker?
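To make the "speaker distance" question concrete, here is a minimal sketch that assumes the distance in question is a cosine distance between two x-vectors (0 means same direction, 1 orthogonal, 2 opposite). The `threshold` parameter and the `same_speaker` helper are hypothetical; a workable threshold would have to be found by experiment on your own audio and is not something this thread specifies.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def same_speaker(a, b, threshold):
    """Hypothetical decision rule: treat small distances as the same speaker."""
    return cosine_distance(a, b) < threshold
```

Under this reading, the distance is a dissimilarity score rather than a confidence: smaller values mean the two vectors are more alike.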
I'm playing with the added support for speaker recognition and I have a couple of questions.
1. What is spk_frames and how should it be interpreted (e.g. translated to time in seconds)? Is this the detected voice activity, based on which the x-vectors are computed?
2. How would one approach speaker change detection, i.e. without a pre-populated database of known x-vectors? Would computing the cosine distance to the previous utterance's x-vector suffice (i.e. treat the speaker as changed if the distance is above a threshold, otherwise assume the same speaker and add the x-vector to the average)?
Thanks for the good work.
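The approach described in the question above can be sketched as follows. This is only an illustration of the proposed idea, not a confirmed or recommended method: the class name `SpeakerChangeDetector`, its `threshold` argument, and the choice to compare against the running average (rather than only the previous utterance) are all assumptions.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


class SpeakerChangeDetector:
    """Hypothetical detector: compares each utterance's x-vector
    against the running average for the current speaker."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed value, to be tuned by experiment
        self.mean = None            # running average x-vector
        self.count = 0              # utterances averaged so far

    def update(self, xvector):
        """Feed one utterance's x-vector; return True if the speaker changed."""
        if self.mean is None:
            self.mean, self.count = list(xvector), 1
            return False
        if cosine_distance(self.mean, xvector) > self.threshold:
            # Distance too large: start a fresh average for the new speaker.
            self.mean, self.count = list(xvector), 1
            return True
        # Same speaker: fold the new vector into the running average.
        self.count += 1
        self.mean = [m + (x - m) / self.count
                     for m, x in zip(self.mean, xvector)]
        return False
```

Usage would be one `update` call per recognized utterance, in order. Short utterances may give noisy x-vectors, so a single threshold may not be reliable without some minimum-length filter.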
@peterkronenberg Have you figured out how to interpret the output?
Android demo here:
Is there any additional documentation or description of the Python test files? Some are pretty obvious. Some are not.
Transcript_scp.py is in a separate directory, python/test. What does that one do? There don't seem to be any sample input files it can use.
And then there are test_srt and test_speaker. Not sure what test_srt is doing. Test_speaker looks like it’s a way to identify who is talking. Is there any additional documentation on that? Is it simply estimating relative distances between speakers? There’s a big hard-coded array in there. Not sure if that's something I would need.