alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

How to interpret speaker id given output of test_speaker.py #1396

Closed Caet-pip closed 1 year ago

Caet-pip commented 1 year ago

example output: this is transcription of a youtube video : https://www.youtube.com/watch?v=SXGtFXRz2Lg Text: given how it goes against my name is sam houser the healthiest X-vector: [0.464815, -1.90765, 1.721063, -0.058873, 0.112122, -0.157388, 0.345781, -1.529996, 0.203998, 0.526525, 0.520173, 0.032436, -0.059312, -1.448676, -0.373182, 1.3598, 0.153305, 0.28401, -0.401323, 0.245897, -1.473901, 1.135092, -1.606027, -1.360494, 1.100037, 0.567094, 0.716901, -0.464215, 0.504879, -0.584588, 1.001908, 1.437604, 1.166166, 0.436322, -1.767293, 0.461723, 0.854106, 1.053077, 0.262169, 1.127515, -0.424773, -0.125869, 1.629272, -0.434885, 0.613825, -0.480744, -0.789676, -1.094092, 1.798493, 0.281596, 0.041654, 0.500181, -1.200724, -1.251635, -1.804671, 0.139075, 1.634623, 1.801339, 0.842007, -1.38078, -1.418169, -0.268587, 0.681391, 0.033919, 1.406015, 0.517243, 0.349176, 0.202597, -0.003295, -1.315867, -0.536843, 0.766411, 0.773979, 0.391754, -0.24479, 2.265621, 0.427111, -1.110245, 0.422459, 1.515973, -1.619512, 0.304797, -1.612, -0.693839, 0.331945, -0.257472, 1.681183, -0.119251, -0.469496, -0.768261, 0.111422, 0.171961, 0.3012, 1.348406, 1.41776, 3.060665, -0.894705, -0.020721, 0.655986, -1.1154, 1.22471, 0.202716, 0.731662, -1.304038, 1.495937, -0.045493, -1.018393, -0.274046, -1.438403, -1.105816, -0.27321, -1.681556, -1.169248, -0.401588, 1.865974, 1.460877, -1.141696, 0.117315, -0.258752, 0.215494, -0.014842, -0.24107, -0.589189, 0.760386, 0.563039, -1.29448, 0.447073, -0.191942] Speaker distance: 0.9699811595718226 based on 303 frames LOG (VoskAPI:~CachingOptimizingCompiler():nnet-optimize.cc:710) 0.00501 seconds taken in nnet3 compilation total (breakdown: 0.00501 compilation, 0 optimization, 0 shortcut expansion, 0 checking, 0 computing indexes, 0 misc.) + 0 I/O. 
Text: as a book called from from the as love and it's return it because it was around size and the size not available X-vector: [-1.204846, -0.898753, 2.785844, -0.912218, 0.219721, -1.719704, 0.643861, -0.392169, 0.427482, 0.452078, 0.907587, -0.618629, 1.223167, -0.377151, 0.78743, 1.748574, -0.353695, 0.667813, -0.304847, 0.294658, 0.501684, -0.192875, 1.195725, 1.288609, -0.318152, 0.434202, 0.117976, -0.26447, 0.30446, 0.100406, 0.828961, 0.291019, -1.303682, 0.856536, 0.939398, 3.310984, 0.193254, 0.283365, 1.350362, 0.234929, -0.512757, 0.106699, 0.935026, -1.726255, -1.008779, -0.061726, -0.552396, -0.030794, -0.210984, 1.057081, 0.694199, -0.575876, -2.02285, -1.169062, -0.125201, -1.205397, -1.400767, 1.166652, -0.090379, -0.415274, 0.975145, 1.389556, 0.158879, 0.397279, 0.35169, 0.885883, -0.844625, -1.349642, 0.162851, -1.487563, 0.18416, -0.30577, -0.969822, 1.580805, -0.144747, 0.814953, -0.376449, 1.098513, -0.403689, -0.598828, -1.707152, 0.483546, -0.584533, -1.710444, 0.734886, 1.492739, 0.114992, -2.099648, -0.267751, -1.150931, -0.190341, 2.4694, -0.707109, 0.778992, -0.908344, 1.045355, 1.323595, 1.151674, -0.031629, -0.721309, 0.030275, -0.599249, 0.273303, 1.383225, 1.825816, -1.33909, -1.198679, 0.122582, -0.11549, -1.761748, -1.49878, -1.606291, -0.562747, -0.188545, -0.593051, -0.056335, 0.013723, -1.544675, -0.038081, 0.178501, 1.034716, -0.700002, -1.470005, -0.26878, -0.440276, -0.152373, 0.125389, 0.703633] Speaker distance: 0.7920246650112686 based on 651 frames LOG (VoskAPI:~CachingOptimizingCompiler():nnet-optimize.cc:710) 0.0156 seconds taken in nnet3 compilation total (breakdown: 0.0156 compilation, 0 optimization, 0 shortcut expansion, 0 checking, 0 computing indexes, 0 misc.) + 0 I/O. 
Text: avid receive an email from you and wondering if my payment has been funded X-vector: [-0.650169, -1.223591, 2.767466, -0.494385, 0.715284, -0.665128, 0.887355, 0.53964, -0.00326, 1.289442, -0.014705, -0.015092, 0.829081, 0.324388, 0.793968, 2.174463, 1.321632, 1.90116, -0.577346, 0.686377, 0.077606, -0.850515, 0.075722, 0.329665, 1.647757, 1.359601, 1.02766, 0.629414, 0.052442, 1.098863, 0.202858, -0.125983, -0.813122, -0.652277, 1.09946, 2.408845, 0.861618, 0.170835, 0.503619, -0.652636, -2.050069, 0.350497, 0.816911, -0.437929, -0.199097, 0.102779, 0.172766, -1.942744, 0.932407, 0.215404, 0.869958, 0.986976, -1.610969, -2.753284, -0.029473, -0.607053, -0.804561, 1.256239, -1.384392, -0.346026, -0.243504, 0.094259, -0.443258, -0.195866, 0.082088, 0.298801, -1.772337, -0.958388, -0.444779, -1.769526, 1.458044, 0.24734, 1.972661, 1.748002, 0.371389, 1.279829, -1.072043, 1.019435, 0.772273, -0.680353, -0.621706, 0.999314, 0.916217, 0.739836, -0.06127, -0.37034, 0.987634, -1.061082, -0.754419, 0.534074, -0.25866, 0.948396, -1.01469, -1.443675, 0.54063, 0.210136, -1.017184, 0.14811, 0.38557, -0.619021, -1.466153, -1.07491, -0.163329, 1.467264, 1.329748, -0.12015, 1.354678, -0.199967, -0.637691, 1.303172, -0.891161, -1.280879, 0.309917, -0.529935, 0.683505, 0.511901, -0.156986, -0.772542, -0.596468, 0.252959, 2.466998, 0.428671, 0.395836, 0.494242, -0.821945, 0.167609, 0.19296, -0.967105] Speaker distance: 0.9641331227961727 based on 441 frames LOG (VoskAPI:~CachingOptimizingCompiler():nnet-optimize.cc:710) 0.0136 seconds taken in nnet3 compilation total (breakdown: 0.0126 compilation, 0 optimization, 0 shortcut expansion, 0 checking, 0 computing indexes, 0.001 misc.) + 0 I/O. 
Text: seems like it's a lot of like on my couch yep i do apologize for the inconvenience melinda go ahead and check if there are some notes and ordered a deal may have may ask for the original on the online order number okay oh them says or list oh for a moment ago father X-vector: [-0.891653, -0.29478, 2.248965, -0.289813, 0.178471, -1.526459, -0.325981, -0.183466, 1.398953, -0.105055, 0.315682, 0.024962, 0.211938, -1.661646, 0.411296, 1.093914, 0.100714, -2.540077, -0.044491, 1.066044, 0.049124, 1.26613, 0.078279, 0.435458, 1.009032, -0.152431, 1.211141, -1.100716, -1.788212, 0.80617, -1.484393, 0.179818, -0.643831, -0.958402, -2.136472, 0.857522, 0.050922, 1.208333, 0.718239, -0.424354, -0.999734, -0.245781, 0.922534, -0.614824, 1.369977, -2.107441, 0.252689, -0.180775, -0.883977, -0.16348, 1.166248, 0.024695, -3.172954, 0.519688, 1.199615, -0.672276, 0.316222, 0.745327, -1.542266, -0.936811, 1.067355, 0.360523, 0.407422, 0.779042, 1.518261, 0.575192, -0.977163, -0.688699, -0.419232, -1.35514, 0.365726, 0.873962, 0.786109, 0.784424, -0.839086, 0.449379, -1.510559, 0.044052, -0.784394, 0.448382, -0.550311, -0.760558, -2.351515, 2.04004, -0.975615, 1.074272, 2.075982, -0.201826, 0.642645, -0.247127, -0.436003, 0.395312, 0.709274, -0.744411, -0.492896, -0.786147, -0.229771, -0.355455, 0.643584, -0.833716, 0.303833, -0.333836, -1.44951, -1.975393, -0.299941, -0.422221, 0.410396, 0.586135, -1.619727, 1.315935, -0.458488, -0.135504, -0.126564, 1.078133, 1.840289, 0.577564, 0.361293, -0.479572, -1.013131, -0.119151, 0.022602, 1.448236, -0.426447, 1.172518, -0.189978, 0.286676, -0.664156, 0.909737] Speaker distance: 1.041365425724021 based on 1026 frames

nshmyrev commented 1 year ago

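It is an x-vector: a vector in a 128-dimensional space that represents the speaker's identity. On its own it is not a speaker ID; it only becomes meaningful when compared with another x-vector, where a smaller distance suggests the same speaker: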

https://dsp.stackexchange.com/questions/59086/what-are-i-vectors-and-x-vectors-in-the-context-of-speech-recognition

Basically the same issue as https://github.com/alphacep/vosk-api/issues/405
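
To make the comparison concrete, here is a minimal sketch of how two x-vectors could be compared with cosine distance. It assumes the "Speaker distance" value in the output above is a cosine distance against a reference vector; that is an assumption, not something confirmed in this thread. The function names, the toy 4-dimensional vectors, and the 0.5 threshold are hypothetical and only for illustration; a real threshold would have to be tuned on your own recordings.

```python
import math


def cosine_distance(x, y):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def same_speaker(x, y, threshold=0.5):
    """Guess whether two x-vectors come from the same speaker.

    The threshold is a hypothetical placeholder, not a value taken
    from Vosk; tune it on labelled data.
    """
    return cosine_distance(x, y) < threshold


# Toy 4-dimensional vectors standing in for real 128-dimensional x-vectors.
reference = [1.0, 0.0, 2.0, -1.0]   # enrolled speaker
close = [0.9, 0.1, 2.1, -0.8]       # points in nearly the same direction
far = [-1.0, 2.0, 0.0, 1.0]         # points in a very different direction

print(same_speaker(reference, close))
print(same_speaker(reference, far))
```

In practice you would store one reference x-vector per known speaker (ideally from a long utterance, since vectors based on few frames are noisier) and assign each new utterance to the speaker with the smallest distance, or to "unknown" if every distance is above the threshold.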