I have many audio files containing human speech.
I want to group them by speaker.
As a test, I took one long file (about 18 minutes) and computed embeddings for it (about 80 vectors). That means each vector covers about 14 seconds of audio.
My idea was to compute the centroid of these vectors and then find similar centroids from other files using cosine distance with some threshold value.
But first I need to choose the threshold. To do that, I am comparing every vector with every other vector from the same file to get the average distance.
However, the distances I get vary widely, and the vectors all look too different from one another. Why might that be?
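
For reference, here is a minimal sketch of the comparison described above, assuming the embeddings of one file are stacked as rows of a NumPy array. The function names, the embedding dimension (192), and the random placeholder data are all illustrative assumptions, not the actual model output; real embeddings would replace the `embeddings` array.

```python
import numpy as np

def l2_normalize(vectors):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def pairwise_cosine_distances(vectors):
    """Cosine distance (1 - cosine similarity) for every unordered pair of rows."""
    unit = l2_normalize(vectors)
    similarity = unit @ unit.T
    # Upper triangle without the diagonal: each pair counted once, no self-pairs.
    rows, cols = np.triu_indices(len(vectors), k=1)
    return 1.0 - similarity[rows, cols]

def centroid_distances(vectors):
    """Cosine distance from each row to the normalized mean of the unit vectors."""
    unit = l2_normalize(vectors)
    centroid = l2_normalize(unit.mean(axis=0, keepdims=True))[0]
    return 1.0 - unit @ centroid

# Placeholder data standing in for the ~80 embeddings of one file (assumption).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(80, 192))

pair_d = pairwise_cosine_distances(embeddings)
cent_d = centroid_distances(embeddings)

print("pairs compared:", pair_d.size)
print("pairwise distance mean/std:", pair_d.mean(), pair_d.std())
print("distance to centroid mean/max:", cent_d.mean(), cent_d.max())
```

Looking at the spread (standard deviation, maximum) and not only the average may help when picking a threshold, since a wide spread within one file suggests the segments are not all equally close to the centroid.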