KoljaB / WhoSpeaks

Efficient approach to speaker diarization using voice characteristics extraction

It is really good But #1

Open francqz31 opened 6 months ago

francqz31 commented 6 months ago

Hey @KoljaB, I have tried this tool and it is surprisingly good. It definitely outperformed pyannote. But I'm wondering how it could be extended to 10+ speakers. That would be really useful for creating datasets from long audio with 10+ speakers.

KoljaB commented 6 months ago

Thanks, I was also surprised by how well it worked. We need a better clustering algorithm than my rather naive "take 10%, then mean out, and then just take the opposite of that" approach. I'm no expert on this. I discussed it with GPT-4, which suggested things like K-means, hierarchical clustering, or DBSCAN on the distance matrix, all of which are beyond my current scope. But this is where I need to dig a bit deeper: clustering the speaker embedding similarities into more than two groups would enable multi-speaker support.
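For illustration only (this is not the code from this repo): a minimal sketch of what "clustering the embeddings into more than two groups" could look like, using plain average-linkage hierarchical clustering on cosine distances. The function names, the toy 3-dimensional "embeddings", and the `threshold` value are all assumptions made up for the example; real speaker embeddings have far more dimensions, and the threshold would need tuning.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cluster_embeddings(embeddings, threshold=0.5):
    """Average-linkage agglomerative clustering (hypothetical helper).

    Repeatedly merges the two closest clusters until the closest pair
    is further apart than `threshold`. The number of speakers is not
    given up front; it falls out of the threshold.
    Returns one integer label per embedding.
    """
    n = len(embeddings)
    # Pairwise distance matrix between all segment embeddings.
    dist = [[cosine_distance(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]
    # Start with every segment in its own cluster.
    clusters = [[i] for i in range(n)]

    def linkage(c1, c2):
        # Average distance over all cross-cluster pairs.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break  # remaining clusters are treated as distinct speakers
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * n
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy data: six segments from three clearly separated "speakers".
embeddings = [
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0], [0.0, 0.1, 0.9],
]
labels = cluster_embeddings(embeddings, threshold=0.5)
print(labels)  # [0, 0, 1, 1, 2, 2]
```

The brute-force merge loop here is cubic or worse in the number of segments, so for long recordings a library implementation of hierarchical clustering or DBSCAN would be the more practical choice.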

francqz31 commented 6 months ago

Yeah, we definitely need new speaker diarization methods; pyannote and the others are not doing it well enough. BTW, you can also chat with OPUS for free at https://chat.lmsys.org/ through direct chat. OPUS is better for sure and codes way better if you give it the code of this repo as a reference. But you are definitely on the right path with these new ideas. I will dig into it too.

KoljaB commented 5 months ago

Just added a clustering method for multiple speaker support (auto_diarize.py)

francqz31 commented 5 months ago

I will try it ASAP then. I have a good video with 8-10 speakers, but I can identify them all, so I will calculate WER in my head lol