KoljaB / WhoSpeaks

Efficient approach to speaker diarization using voice characteristics extraction

WhoSpeaks

Toolkit for Enhanced Voice Training Datasets

Note: realtime_diarize.py requires recent changes to RealtimeSTT. Please upgrade to the latest version (`pip install --upgrade RealtimeSTT`).

WhoSpeaks emerged from the need for better speaker diarization tools. Existing libraries are heavyweight and often fall short in reliability, speed, and efficiency, so this project offers a more refined alternative.

Hint: Anybody interested in state-of-the-art voice solutions please also have a look at Linguflex. It lets you control your environment by speaking and is one of the most capable and sophisticated open-source assistants currently available.

Here's the core concept:

  1. Sentence splitting: divide the audio into individual sentences.
  2. Voice characteristics extraction: compute an embedding that captures each sentence's voice profile.
  3. Speaker profiling: group similar embeddings to establish speaker profiles.

These steps allow us to match any sentence against the established speaker profiles with remarkable precision.
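To illustrate the matching idea (this is a simplified sketch, not the actual WhoSpeaks code; the embeddings, profile names, and similarity measure are placeholders), each sentence embedding can be assigned to the speaker profile with the highest cosine similarity:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def assign_speaker(sentence_embedding, speaker_profiles):
    # speaker_profiles maps a speaker name to the mean embedding
    # of that speaker's sentences.
    return max(speaker_profiles,
               key=lambda name: cosine_similarity(sentence_embedding,
                                                  speaker_profiles[name]))

profiles = {
    "speaker_A": [0.9, 0.1, 0.0],  # hypothetical profile embeddings
    "speaker_B": [0.1, 0.9, 0.2],
}
print(assign_speaker([0.8, 0.2, 0.1], profiles))  # → speaker_A
```

Real voice embeddings are much higher-dimensional, but the matching step works the same way.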

Feature Modules

Note: auto_diarize.py handles multiple speakers; speaker_diarize.py is for exactly two speakers.

I initially developed this as a personal project but was astounded by its effectiveness. In my first tests it outperformed existing solutions such as pyannote.audio in both reliability and speed while remaining more lightweight. Because it could be a significant step up in voice diarization capabilities, I've decided to release this rather raw, yet powerful code for others to experiment with.

Performance and Testing

To demonstrate WhoSpeaks' capabilities, I ran a test on a challenging audio sample: the 4:38 coin toss scene from "No Country for Old Men". In this scene, the two male speakers have very similar voice profiles, a difficult scenario for diarization libraries.

Process:

  1. Download: Use fetch_youtube_mp3.py to download the MP3 from the scene's YouTube video.
  2. Diarization comparison: Run the scene through pyannote_diarize.py (based on pyannote.audio) with the number of speakers set to 2.
    • Pyannote's output was inaccurate, incorrectly assigning most sentences to a single speaker.
  3. WhoSpeaks analysis:
    • Sentence splitting: Use split_dataset.py with the tiny.en Whisper model for efficiency (large-v2 offers higher accuracy).
    • Conversion: Convert the MP3 segments to WAV format with convert_wav.py.
    • Diarization: Run auto_diarize.py and visually inspect the dendrogram file to confirm the presence of two speakers.
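The dendrogram step groups sentence embeddings bottom-up until the desired number of speakers remains. As a rough illustration of that idea (a minimal sketch, not the actual auto_diarize.py implementation; the 2-D embeddings are toy values), a single-linkage agglomerative clustering could look like:

```python
# Minimal agglomerative clustering sketch (single linkage, Euclidean),
# illustrating how per-sentence embeddings can be grouped into speakers.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def agglomerative_cluster(embeddings, n_clusters):
    # Start with every sentence in its own cluster.
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > n_clusters:
        # Find the closest pair of clusters (single linkage: the
        # smallest distance between any two of their members).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(embeddings[a], embeddings[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return clusters

# Four toy sentence embeddings: two per speaker.
embeddings = [[0.9, 0.1], [0.85, 0.15], [0.1, 0.9], [0.15, 0.85]]
print(agglomerative_cluster(embeddings, 2))  # → [[0, 1], [2, 3]]
```

Inspecting the merge distances (the heights in the dendrogram) is what lets you confirm visually how many distinct speakers the recording contains.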

To run auto_diarize.py and speaker_diarize.py, set the environment variable COQUI_MODEL_PATH to the directory containing the "v2.0.2" model folder of Coqui XTTS.
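For example, on Linux or macOS (the path below is a placeholder; use your own model location):

```shell
# COQUI_MODEL_PATH must point at the directory that contains the
# "v2.0.2" XTTS model folder (i.e. /path/to/coqui_xtts/v2.0.2 exists).
export COQUI_MODEL_PATH=/path/to/coqui_xtts
```

On Windows, use `set COQUI_MODEL_PATH=C:\path\to\coqui_xtts` instead.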

Results:

WhoSpeaks' effectiveness in this test, particularly compared with pyannote.audio, showcases its potential for handling complex diarization scenarios with high accuracy and efficiency.