Toolkit for Enhanced Voice Training Datasets
Note: realtime_diarize.py required changes to RealtimeSTT. Please upgrade RealtimeSTT to the latest version.
WhoSpeaks emerged from the need for better speaker diarization tools. Existing libraries are heavyweight and often fall short in reliability, speed and efficiency, so this project offers a more refined alternative.
Hint: If you are interested in state-of-the-art voice solutions, please also have a look at Linguflex. It lets you control your environment by speaking and is one of the most capable and sophisticated open-source assistants currently available.
Here's the core concept:
These steps allow us to match any sentence against the established speaker profiles with remarkable precision.
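To make the matching idea concrete, here is a minimal sketch. It assumes each sentence has already been turned into a fixed-length embedding vector, that a speaker profile is the plain average of that speaker's sentence embeddings, and that similarity is measured with cosine similarity. The function names and the toy vectors are hypothetical and are not taken from the WhoSpeaks scripts.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def average_embedding(embeddings):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(embeddings)
    return [sum(column) / count for column in zip(*embeddings)]


def match_speaker(sentence_embedding, profiles):
    """Return (speaker_name, similarity) for the closest profile.

    `profiles` maps a speaker name to that speaker's profile vector.
    """
    scores = {
        name: cosine_similarity(sentence_embedding, profile)
        for name, profile in profiles.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    # Toy 3-dimensional embeddings; real voice embeddings are much larger.
    profiles = {
        "speaker_a": average_embedding([[1.0, 0.0, 0.1], [0.9, 0.1, 0.0]]),
        "speaker_b": average_embedding([[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]]),
    }
    print(match_speaker([0.8, 0.2, 0.0], profiles))
```

Averaging several sentences into one profile is what makes a single noisy sentence matter less, at least in this simplified picture.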
Note: auto_diarize is for multiple speakers; speaker_diarize is for two speakers only.
I initially developed this as a personal project, but was astounded by its effectiveness. In my first tests it outperformed existing solutions like pyannote audio in both reliability and speed while being the more lightweight approach. For me it could be a significant step up in voice diarization capabilities. That's why I've decided to release this rather raw, yet powerful code for others to experiment with.
To demonstrate WhoSpeaks' capabilities, I made a test using a challenging audio sample: the 4:38 Coin Toss scene from "No Country for Old Men". In this scene, the two male speakers have very similar voice profiles, presenting a difficult scenario for diarization libraries.
Using fetch_youtube_mp3.py, download the MP3 from the scene's YouTube video.
Run pyannote_diarize.py (from pyannote audio) and set the speaker parameters to 2.
Run split_dataset.py with tiny.en for efficiency, though large-v2 offers higher accuracy.
Run convert_wav.py.
Run auto_diarize.py and visually inspect the dendrogram file to confirm the presence of two speakers.
To run auto_diarize.py and speaker_diarize.py, it is necessary to set the environment variable COQUI_MODEL_PATH to the path containing the "v2.0.2" model folder for coqui XTTS.
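Before starting a long run, it can help to check that COQUI_MODEL_PATH is set up the way the scripts expect. The following is a small sketch of such a check. The helper name resolve_xtts_model_dir is hypothetical and not part of WhoSpeaks; the only thing taken from the text above is that COQUI_MODEL_PATH must point at a directory containing a "v2.0.2" folder.

```python
import os
from pathlib import Path


def resolve_xtts_model_dir(env=None, version="v2.0.2"):
    """Return the XTTS model folder inside COQUI_MODEL_PATH, or raise a clear error.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    base = env.get("COQUI_MODEL_PATH")
    if not base:
        raise RuntimeError("COQUI_MODEL_PATH is not set")
    model_dir = Path(base) / version
    if not model_dir.is_dir():
        raise FileNotFoundError(f"expected model folder not found: {model_dir}")
    return model_dir


if __name__ == "__main__":
    print(f"Using XTTS model folder: {resolve_xtts_model_dir()}")
```

Failing early with a readable message is cheaper than finding out about a wrong path after the audio has already been downloaded and split.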
The effectiveness of WhoSpeaks in this test, particularly against pyannote audio, showcases its potential in handling complex diarization scenarios with high accuracy and efficiency.