MahmoudAshraf97 / whisper-diarization

Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper
BSD 2-Clause "Simplified" License

Speaker Diarization Using OpenAI Whisper

Speaker Diarization pipeline based on OpenAI Whisper

Please star the project on GitHub (see top-right corner) if you appreciate my contribution to the community!

What is it

This repository combines Whisper's ASR capabilities with Voice Activity Detection (VAD) and speaker embeddings to identify the speaker of each sentence in the transcription generated by Whisper. First, the vocals are extracted from the audio to increase speaker embedding accuracy. The transcription is then generated with Whisper, and its timestamps are corrected and aligned using ctc-forced-aligner to minimize diarization errors caused by time shifts. Next, the audio is passed to MarbleNet for VAD and segmentation to exclude silences, and TitaNet extracts speaker embeddings to identify the speaker of each segment. The result is then matched against the timestamps produced by ctc-forced-aligner to determine the speaker of each word. Finally, the speaker labels are realigned using punctuation models to compensate for minor time shifts.
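The word-level speaker assignment described above can be illustrated with a minimal sketch. It is not the code used in this repository: the `Word` and `SpeakerTurn` types and the `assign_speakers` function are hypothetical names introduced for illustration, and the maximum-overlap rule is an assumption about how word timestamps could be matched to speaker segments.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A transcribed word with aligned timestamps in seconds (hypothetical type)."""
    text: str
    start: float
    end: float


@dataclass
class SpeakerTurn:
    """A diarization segment attributed to one speaker (hypothetical type)."""
    speaker: str
    start: float
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it the most.

    A word that overlaps no turn (for example, it falls in a gap between
    segments) inherits the previous word's speaker, or None if there is none.
    """
    labeled = []
    previous = None
    for word in words:
        best_speaker, best_overlap = previous, 0.0
        for turn in turns:
            shared = overlap(word.start, word.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labeled.append((word.text, best_speaker))
        previous = best_speaker
    return labeled


words = [Word("Hello", 0.0, 0.4), Word("there.", 0.5, 0.9), Word("Hi!", 1.2, 1.5)]
turns = [SpeakerTurn("Speaker 0", 0.0, 1.0), SpeakerTurn("Speaker 1", 1.1, 2.0)]
print(assign_speakers(words, turns))
```

Small timestamp errors near a speaker change can put a word in the wrong turn under a rule like this, which is why the pipeline follows this step with a punctuation-based realignment.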

Whisper and NeMo parameters are hard-coded in diarize.py and helpers.py; I will add CLI arguments to change them later.

Installation

Python >= 3.10 is required. Python 3.9 will work, but you will need to install the requirements manually, one by one.

FFmpeg and Cython must be installed before installing the requirements.

pip install cython

or

sudo apt update && sudo apt install cython3

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg

# on Windows using WinGet (https://github.com/microsoft/winget-cli)
winget install ffmpeg

pip install -c constraints.txt -r requirements.txt

Usage

python diarize.py -a AUDIO_FILE_NAME
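To process several recordings, the command above can be wrapped in a short script. This is a sketch under assumptions: the `build_command` and `diarize_folder` helpers and the `*.wav` pattern are hypothetical, and the script is assumed to run from the repository root so that `diarize.py` resolves. Only the `-a` option shown above is used.

```python
import subprocess
import sys
from pathlib import Path


def build_command(audio_path):
    """Build the diarize.py invocation for one file, using the current interpreter."""
    return [sys.executable, "diarize.py", "-a", str(audio_path)]


def diarize_folder(folder, pattern="*.wav"):
    """Run diarize.py on every matching file in `folder`, one file at a time.

    check=True raises CalledProcessError if any run exits with a non-zero status.
    """
    for audio_path in sorted(Path(folder).glob(pattern)):
        subprocess.run(build_command(audio_path), check=True)
```

Calling `diarize_folder("recordings")` would then process each matching file in turn.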

If your system has enough VRAM (>= 10 GB), you can use diarize_parallel.py instead. It runs NeMo in parallel with Whisper, which can be beneficial in some cases. The result is the same because the two models do not depend on each other. This is still experimental, so expect errors and rough edges. Your feedback is welcome.
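The idea of running two independent steps concurrently can be sketched as follows. This is an illustration only, not the implementation in diarize_parallel.py: `transcribe` and `diarize` are hypothetical stand-ins that sleep instead of running a model, and threads are used for brevity, whereas a real script might use separate processes.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def transcribe(audio_path):
    """Hypothetical stand-in for the Whisper transcription step."""
    time.sleep(0.1)
    return f"transcript of {audio_path}"


def diarize(audio_path):
    """Hypothetical stand-in for the NeMo diarization step."""
    time.sleep(0.1)
    return f"speaker turns of {audio_path}"


def run_sequential(audio_path):
    """Run one step after the other."""
    return transcribe(audio_path), diarize(audio_path)


def run_parallel(audio_path):
    """Start both steps at once; neither consumes the other's output."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        transcript_future = pool.submit(transcribe, audio_path)
        turns_future = pool.submit(diarize, audio_path)
        return transcript_future.result(), turns_future.result()


# Both strategies produce the same results; only the wall-clock time differs.
assert run_parallel("audio.wav") == run_sequential("audio.wav")
```

Because both models are resident at the same time in the parallel case, peak memory use is higher, which is consistent with the VRAM requirement stated above.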

Command Line Options

Known Limitations

Future Improvements

Acknowledgements

Special thanks to @adamjonas for supporting this project.

This work is based on OpenAI's Whisper, Faster Whisper, Nvidia NeMo, and Facebook's Demucs.

Citation

If you use this in your research, please cite the project:

@unpublished{hassouna2024whisperdiarization,
  title={Whisper Diarization: Speaker Diarization Using OpenAI Whisper},
  author={Ashraf, Mahmoud},
  year={2024}
}