akashmjn / tinydiarize

Minimal extension of OpenAI's Whisper adding speaker diarization with special tokens
MIT License

tinydiarize 🐥🗣️

Demo

https://user-images.githubusercontent.com/13268767/229617067-eca0f614-d334-480d-9801-7c30d88acdc6.mp4

You can try it out on other such gems from YouTube using this notebook.

Quickstart

Install ffmpeg following the original repo, then run:

pip install -e .
whisper --model small.en-tdrz AUDIO 

The only change is the small.en-tdrz model instead of small.en. That's it! 🎉
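Once you have a transcript, you will probably want to break it up at the speaker changes. Here is a minimal sketch of that post-processing step. It assumes speaker turns show up in the output text as a literal marker string; the default `[SPEAKER TURN]` and the helper name `split_speaker_turns` are assumptions for illustration, so check the marker against the output of your own run. Note that this only separates turns and does not assign speaker identities.

```python
def split_speaker_turns(text, marker="[SPEAKER TURN]"):
    """Split a transcript string into segments at each speaker-turn marker.

    NOTE: `marker` is an assumed placeholder; adjust it to whatever
    string your transcript actually contains.
    """
    # Split on the marker and trim surrounding whitespace from each piece.
    segments = [segment.strip() for segment in text.split(marker)]
    # Drop empty pieces (e.g. from a leading or trailing marker).
    return [segment for segment in segments if segment]


# Hypothetical transcript text, for illustration only.
example = (
    "Good morning everyone. [SPEAKER TURN] Thanks, happy to be here. "
    "[SPEAKER TURN] Let's begin."
)

for index, segment in enumerate(split_speaker_turns(example)):
    print(f"turn {index}: {segment}")
```

If your output uses a different marker, pass it via the `marker` argument instead of editing the function.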

What's included?

We aim to demonstrate a starting point enabling anyone (or even OpenAI themselves!) to improve performance and extend support (multilingual, speech translation etc.).

Performance

metric               small.en   small.en-tdrz
spk_turn_precision   -          97.7
spk_turn_recall      -          70.8
wer_overall          11.0       10.3
wer_speaker_switch   15.0       15.5

On a (tiny) benchmark set of 3 earnings calls, tdrz gets near-perfect speaker turn precision (97.7) at fairly decent recall (70.8). WER stays close to that of the original model: slightly lower overall (10.3 vs 11.0) and slightly higher around speaker switches (15.5 vs 15.0). Not too shabby for a tiny finetuning setup, and <10% extra inference cost!
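To make the precision/recall tradeoff concrete, here is a small sketch using the standard definitions of precision, recall and F1. The counts are hypothetical, and the exact way this repo matches predicted turns to reference turns may differ, so treat it as an illustration of what the numbers mean and not as the repo's scoring code. The function names are made up for this example.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Standard precision and recall from match counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical counts: 100 reference speaker turns, 72 predicted, 70 correct.
precision, recall = precision_recall(
    true_positives=70, false_positives=2, false_negatives=30
)
print(f"precision={precision:.3f} recall={recall:.3f}")

# Combining the reported small.en-tdrz numbers into a single F1 value.
reported_f1 = f1_score(97.7, 70.8)
print(f"F1 from reported precision/recall: {reported_f1:.1f}")
```

High precision with lower recall means that almost every predicted speaker turn is a real one, while some real turns are missed.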

Refer to tdrz_dev for details on performance analysis and comparisons.

More info

Gotchas

Note that this is still an early proof-of-concept and there are a few things to be aware of:

Roadmap

* is a pointer to the current state of the repo. Please see https://github.com/akashmjn/tinydiarize/issues/14 for an update on plans. TL;DR: things have had to be put on pause :/

References

[1] Joint Speech Recognition and Speaker Diarization via Sequence Transduction
[2] Serialized Output Training for End-to-End Overlapped Speech Recognition
[3] Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection
[4] Adapting Multi-Lingual ASR Models for Handling Multiple Talkers

For information on the underlying Whisper model, please refer to the original documentation (release: 20230308)

License

Code and model weights are released under the MIT License. See LICENSE for further details.

Citation

If you use this in your research, you can cite this work as:

@software{mahajan2023tinydiarize,
  author = {Mahajan, Akash},
  month = {08},
  title = {tinydiarize: Minimal extension of Whisper for speaker segmentation with special tokens},
  url = {https://github.com/akashmjn/tinyDiarize},
  year = {2023}
}