cnbeining / Whisper_Notebook

A Colab Notebook for OpenAI Whisper and DeepL API, aiming to create human-comparable results of translation and transcription.
GNU General Public License v3.0

Whisper Notebook

This Colab Notebook is designed to support OpenAI Whisper, ctranslate2, wav2vec 2.0, Silero VAD and translation (DeepL) API, aiming to generate ACICFG-opinionated human-comparable results for translation, transcription, and timestamping.

Usage

Open In Colab

Click the button to open Faster Whisper Notebook in Google Colab and follow instructions inside.

Versions

Whisper.ipynb: The first attempt.

WhisperX.ipynb: A custom version of WhisperX has been adopted for better voice activity detection performance.

Faster_Whisper_Public.ipynb: Rewritten with Faster-Whisper with built-in Silero VAD for faster inference.
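For reference, a minimal sketch of how Faster-Whisper's built-in Silero VAD is typically invoked (the model size, compute type, and VAD parameters here are illustrative assumptions, not the notebook's exact settings):

```python
def transcribe_with_vad(audio_path: str, model_size: str = "large-v2"):
    # Requires: pip install faster-whisper
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, compute_type="float16")
    # vad_filter=True enables the built-in Silero VAD;
    # min_silence_duration_ms controls how aggressively audio is split.
    segments, info = model.transcribe(
        audio_path,
        vad_filter=True,
        vad_parameters={"min_silence_duration_ms": 500},
    )
    return [(seg.start, seg.end, seg.text) for seg in segments]
```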

Technology

This repo utilizes the following technologies:

Discussion

Model Size

A certain regression could be observed with the large model; this behaviour may be caused by VAD cutoff.
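One common mitigation for VAD cutoff (not necessarily what this notebook does) is to pad each detected speech segment and merge segments that end up overlapping, so word onsets and offsets are less likely to be clipped before transcription. A minimal sketch; the segment format and thresholds are assumptions:

```python
def pad_and_merge(segments, pad=0.2, max_gap=0.5, total=None):
    """Pad each (start, end) speech segment by `pad` seconds and merge
    segments closer than `max_gap`, reducing the chance that VAD clips
    word onsets/offsets."""
    merged = []
    for start, end in sorted(segments):
        start = max(0.0, start - pad)
        end = end + pad if total is None else min(total, end + pad)
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```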

Timestamping

ACICFG employs a very opinionated way of timestamping:

You should adjust those values accordingly.
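As an illustration of the kind of values involved, a small helper that renders Whisper segment boundaries as SRT timestamps (the millisecond rounding behaviour here is an assumption, not the notebook's exact logic):

```python
def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
```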

VAD Model

The author introduced Silero VAD V4, which performs better than pyannote (commit 30794f4); the same technology is adopted in Faster-Whisper.

We believe voice activity detection (VAD) provides benefits in several areas:
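For context, standalone Silero VAD can be loaded via torch.hub; a sketch of the typical call pattern (the 16 kHz sampling rate is an assumption):

```python
def detect_speech(wav_path: str):
    # Requires: pip install torch torchaudio
    import torch

    # Silero VAD ships via torch.hub; loading returns (model, utils).
    model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
    get_speech_timestamps, _, read_audio, *_ = utils
    wav = read_audio(wav_path, sampling_rate=16000)
    # Returns [{'start': sample_idx, 'end': sample_idx}, ...]
    return get_speech_timestamps(wav, model, sampling_rate=16000)
```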

LLM vs NMT for translation

Quality

The author evaluated the output quality of English-to-Chinese neural machine translation (NMT) systems, specifically DeepL, on aviation-focused materials. Several key observations were made:

Separator persistence

Subtitle lines are segmented into shorter phrases to provide local context for the model, but these segments must be delimited with separators either within or at the end of lines in order to accommodate screen width constraints.

We have observed that large language models (LLMs) like Claude and GPT tend to ignore separators, even after few-shot learning with temperature set to 0. We summarize the behavior of specific models regarding separator persistence below:

In contrast, neural machine translation (NMT) models tend to persist separators more reliably. We found DeepL may replace separators with two newlines, which can be mitigated by extending the separator length.
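The join/split round trip described above can be sketched as follows. The separator token, and the fallback of splitting on blank lines when DeepL rewrites it, are assumptions for illustration:

```python
SEP = "|||"  # an assumed multi-character separator; longer separators
             # are less likely to be dropped or rewritten by the model

def join_for_translation(lines):
    """Join subtitle lines into one string so the translator sees context."""
    return SEP.join(lines)

def restore_lines(translated: str, expected: int):
    """Split the translation back into lines; DeepL occasionally swaps the
    separator for blank lines, so fall back to splitting on those."""
    parts = translated.split(SEP)
    if len(parts) != expected:
        parts = [p for p in translated.split("\n\n") if p.strip()]
    if len(parts) != expected:
        raise ValueError(f"expected {expected} segments, got {len(parts)}")
    return [p.strip() for p in parts]
```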

Therefore, we currently recommend NMT for subtitle translation. We welcome further testing of prompting techniques and modern models.

Potential Issues

Glossary Support

As of writing, DeepL offers no glossary support for Chinese.
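For reference, the DeepL API call the notebook's translation step would resemble, via the official `deepl` Python client (the function wrapper and target language are illustrative assumptions):

```python
def translate_lines(texts, auth_key: str):
    # Requires: pip install deepl
    import deepl

    translator = deepl.Translator(auth_key)
    # target_lang="ZH" requests Chinese; note that, as of writing,
    # DeepL glossaries cannot be applied to this target language.
    results = translator.translate_text(texts, target_lang="ZH")
    return [r.text for r in results]
```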

API Rate Limiting

The API key used here is strictly for demo and not-for-profit purposes. Reach out to the author privately for further assistance.

Author

This repo is a product of ACI Chinese Fansub Group.

本代码库由ACI字幕组技术部编写。(This repository was written by the technical department of the ACI Chinese Fansub Group.)

License

GNU General Public License v3.0.