SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2

Forced-Alignment #329

Open zh-plus opened 1 year ago

zh-plus commented 1 year ago

I am currently exploring how to use faster-whisper to perform forced alignment between audio and a ground-truth transcription text. I found that WhisperModel.find_alignment looks suitable for this purpose, but I got stuck at (what I believe is) the last step:

import ctranslate2
import numpy as np
import tokenizers

from faster_whisper import download_model, decode_audio
from faster_whisper.feature_extractor import FeatureExtractor
from faster_whisper.tokenizer import Tokenizer
from faster_whisper.transcribe import get_ctranslate2_storage

def encode(model, features):
    # Mirror WhisperModel.encode: move the encoder output to the CPU only
    # when running on multiple GPUs.
    to_cpu = model.device == "cuda" and len(model.device_index) > 1

    features = np.expand_dims(features, 0)  # add a batch dimension
    features = get_ctranslate2_storage(features)

    return model.encode(features, to_cpu=to_cpu)

def detect_language(encoded_feature_seg, model):
    results = model.detect_language(encoded_feature_seg)[0]
    # Parse language names to strip out markers
    all_language_probs = [(token[2:-2], prob) for (token, prob) in results]
    # Get top language token and probability
    language, language_probability = all_language_probs[0]

    return language, language_probability

audio = decode_audio('test_audio.wav', sampling_rate=16000)
transcription = 'this is a test transcription'
model_size = "large-v2"

# Get model
model_path = download_model(
    model_size,
    local_files_only=False,
    cache_dir=None,
)
model = ctranslate2.models.Whisper(
    model_path,
    device='cuda',
    compute_type='float16',
)

# Get audio features
feature_extractor = FeatureExtractor()
features = feature_extractor(audio)

# Detect language
encoded_feature_seg = encode(model, features[:, : feature_extractor.nb_max_frames])
language, _ = detect_language(encoded_feature_seg, model)

# Setup tokenizer
hf_tokenizer = tokenizers.Tokenizer.from_pretrained(
    "openai/whisper-tiny" + ("" if model.is_multilingual else ".en")
)
tokenizer = Tokenizer(
    hf_tokenizer,
    model.is_multilingual,
    task='transcribe',
    language=language,
)

# Tokenize ground-truth
transcription_tokens = tokenizer.encode(transcription)

# Segment audio
audio_feature_segments = [features[:, seek: seek + feature_extractor.nb_max_frames]
                          for seek in range(0, features.shape[1], feature_extractor.nb_max_frames)]

# Stuck here: how should the transcription be segmented into chunks that
# align with the audio segments above?
# And how can I use WhisperModel.find_alignment to obtain the alignment?
# Extracting its code into a separate function is not elegant.

The main issue is how to split the ground-truth transcription into small segments that line up with the audio segments. Extracting the code of WhisperModel.find_alignment into a standalone function is not ideal, but I can at least give it a try.
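For a short clip that fits in a single 30-second window, the rough direction I would try looks something like the sketch below. It is unverified and based only on my reading of find_alignment; it assumes the ctranslate2 Whisper model exposes the align method that find_alignment calls internally, and that the whole ground-truth transcription belongs to the first audio segment.

# Unverified sketch: call the ctranslate2 alignment routine directly,
# reusing the variables defined above.
encoder_output = encode(model, audio_feature_segments[0])
num_frames = audio_feature_segments[0].shape[-1]

result = model.align(
    encoder_output,
    tokenizer.sot_sequence,   # start-of-transcript prompt tokens
    [transcription_tokens],   # one token list per batch item
    num_frames,
    median_filter_width=7,
)[0]

# If I read the code correctly, result.alignments pairs text-token indices
# with encoder time steps (about 0.02 s each), and result.text_token_probs
# holds the per-token probabilities.
print(result.alignments)
print(result.text_token_probs)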

May I ask for your suggestions? Thank you for your fantastic project!

chainyo commented 1 year ago

Hi @zh-plus, this function is used automatically when you set the word_timestamps parameter: an alignment is computed before the word timings are returned.

On my end, I found that enabling the word_timestamps parameter by default gives outputs with much better quality and precision. You can always filter out the word-level details at the end if you don't need them.

Btw, you have to go through the regular transcribe function, which may not be what you are looking for.
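For reference, a minimal example of what that looks like (an untested sketch; segment and word attributes as documented in the faster-whisper README):

from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# word_timestamps=True runs the internal alignment pass, so each segment
# carries word-level start/end times.
segments, info = model.transcribe("test_audio.wav", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print(f"[{word.start:.2f}s -> {word.end:.2f}s] {word.word}")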