NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Whisper Timestamps #647

Open makaveli10 opened 9 months ago

makaveli10 commented 9 months ago

Hello, the current decoder in the Whisper example uses the FasterTransformer DynamicDecoder, which I believe does not output timestamp tokens for Whisper. Are there any plans to support the timestamp feature for Whisper?

Thanks

yuekaizhang commented 9 months ago

Hi, thank you for your interest in the timestamp feature for Whisper. Currently, we do not have plans to integrate the official Whisper timestamping directly, mainly due to a couple of considerations:

  1. As you mentioned, the current FasterTransformer DynamicDecoder design is tailored towards a GenerationSession, which differs from the session type required for Whisper timestamping.
  2. Additionally, OpenAI's implementation relies on a custom Triton kernel for the Dynamic Time Warping (DTW) alignment step, which would also need to be integrated.
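
For reference, the DTW step itself is conceptually simple. This is not TensorRT-LLM or OpenAI code, just a minimal numpy sketch assuming you already have a token-by-frame cost matrix (in Whisper's implementation this is derived from the decoder's cross-attention weights; the Triton kernel exists to compute the same path efficiently on GPU):

```python
import numpy as np

def dtw_path(cost):
    """Minimum-cost monotonic alignment through a cost matrix.

    Rows index text tokens, columns index audio frames. Returns the
    list of (token, frame) pairs on the optimal path; the frame at
    which each token first appears gives its timestamp (Whisper
    encoder frames are 20 ms apart).
    """
    n, m = cost.shape
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    # Forward pass: each cell extends the cheapest of its three predecessors
    # (diagonal, up, left), which enforces a monotonic alignment.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(
                D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]
            )
    # Backtrack from the bottom-right corner to recover the path.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while i > 1 or j > 1:
        _, i, j = min(
            (D[i - 1, j - 1], i - 1, j - 1),
            (D[i - 1, j], i - 1, j),
            (D[i, j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    path.reverse()
    return path
```

The O(n·m) dynamic program is cheap; the expensive part in practice is producing well-behaved attention-based costs, which is where the accuracy caveats below come from.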

However, it is worth noting that there are alternative ways to obtain word-level timestamps from transcripts, and they are more efficient and reliable than Whisper's decoder (encoder-decoder cross-attention scores may not be accurate for word timestamps).

For example, the Montreal Forced Aligner (MFA) can align a transcript with audio at the word level; the workflow is well documented: Montreal Forced Aligner Documentation.

Another option is the Forced Alignment with Wav2Vec2 provided by PyTorch Audio, which is detailed in their tutorial here: Forced Alignment with Wav2Vec2 — Torchaudio 2.1.1 documentation.
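
To make the second option concrete: the torchaudio tutorial builds a trellis over frame-level CTC log-probabilities and backtracks to find when each transcript token is emitted. The sketch below is not the torchaudio API itself, just a simplified pure-numpy version of that idea (a single "blank or advance" trellis; `0.02` assumes wav2vec2's 20 ms frame stride):

```python
import numpy as np

def forced_align(log_probs, tokens, blank=0):
    """Align a known token sequence to frame-level log-probabilities.

    log_probs: (T, vocab) array of per-frame log-probabilities.
    tokens: the transcript as token ids (assumed correct).
    Returns the frame index at which each token is emitted.
    """
    T, J = len(log_probs), len(tokens)
    trellis = np.full((T + 1, J + 1), -np.inf)
    trellis[0, 0] = 0.0
    advanced = np.zeros((T + 1, J + 1), dtype=bool)  # backpointers
    for t in range(T):
        for j in range(J + 1):
            if trellis[t, j] == -np.inf:
                continue
            # Stay: emit blank at frame t, keep the same transcript position.
            stay = trellis[t, j] + log_probs[t][blank]
            if stay > trellis[t + 1, j]:
                trellis[t + 1, j] = stay
                advanced[t + 1, j] = False
            # Advance: emit the next transcript token at frame t.
            if j < J:
                adv = trellis[t, j] + log_probs[t][tokens[j]]
                if adv > trellis[t + 1, j + 1]:
                    trellis[t + 1, j + 1] = adv
                    advanced[t + 1, j + 1] = True
    # Backtrack: record the frame of every "advance" step.
    frames, j = [], J
    for t in range(T, 0, -1):
        if advanced[t, j]:
            frames.append(t - 1)
            j -= 1
    frames.reverse()
    return frames

# Hypothetical usage: frame index * 0.02 would give the time in seconds
# for a wav2vec2-style model with a 20 ms frame stride.
```

Because the transcript is fixed, this search is linear in audio length and independent of the ASR model that produced the text, which is why it tends to be more robust than attention-based timestamps.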

makaveli10 commented 9 months ago

Thanks for the swift response, I'll test these approaches. My concern is that running another aligner on top of Whisper would add latency to the transcription, which would be troublesome. Anyway, I will test both and report back here.

shashikg commented 7 months ago

Hi @makaveli10, thought this might be helpful for your evaluation: https://github.com/shashikg/WhisperS2T/releases/tag/v1.2.0

I ran a similar experiment comparing the inference time of the wav2vec2 forced aligner (whisperX) against Whisper's DTW-based alignment. I believe Whisper also provides decent alignment accuracy if you're not aiming for very high resolution (e.g., accuracy at the 100 ms level). In my experiments, using whisper large-v2 for transcription and whisper-tiny for alignment worked best.