aiola-lab / whisper-medusa

Whisper with Medusa heads

Inquiry: Comparison with alternatives like faster-whisper #7

Closed: crazoter closed this issue 1 month ago

crazoter commented 3 months ago

I'm curious to know, based on what is written in the GitHub README:

  1. The README says that "various optimization strategies like Faster-Whisper and Speculative Decoding have been proposed to enhance performance." How does whisper-medusa fare against these existing alternatives? In other words, what is the motivation behind this research, as opposed to contributing to these existing open-source options? Have you also tried applying the optimizations from these other libraries to whisper-medusa?
  2. The README also says: "We train and evaluate our model on the LibriSpeech dataset, demonstrating strong performance with both speed and accuracy improvements." Can you share the benchmarking results, as well as the context under which these benchmarks were performed (e.g. English only vs. multilingual, model size, etc.)?

Thank you!

AvivNavon commented 3 months ago

Hi @crazoter:

  1. faster-whisper provides the fastest alternative, with up to 4x faster decoding compared to openai/whisper, generally with some loss of accuracy. Our implementation provides only ~1.5x. We added the full details to the README. We hope that in the future, Whisper Medusa can be combined with faster-whisper to achieve an even greater speedup. Additionally, we keep working on improving the model, e.g. by increasing the number of Medusa heads.
  2. The previous statement we had in the README was not accurate; it is fixed now. The currently released model achieves on-par WER on the LibriSpeech dataset (4.2% vs. 4.0% for Whisper-large-v2). We are working on training and evaluating models on other datasets as well.

Niko-La commented 3 months ago

+1 for a whisperx comparison as well. @AvivNavon

Jiltseb commented 2 months ago

@AvivNavon Thanks for the contribution. Is it possible to convert this model to CTranslate2 format, like the Hugging Face Whisper models?

I observed that this implementation suffers from lower accuracy (higher WER) than other implementations. The major transcription errors happen towards the end of the segment. What parameters can be fed to the model? Is it possible to use the without_timestamps and max_len parameters? I think these can have an effect on the final WER.

Thanks in advance!

YaelSegalFeldman commented 2 months ago

Hi @Jiltseb, since the default max_len parameter for Hugging Face is quite large (448), we approximated a linear relationship between audio duration and the number of predicted tokens using the LibriSpeech dev-clean and dev-other sets. We then used this fitted line, with an added buffer, to set the max_len parameter when evaluating both our Medusa model and the Whisper model.
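
For concreteness, here is a minimal sketch of how such a duration-based cap could be computed; the numbers, the estimate_max_len helper, and the choice of numpy.polyfit are illustrative assumptions, not the repo's actual code:

import numpy as np

# Hypothetical dev-set statistics: audio durations (seconds) and the
# number of tokens the model produced for each utterance.
durations = np.array([3.2, 7.5, 12.1, 18.4, 25.0])
token_counts = np.array([28, 61, 95, 140, 190])

# Least-squares linear fit: tokens ~ slope * duration + intercept
slope, intercept = np.polyfit(durations, token_counts, deg=1)

def estimate_max_len(duration_sec, buffer_tokens=20):
    # Predicted token count for this duration, plus a safety buffer.
    return int(slope * duration_sec + intercept) + buffer_tokens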

Jiltseb commented 2 months ago

Thanks, @YaelSegalFeldman. This would be fine when testing with audio at a normal speaking rate, but it can cause issues when the syllables/sec rate becomes larger. Is there a way for users to configure max_len or to get the end-of-sentence token?

YaelSegalFeldman commented 2 months ago

@Jiltseb, each user can control the maximum generation length by setting max_length when calling the generate function. Here is a code example:

model_output = model.generate(
    input_features,           # log-mel features from the Whisper processor
    language=language,
    max_length=user_max_len,  # user-chosen cap on the number of generated tokens
)
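
As a usage note, user_max_len could, for example, be derived from the duration-based estimate described earlier in this thread (again assuming the hypothetical estimate_max_len helper from the sketch above):

user_max_len = estimate_max_len(duration_sec=audio_duration_sec)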