Hi @crazoter: +1 for a whisperx comparison as well. @AvivNavon
@AvivNavon Thanks for the contribution. Is it possible to convert this model to the CTranslate2 format, like the Hugging Face Whisper models?
I observed that this implementation suffers from lower accuracy (higher WER) than other implementations. The major transcription errors happen towards the end of the segment. What parameters can be fed to the model? Is it possible to use the `without_timestamps` and `max_len` parameters? I think these can have an effect on the final WER.
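(For reference, the kind of WER comparison I have in mind can be sketched with the `jiwer` package; the strings below are only placeholders, with the real references coming from the dataset transcripts and the hypotheses from each implementation's output.)

```python
import jiwer

# Placeholder strings: in practice the reference is the LibriSpeech transcript
# and the hypothesis is the output of the implementation being evaluated.
reference = "he hoped there would be stew for dinner".lower()
hypothesis = "he hoped there would be stew for din".lower()

# Word error rate between reference and hypothesis.
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")
```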
Thanks in advance!
Hi @Jiltseb, since the default `max_len` parameter for Hugging Face is quite large (448), we approximated a linear relationship between audio duration and the number of predicted tokens using the dev-clean and dev-other datasets. We then used this approximate line, with an added buffer, to set the `max_len` parameter when evaluating both our Medusa model and the Whisper model.
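To illustrate the idea, here is a rough numpy sketch; the duration/token pairs, fitted coefficients, and buffer below are illustrative placeholders rather than the values that were actually fitted:

```python
import numpy as np

# Illustrative placeholder pairs of (audio duration in seconds, number of
# predicted tokens), e.g. as collected from dev-clean / dev-other.
durations = np.array([2.0, 5.0, 10.0, 20.0, 30.0])
token_counts = np.array([18.0, 42.0, 85.0, 168.0, 250.0])

# Fit tokens ~= slope * duration + intercept.
slope, intercept = np.polyfit(durations, token_counts, deg=1)

def estimate_max_len(duration_sec: float, buffer_tokens: int = 20) -> int:
    """Estimate max_len for a clip of this duration, with a safety buffer."""
    return int(np.ceil(slope * duration_sec + intercept)) + buffer_tokens

print(estimate_max_len(12.5))
```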
Thanks, @YaelSegalFeldman. This would be fine when testing with audio spoken at a normal rate, but it can cause issues when the syllables/sec rate is higher. Is there a way for users to configure `max_len` or to get the end-of-sentence token?
@Jiltseb, each user can control the `max_len` parameter by setting it when calling the `generate` function. Here is a code example:
model_output = model.generate(
    input_features,
    language=language,
    max_length=user_max_len,  # caps the number of generated tokens
)
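For completeness, here is a self-contained sketch of the same call together with the standard Hugging Face Whisper pre- and post-processing. The `openai/whisper-large-v2` checkpoint and the dummy LibriSpeech sample are stand-ins so the snippet runs on its own; swap in the Medusa checkpoint and its loading code from the repo's README:

```python
import torch
from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# openai/whisper-large-v2 is a stand-in so the snippet runs on its own;
# swap in the Medusa checkpoint and its loading code from the repo's README.
model_id = "openai/whisper-large-v2"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# One 16 kHz LibriSpeech sample used as example audio.
sample = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)[0]["audio"]

input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features

user_max_len = 128  # illustrative cap on the number of generated tokens
with torch.no_grad():
    model_output = model.generate(
        input_features,
        language="en",
        max_length=user_max_len,
    )

# The end-of-sequence token id, if needed, can be read from
# model.generation_config.eos_token_id or processor.tokenizer.eos_token_id.
print(processor.batch_decode(model_output, skip_special_tokens=True)[0])
```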
I'm curious to know, based on what is written in the GitHub README:

> various optimization strategies like Faster-Whisper and Speculative Decoding have been proposed to enhance performance.

How does whisper-medusa fare against these existing alternatives (i.e. what is the motivation behind this research as opposed to contributing to these existing open-source options)? Have you also tried applying the optimisations from these other libraries to whisper-medusa?

> We train and evaluate our model on the LibriSpeech dataset, demonstrating strong performance with both speed and accuracy improvements.

Are you able to share the benchmarking results, as well as the context in which these benchmarks were performed (e.g. English only vs. multilingual, model size, etc.)? Thank you!