argmaxinc / WhisperKit

On-device Speech Recognition for Apple Silicon
http://argmaxinc.com/blog/whisperkit
MIT License
3.92k stars 331 forks

[FEATURE REQUEST] custom prompt to pass vocabulary words #53

Closed ldenoue closed 7 months ago

ldenoue commented 8 months ago

Like OpenAI's Whisper, is it possible to pass a text prompt which could be used to improve the quality of the transcript?

See https://cookbook.openai.com/examples/whisper_prompting_guide#pass-names-in-the-prompt-to-prevent-misspellings

Laurent

atiorh commented 8 months ago

This is the only remaining feature before we are feature-complete with respect to the OpenAI API. We will implement this before 1.0. Thank you for bringing this up!

atiorh commented 8 months ago

Note that the Core ML decoder currently cannot process multiple tokens in a single forward pass, so we need to "decode the prompt" one token at a time. This will likely cause a slowdown in the short term for long prompts. Once we bring up the MLX backend, it shouldn't be a problem at all.
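For illustration, the token-by-token prompt decoding described above can be sketched in a few lines of Swift. `TinyDecoder` and its integer cache are stand-ins for the real Core ML decoder and its key/value cache, not WhisperKit's actual API:

```swift
// Minimal sketch: prefilling a decoder cache one token at a time.
// TinyDecoder is hypothetical; it only models the shape of the loop.
struct TinyDecoder {
    // Stands in for the key/value cache the real decoder accumulates.
    var kvCache: [Int] = []

    // One forward pass per token: extends the cache by exactly one entry.
    mutating func forward(token: Int) -> Int {
        kvCache.append(token)
        return kvCache.count
    }

    // "Decode the prompt" one token at a time, as described above.
    // A decoder that accepted the whole prompt at once could do this
    // in a single forward pass instead of N of them.
    mutating func prefill(promptTokens: [Int]) {
        for token in promptTokens {
            _ = forward(token: token)
        }
    }
}
```

The cost is linear in prompt length, which is why long prompts would be slower until a backend that supports multi-token forward passes (e.g. MLX) is available.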

ZachNagengast commented 8 months ago

Yep, good callout @ldenoue, this is definitely needed for parity, and we have been tracking todos for when it is available.

We built a look-up table to address this for the common task and language combinations, exposed as the usePrefillCache option. Arbitrary text prompts, however, will require either generating the cache one token at a time as @atiorh mentioned, or a new model that can generate prompt caches in a single forward pass, which will likely come from integrating MLX #33. See this thread for a similar discussion of the issue: https://github.com/huggingface/transformers/issues/23845#issue-1731010774

In the meantime, the simplest way to go about this would be to open up the prefill prompt tokens so they can be set via DecodingOptions directly. That would enable arbitrary prompts, including startofprev and the custom vocabulary words you requested, but would require a forward pass for each token. What do you think?
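To make the proposal concrete, here is a hedged sketch of what setting prompt tokens through decoding options could look like. The struct and field names below (`DecodingOptions`, `prefillPromptTokens`) and all token IDs are placeholders for this discussion, not WhisperKit's actual API:

```swift
// Hypothetical sketch of exposing prefill prompt tokens via options.
// This stand-in struct only mirrors the idea, not the real library type.
struct DecodingOptions {
    var usePrefillCache: Bool = true
    // Proposed addition: arbitrary tokens to decode before transcription,
    // analogous to OpenAI's `prompt` parameter.
    var prefillPromptTokens: [Int]? = nil
}

// Placeholder token IDs; real values come from the Whisper tokenizer.
let startOfPrev = 50361              // stand-in ID for <|startofprev|>
let vocabTokens = [1234, 5678]       // stand-in IDs for custom vocabulary

let options = DecodingOptions(
    usePrefillCache: false,          // bypass the look-up table path
    prefillPromptTokens: [startOfPrev] + vocabTokens
)
```

The decoder would then consume `prefillPromptTokens` one token per forward pass (per the Core ML constraint above) before emitting transcript tokens.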