This is the only remaining feature before we are feature-complete with respect to the OpenAI API. We will implement this before 1.0. Thank you for bringing this up!
Note that the Core ML decoder is currently unable to process multiple tokens in a single forward pass, so we need to decode the prompt one token at a time. In the short term, this will likely cause a slowdown for long prompts. Once we bring up the MLX backend, it shouldn't be a problem at all.
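For illustration, here is a rough sketch of what that per-token prefill loop looks like. `PromptDecoder` and `KVCache` are hypothetical stand-ins for whatever the Core ML decoder wrapper actually exposes, not real WhisperKit types:

```swift
// Rough sketch only: `PromptDecoder` and `KVCache` are hypothetical
// stand-ins for the real Core ML decoder wrapper.
protocol PromptDecoder {
    associatedtype KVCache
    func emptyCache() -> KVCache
    // Runs a single-token forward pass and returns the updated key/value cache.
    func forward(token: Int, cache: KVCache) -> KVCache
}

// The Core ML decoder accepts one token per forward pass, so prefilling
// the cache for an N-token prompt costs N sequential decoder calls.
func prefillCache<D: PromptDecoder>(promptTokens: [Int], decoder: D) -> D.KVCache {
    var cache = decoder.emptyCache()
    for token in promptTokens {
        cache = decoder.forward(token: token, cache: cache)
    }
    return cache
}
```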
Yep, good callout @ldenoue. This is definitely needed for parity, and we have been tracking todos for when it becomes available.
We built a look-up table to address this for common task and language combinations (the `usePrefillCache` option), but arbitrary text prompts will require either generating the cache one token at a time, as @atiorh mentioned, or a new model that can generate prompt caches in a single forward pass, which will likely come from integrating MLX (#33). See this thread for a similar discussion of the issue: https://github.com/huggingface/transformers/issues/23845#issue-1731010774
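For reference, this is roughly how the existing look-up-table path is used today. Only `usePrefillCache` is named in this thread; `usePrefillPrompt` and the overall call shape are assumptions for illustration:

```swift
import WhisperKit

// Sketch of today's behavior: the precomputed look-up table covers common
// task/language combinations. `usePrefillPrompt` and the call shape are
// assumptions; only `usePrefillCache` is confirmed in this thread.
let pipe = try await WhisperKit(model: "base")
let options = DecodingOptions(
    task: .transcribe,
    language: "en",
    usePrefillPrompt: true, // seed the forced task/language tokens
    usePrefillCache: true   // reuse the precomputed KV cache for them
)
let result = try await pipe.transcribe(audioPath: "audio.wav", decodeOptions: options)
```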
In the meantime, the simplest way to go about this would be opening up the prefill prompt tokens to be set via `DecodingOptions` directly. That would enable arbitrary prompts, including `startofprev` and the custom vocabulary words you requested, but would still require a forward pass for each token. What do you think?
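To make the proposal concrete, here is a minimal sketch of what that could look like. The `promptTokens` field is hypothetical and does not exist on `DecodingOptions` today, and the example names follow the spirit of the OpenAI prompting guide linked in this thread:

```swift
import WhisperKit

// Hypothetical sketch of the proposal: `promptTokens` is NOT a real
// DecodingOptions field yet; everything here is for illustration only.
let pipe = try await WhisperKit(model: "base")

// Correctly spelled names in the prompt bias the decoder toward those
// spellings (the technique from the OpenAI prompting guide).
let promptText = " Aimee Mullins, ZyntriQix, Digique Plus."
let promptTokens = pipe.tokenizer?.encode(text: promptText) ?? []

var options = DecodingOptions(task: .transcribe, language: "en")
options.promptTokens = promptTokens // proposed field: tokens would follow <|startofprev|>

let result = try await pipe.transcribe(audioPath: "audio.wav", decodeOptions: options)
```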
As with OpenAI's Whisper, is it possible to pass a text prompt that could be used to improve the quality of the transcript?
See https://cookbook.openai.com/examples/whisper_prompting_guide#pass-names-in-the-prompt-to-prevent-misspellings
Laurent