elixir-nx / bumblebee

Pre-trained Neural Network models in Axon (+ 🤗 Models integration)
Apache License 2.0

Add support for Whisper timestamps and task/language configuration #238

Closed jonatanklosko closed 10 months ago

jonatanklosko commented 10 months ago

This adds the following options to the speech-to-text serving: :timestamps, :language, :task. By default no language is assumed and the model infers it on its own.
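For reference, a minimal sketch of how the new options might be passed. It assumes a Bumblebee release that includes this change, the `openai/whisper-tiny` checkpoint, and one second of silence as placeholder input; the option values are only illustrative.

```elixir
# Sketch only: the checkpoint and the option values below are assumptions.
repo = {:hf, "openai/whisper-tiny"}

{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, featurizer} = Bumblebee.load_featurizer(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

serving =
  Bumblebee.Audio.speech_to_text_whisper(
    model_info,
    featurizer,
    tokenizer,
    generation_config,
    # Leave :language out to let the model infer it on its own.
    language: "en",
    task: :transcribe,
    timestamps: :segments
  )

# Placeholder input: 16_000 zero samples, i.e. one second of silence at 16 kHz.
audio = Nx.broadcast(0.0, {16_000})

Nx.Serving.run(serving, audio)
```

Omitting `:language` should fall back to the default described above, where the model detects the language itself.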

I deprecated Audio.speech_to_text in favour of Audio.speech_to_text_whisper. Initially I added extra_options: [...] to the serving and delegated some of the logic to the Text.Generation behaviour that Audio.Whisper implements, but that could be confusing from the user's perspective, since they would need to look up options in other modules. Also, there were still some Whisper-specific bits, so I think it's most practical to have a separate serving.

We have the %Text.GenerationConfig{} struct for loading generation options (sequence length, some token information, sampling options), but Whisper has a number of specific options of its own. I don't think it makes sense to add those fields to the generic struct, so instead we load the config as %Text.GenerationConfig{extra_config: %Text.WhisperGenerationConfig{}}.
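To illustrate the nesting idea, here is a self-contained sketch. The `Sketch.*` modules, their fields, and the token ids are all hypothetical stand-ins, not Bumblebee's actual definitions.

```elixir
# Hypothetical stand-in for the generic generation config.
defmodule Sketch.GenerationConfig do
  # Options shared by all generation models, plus a slot for model-specific ones.
  defstruct max_new_tokens: 20, eos_token_id: nil, extra_config: nil
end

# Hypothetical stand-in for the Whisper-specific part.
defmodule Sketch.WhisperGenerationConfig do
  defstruct language_to_token_id: %{}, task_to_token_id: %{}
end

# The token ids here are made-up placeholders.
config = %Sketch.GenerationConfig{
  extra_config: %Sketch.WhisperGenerationConfig{
    language_to_token_id: %{"en" => 1},
    task_to_token_id: %{transcribe: 2, translate: 3}
  }
}

# A Whisper-specific serving can pattern match on the nested struct,
# while generic generation code only reads the outer fields.
%Sketch.GenerationConfig{extra_config: %Sketch.WhisperGenerationConfig{} = whisper} = config

IO.inspect(whisper.task_to_token_id[:transcribe], label: "transcribe token id")
IO.inspect(config.max_new_tokens, label: "max new tokens")
```

This keeps the generic struct free of Whisper-only fields, and a serving for another model can simply leave `extra_config` as `nil`.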

jonatanklosko commented 10 months ago

Are we good to ship a new Nx too? :)

I will look at streaming next, so if we want to play it safe, we can wait for that :)