This adds the following options to the speech-to-text serving: `:timestamps`, `:language`, `:task`. By default no language is assumed and the model infers it on its own.
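As a rough sketch of how these options might be passed, assuming the modules here are Bumblebee's (`Bumblebee.Audio`) and that the option values match those in released Bumblebee versions, which may differ from the state of this PR. The checkpoint name, the audio path, and the dependency versions are placeholders, and reading from a file assumes `ffmpeg` is installed:

```elixir
# Dependency versions are assumptions; adjust to your project.
Mix.install([{:bumblebee, "~> 0.5"}, {:exla, ">= 0.0.0"}])

repo = {:hf, "openai/whisper-tiny"}

{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, featurizer} = Bumblebee.load_featurizer(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

serving =
  Bumblebee.Audio.speech_to_text_whisper(model_info, featurizer, tokenizer, generation_config,
    # Omit :language to let the model infer it.
    language: "en",
    task: :transcribe,
    timestamps: :segments,
    defn_options: [compiler: EXLA]
  )

# Placeholder path; point this at a real audio file.
result = Nx.Serving.run(serving, {:file, "/path/to/audio.wav"})
IO.inspect(result)
```

Leaving out `:language` exercises the default described above, where the model detects the language itself.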
I deprecated `Audio.speech_to_text` in favour of `Audio.speech_to_text_whisper`. Initially I added `extra_options: [...]` to the serving and delegated some of the logic to the `Text.Generation` behaviour that `Audio.Whisper` implements, but that could be confusing from the user's perspective, since they would need to look up options in other modules. Also, there were still some Whisper-specific bits, so I think it's most practical to have a separate serving.
We have the `%Text.GenerationConfig{}` struct for loading generation options (sequence length, some token information, sampling options), but Whisper has a number of specific options of its own. I don't think it makes sense to add those fields to the generic struct, so instead we load the config as `%Text.Generation{extra_config: %Text.WhisperGeneration{}}`.
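To illustrate the nesting idea in isolation, here is a minimal sketch using stand-in structs. The `Sketch.*` modules, their field names, and the token ids are all hypothetical and are not the structs from this PR:

```elixir
defmodule Sketch.WhisperGenerationConfig do
  # Hypothetical Whisper-only fields; names are illustrative.
  defstruct language_to_token_id: %{}, task_to_token_id: %{}
end

defmodule Sketch.GenerationConfig do
  # Generic options shared by all generation models, plus a slot
  # for model-specific options.
  defstruct max_new_tokens: 20, extra_config: nil
end

config = %Sketch.GenerationConfig{
  max_new_tokens: 448,
  extra_config: %Sketch.WhisperGenerationConfig{
    # Made-up token ids, for illustration only.
    language_to_token_id: %{"en" => 1, "fr" => 2},
    task_to_token_id: %{transcribe: 10, translate: 11}
  }
}

# Whisper-specific code can match on the nested struct, while generic
# generation code only reads the top-level fields.
%Sketch.GenerationConfig{extra_config: %Sketch.WhisperGenerationConfig{} = whisper_config} = config

language_token_id = Map.fetch!(whisper_config.language_to_token_id, "en")
IO.inspect(language_token_id)
```

The generic struct stays unchanged for other models, which simply leave `extra_config` as `nil`.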