Add Whisper feature extraction methods

iyaja commented 1 year ago

This PR adds methods to obtain Whisper input features, embeddings, and transcripts from an AudioSignal.

Introduces the WhisperMixin class with methods to set up the Whisper model, obtain input features, generate transcripts, and extract embeddings.
The setup_whisper() method initializes the Whisper model and processor using the transformers library.
The get_whisper_features() method resamples the input signal to the required sampling rate, processes the raw speech, and returns the input features.
The get_whisper_embeddings() method extracts the embeddings from the input features using the Whisper encoder.
The get_whisper_transcript() method generates the transcript from the input features and decodes it into text.
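
For readers without the diff open, the four methods above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the checkpoint name `openai/whisper-base.en`, the `_whisper_ready` flag, and the host hook `mono_samples_at(rate)` (standing in for whatever mono conversion and resampling the host class provides) are all hypothetical. Only the `transformers` calls are taken from that library.

```python
class WhisperMixin:
    """Sketch of a mixin that adds Whisper helpers to a host audio class.

    Assumption: the host class defines `mono_samples_at(rate)`, a hypothetical
    hook returning the audio as a 1-D float array resampled to `rate` Hz.
    """

    _whisper_ready = False  # hypothetical flag; flipped by setup_whisper()

    def setup_whisper(self, model_name="openai/whisper-base.en", device="cpu"):
        # Imported lazily so the host class still works without transformers.
        from transformers import WhisperForConditionalGeneration, WhisperProcessor

        self.whisper_processor = WhisperProcessor.from_pretrained(model_name)
        self.whisper_model = WhisperForConditionalGeneration.from_pretrained(
            model_name
        ).to(device)
        self.whisper_device = device
        self._whisper_ready = True

    def get_whisper_features(self):
        # Load the model on first use if the caller did not do so explicitly.
        if not self._whisper_ready:
            self.setup_whisper()
        # Whisper's feature extractor expects a fixed sampling rate.
        rate = self.whisper_processor.feature_extractor.sampling_rate
        raw_speech = self.mono_samples_at(rate)  # hypothetical host hook
        inputs = self.whisper_processor(
            raw_speech, sampling_rate=rate, return_tensors="pt"
        )
        return inputs.input_features

    def get_whisper_embeddings(self):
        # Run only the encoder and return its final hidden states.
        features = self.get_whisper_features().to(self.whisper_device)
        encoder = self.whisper_model.get_encoder()
        return encoder(features).last_hidden_state

    def get_whisper_transcript(self):
        # Generate token ids, then decode them into text.
        features = self.get_whisper_features().to(self.whisper_device)
        token_ids = self.whisper_model.generate(features)
        decoded = self.whisper_processor.batch_decode(
            token_ids, skip_special_tokens=True
        )
        return decoded[0]
```

With a class that inherits from this mixin and implements the hook, a call such as `signal.get_whisper_transcript()` would load the model on first use and return a string. Keeping the `transformers` import inside `setup_whisper()` means the dependency is only needed by callers who actually use Whisper.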

With these methods, audiotools users can run Whisper transcription and extract Whisper features and embeddings directly from an AudioSignal.

sotelo commented 1 year ago

@pseeth could you also review this PR? There are things for which I lack context, for instance whether we should implement this as a Mixin class.

sotelo commented 1 year ago

Thank you @pseeth !

descriptinc / audiotools