cod3r0k opened 5 days ago
You have to convert it to CT2 first. There are several converters available; you can check the CT2 documentation for more information.
Great, can you help me more? What is CT2? @MahmoudAshraf97
The backend of Faster Whisper https://github.com/OpenNMT/CTranslate2/
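For reference, CT2 ships its converters both as CLI tools (e.g. ct2-transformers-converter) and as a Python API. A minimal sketch of the Python route, assuming the model is already in Hugging Face format; check the CT2 docs for the current interface:

import ctranslate2

# Convert a Hugging Face Whisper checkpoint into a CTranslate2 model directory.
converter = ctranslate2.converters.TransformersConverter("openai/whisper-large")
converter.convert("whisper_large_ct2", quantization="float16")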
Great. You mean that I should do it as below?

First, with transformers:

# Load the base Hugging Face Whisper model and processor.
from transformers import WhisperProcessor, WhisperForConditionalGeneration
processor = WhisperProcessor.from_pretrained("openai/whisper-large")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")
# Save the model in Hugging Face format
model.save_pretrained("whisper_huggingface")
processor.save_pretrained("whisper_huggingface")
Then, in SpeechBrain, we load the trained model and extract its weights:
import torch
from speechbrain.pretrained import WhisperASR

whisper = WhisperASR.from_hparams(source="speechbrain/whisper-large", savedir="tmp_whisper")

# Extract the underlying module and save its weights
model = whisper.modules.model
torch.save(model.state_dict(), "speechbrain_whisper_weights.pth")
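As an aside, depending on the SpeechBrain version, the wrapped module may itself be a Hugging Face transformers model; if it exposes save_pretrained, you can skip the manual weight mapping below entirely. A hedged sketch (the whisper.modules.model attribute path is taken from the snippet above, not verified):

# Inspect what the SpeechBrain wrapper actually holds.
inner = whisper.modules.model
print(type(inner))

# Hypothetical shortcut: if it is a transformers model, save it in HF format directly.
if hasattr(inner, "save_pretrained"):
    inner.save_pretrained("hf_whisper_converted")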
from transformers import WhisperForConditionalGeneration
# Load the Hugging Face Whisper model
hf_whisper = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")
Next, map the weights:
import torch
# Load SpeechBrain weights
speechbrain_weights = torch.load("speechbrain_whisper_weights.pth")
# Load Hugging Face model weights
hf_model_state_dict = hf_whisper.state_dict()
# Map weights from SpeechBrain to Hugging Face
mapped_weights = {}
for name, param in hf_model_state_dict.items():
    # Replace this mapping logic with the exact alignment of layers
    if name in speechbrain_weights:
        mapped_weights[name] = speechbrain_weights[name]
    else:
        mapped_weights[name] = param  # Use original HF weights if no match

# Update Hugging Face model with the mapped weights
hf_whisper.load_state_dict(mapped_weights)
# Save the updated model
hf_whisper.save_pretrained("hf_whisper_converted")
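One caveat: the name-matching loop above silently falls back to the original HF weights whenever the SpeechBrain names differ, so a total mismatch can go unnoticed. A minimal diagnostic sketch; the "model." prefix here is an assumption, inspect your checkpoint's actual keys first:

# Compare parameter names between the two checkpoints.
sb_keys = set(speechbrain_weights.keys())
hf_keys = set(hf_model_state_dict.keys())
print(sorted(sb_keys - hf_keys)[:10])  # only in the SpeechBrain checkpoint
print(sorted(hf_keys - sb_keys)[:10])  # only in the HF model

# Hypothetical remapping: adjust a wrapper prefix if that is the only difference
# (the "model." prefix is an assumption; check the printed keys above).
remapped = {k.removeprefix("model."): v for k, v in speechbrain_weights.items()}
missing, unexpected = hf_whisper.load_state_dict(remapped, strict=False)
print(f"missing: {len(missing)}, unexpected: {len(unexpected)}")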
Verify the result:
from transformers import WhisperProcessor
import librosa

processor = WhisperProcessor.from_pretrained("openai/whisper-large")

# The processor expects a waveform array, not a file path, so load the audio first.
audio_path = "path_to_audio.wav"
audio, _sr = librosa.load(audio_path, sr=16000)

inputs = processor(audio, return_tensors="pt", sampling_rate=16000)
generated_ids = hf_whisper.generate(**inputs)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(f"Transcription: {transcription}")
Then run the converter:

ct2-transformers-converter --model hf_whisper_converted --output_dir whisper_ctranslate2 --copy_files tokenizer.json preprocessor_config.json --quantization float16
Am I doing this correctly?
Exactly. If the model you have is not in Hugging Face format, you need to convert it to that format first, then to CT2.
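Once the CT2 directory exists, faster-whisper can load it directly. A minimal sketch, assuming the whisper_ctranslate2 output directory from the converter command above:

from faster_whisper import WhisperModel

# Point WhisperModel at the converted CTranslate2 directory.
model = WhisperModel("whisper_ctranslate2", device="cpu", compute_type="int8")

segments, info = model.transcribe("path_to_audio.wav")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")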
Can I serve a SpeechBrain-trained Whisper model with Faster Whisper?