I wanted to deploy seamless-m4t-v2 and tested it on some German LibriVox recordings with clear speech and no background noise.
Unfortunately, the transcription fails: apart from a few words, nothing is transcribed. I also ran the same file through the model's hosted inference on Hugging Face and got the same result.
With Whisper-v2, however, the same file transcribes just fine. I find this behaviour very strange, since the audio quality is really good. Everything was resampled to 16 kHz.
My minimal reproducer looks like this:
from transformers import SeamlessM4Tv2ForSpeechToText, AutoProcessor
import torchaudio
import time

# Load the processor and the speech-to-text model, then move the model to the GPU.
processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
model = SeamlessM4Tv2ForSpeechToText.from_pretrained("facebook/seamless-m4t-v2-large")
model.to("cuda:0")

# Load the 30 s snippet and resample it to 16 kHz.
aud, sr = torchaudio.load("/home/.../Music/1_30.wav")
transform = torchaudio.transforms.Resample(sr, 16000)
aud = transform(aud)

# Extract the input features and move them to the same device as the model.
audio_inputs = processor(audios=aud, return_tensors="pt", sampling_rate=16000)
audio_inputs = audio_inputs.to("cuda:0")
print(audio_inputs)

# Generate German text tokens and decode them to a string.
out = model.generate(**audio_inputs, tgt_lang="deu")
time.sleep(5)
print(out)
out2 = processor.decode(out[0].tolist(), skip_special_tokens=True)
print(out2)
The file I used, 1_30.zip, is a 30 s snippet of the first chapter of this book: Librivox Link.
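To rule out a problem with the input file itself, a quick check along these lines can confirm the sample rate, channel count, and duration of the WAV before it goes to the processor. This is only a sketch using Python's standard `wave` module. The helper name `describe_wav` is made up for illustration, and it assumes the file is uncompressed PCM WAV.

```python
import wave


def describe_wav(path):
    """Return basic properties of an uncompressed PCM WAV file.

    Hypothetical helper for illustration; not part of transformers or torchaudio.
    """
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate": rate,                       # Hz, as stored in the file header
            "channels": wav.getnchannels(),            # 1 = mono, 2 = stereo
            "sample_width_bytes": wav.getsampwidth(),  # 2 = 16-bit PCM
            "duration_s": frames / rate,               # length in seconds
        }
```

Calling `describe_wav` on the extracted 1_30.wav shows whether the file is mono or stereo and what its original sample rate is. Since `torchaudio.load` returns a tensor of shape (channels, frames), a stereo file would reach the processor as two rows rather than one.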