Open hayrapetyan8768 opened 1 year ago
It's very strange: I tried OpenAI Whisper and faster-whisper (both large-v2), and they both stop at 14 s, while ffmpeg confirms the length is 18 s. I don't know Armenian, so I can't compare the transcripts.
In my experience, Whisper works best with both leading and trailing silence of at least 2 seconds (but not much longer). Try adding silence at the end and repeating the transcription.
This is observed for Hindi audio clips as well, when the audio is much longer than 30 s (4 minutes in my case) and there is no natural pause or silence for more than 60 s. Even though Whisper chunks the audio into 30 s clips, for each chunk I get a transcription for only 16 to 18 s and the rest is dropped.
Model used: IndicWhisper (https://huggingface.co/vasista22/whisper-hindi-large-v2) that was finetuned on whisper large-v2.
@raghuch Can you share an audio input with this issue?
Hi @guillaumekln, I am unable to attach the original mp3 or the converted 16 kHz wav file here, so please check these links:
16 kHz wav: https://drive.google.com/file/d/1Wo1I9KlN9jGUTLXQ7DIQuNjsZ62LzHyz/view?usp=drive_link
original mp3: https://drive.google.com/file/d/1z-EfzOaFBVVcjBbrRfs79f9bea1p9Ii2/view?usp=drive_link
The code used is almost the same as in this repo's README.md:
```python
from faster_whisper import WhisperModel

model = WhisperModel(
    "/home/raghu/work/indic-whisper/hindi_models/whisper-large-hi-noldcil-faster/",
    device="cuda",
    cpu_threads=8,
    compute_type="float16",
)
segments, info = model.transcribe(
    "Hindi-prati-ghanta-samachar-2023715171251_16k.wav", beam_size=4, best_of=8
)
for segment in segments:
    print(segment.start, segment.end, segment.text)
```
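To quantify how much audio is being dropped, you can compare the end of the last segment against the total duration (faster-whisper reports it as `info.duration`). A small helper, written here as a sketch with the thread's own numbers (transcription stops at 14 s of an 18 s file):

```python
def dropped_tail(segment_ends, duration, tolerance=1.0):
    """Seconds of audio left untranscribed after the last segment.

    segment_ends: end timestamps (seconds) of the transcribed segments.
    duration: total audio duration in seconds (e.g. info.duration).
    Returns 0.0 when the gap is within `tolerance`.
    """
    last_end = max(segment_ends, default=0.0)
    gap = duration - last_end
    return gap if gap > tolerance else 0.0

print(dropped_tail([5.2, 9.8, 14.0], 18.0))  # 4.0
```

With the code above you would call `dropped_tail([s.end for s in segments], info.duration)` after consuming the `segments` generator (e.g. `segments = list(segments)`), since faster-whisper yields segments lazily.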
And the model "whisper-large-hi-noldcil-faster" was converted from the original via CTranslate2 using this command:
```shell
ct2-transformers-converter \
  --model /home/raghu/work/indic-whisper/hindi_models/whisper-large-hi-noldcil \
  --output_dir /home/raghu/work/indic-whisper/hindi_models/whisper-large-hi-noldcil-faster \
  --quantization float16
```
I am not sure whether a `task='translate'` option, as in plain Whisper inference, or any other options might help here.
Any update on this issue? I see the exact same behaviour with Whisper large-v2 fine-tuned on Common Voice FR data.
I was having the same issue. I was able to resolve it by using ffmpeg to re-encode the file with the LAME MP3 encoder at its highest quality setting:
```shell
ffmpeg -i problematic_audio.mp3 -c:a libmp3lame -q:a 0 output.mp3
```
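If you have many files to fix, the same command can be scripted. A sketch (assumes `ffmpeg` with libmp3lame is on your PATH; the helper names are mine, not from the thread):

```python
import subprocess

def lame_reencode_cmd(src: str, dst: str) -> list[str]:
    """Build the ffmpeg invocation above: LAME encoder, highest VBR quality (-q:a 0)."""
    return ["ffmpeg", "-y", "-i", src, "-c:a", "libmp3lame", "-q:a", "0", dst]

def reencode(src: str, dst: str) -> None:
    """Run the re-encode, raising CalledProcessError if ffmpeg fails."""
    subprocess.run(lame_reencode_cmd(src, dst), check=True)
```

Passing the arguments as a list (rather than a shell string) avoids quoting problems with filenames containing spaces.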
I'm working with the Armenian language. Before transcription, I converted a Hugging Face Whisper medium model that was fine-tuned on Armenian data, using `ct2-transformers-converter --model model/path --output_dir converted/model/path`. When transcribing audio shorter than 30 seconds, it transcribes only the first 10-12 seconds; if the audio is longer, it cuts the audio into 30-second chunks but again transcribes only the first 10-12 seconds of each chunk. Here's an audio example and its output:
https://github.com/guillaumekln/faster-whisper/assets/72487857/b75c63a0-c1cc-44c8-8107-3c5fa1442b6c