SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

Missing ending of transcription #351

Open hayrapetyan8768 opened 1 year ago

hayrapetyan8768 commented 1 year ago

I'm working with the Armenian language. Before transcription I converted a Hugging Face Whisper medium model, fine-tuned on Armenian data, using ct2-transformers-converter --model model/path --output_dir converted/model/path. When transcribing audio shorter than 30 seconds, it transcribes only the first 10-12 seconds; if the audio is longer, it splits it into 30-second chunks but again transcribes only the first 10-12 seconds of every chunk. Here's an audio example and its output:

https://github.com/guillaumekln/faster-whisper/assets/72487857/b75c63a0-c1cc-44c8-8107-3c5fa1442b6c

[0.00s -> 18.04s] առողջապահության նախարարությունը մշակել են նոր ռազմավարությունը այս անգամ առողջության առաջնային պահպանման օղակի առնչվող առողջապահության փոխնախարարի խոսքով հինգամյա նոր ռազմավարությունը ոչ �

phineas-pta commented 1 year ago

It's very weird. I tried openai/whisper and faster-whisper (both large-v2) and they both stop at 14 s, while ffmpeg confirms the length is 18 s. I don't know Armenian, so I can't compare the transcripts.
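For anyone reproducing this, the clip length can also be checked without ffmpeg once the audio is a PCM WAV file. A minimal sketch using Python's stdlib wave module (the file name is a placeholder):

```python
import wave

def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()

# Comparing this value against the end timestamp of the last
# transcribed segment shows how much audio was dropped.
```

Running it on the converted 16 kHz file and comparing with the last segment's end time makes the truncation easy to measure.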

x86Gr commented 1 year ago

In my experience, Whisper works best with silence at both the beginning and the end of the clip, each at least 2 seconds long but not much longer. Try adding silence at the end and repeat the transcription.
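A quick way to try that suggestion without ffmpeg is to append digital silence to the WAV directly. A sketch assuming 16-bit PCM input (file names and the helper name are my own, not from this thread):

```python
import wave

def pad_with_silence(src: str, dst: str, seconds: float = 2.0) -> None:
    """Copy a PCM WAV file, appending `seconds` of digital silence."""
    with wave.open(src, "rb") as wf:
        params = wf.getparams()
        frames = wf.readframes(wf.getnframes())
    # Silence in PCM is zero-valued samples.
    pad = b"\x00" * (int(params.framerate * seconds)
                     * params.sampwidth * params.nchannels)
    with wave.open(dst, "wb") as wf:
        wf.setparams(params)
        wf.writeframes(frames + pad)
```

Transcribing the padded copy would show whether the missing tail is a silence-detection artifact.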

raghuch commented 1 year ago

I observe this for Hindi audio clips as well, when the audio is much longer than 30 s (4 minutes in my case) and there is no natural pause or silence for more than 60 s. Even though Whisper chunks the audio into 30-second clips, for each chunk I get a transcription for only 16 to 18 seconds and the rest is dropped.

Model used: IndicWhisper (https://huggingface.co/vasista22/whisper-hindi-large-v2), which was fine-tuned from whisper large-v2.

guillaumekln commented 1 year ago

@raghuch Can you share an audio input with this issue?

raghuch commented 1 year ago

Hi @guillaumekln, I am unable to attach the original mp3 or the converted 16 kHz wav file here, so please check these links:

16 kHz wav: https://drive.google.com/file/d/1Wo1I9KlN9jGUTLXQ7DIQuNjsZ62LzHyz/view?usp=drive_link
original mp3: https://drive.google.com/file/d/1z-EfzOaFBVVcjBbrRfs79f9bea1p9Ii2/view?usp=drive_link

Code used is almost the same as in the README.md section of this repo:

from faster_whisper import WhisperModel

model = WhisperModel("/home/raghu/work/indic-whisper/hindi_models/whisper-large-hi-noldcil-faster/", device='cuda', cpu_threads=8, compute_type='float16')

segments, info = model.transcribe("Hindi-prati-ghanta-samachar-2023715171251_16k.wav", beam_size=4, best_of=8)

for segment in segments:
    print(segment.start, segment.end, segment.text)

And the model "whisper-large-hi-noldcil-faster" is converted from the original via ctranslate2 using this command:

ct2-transformers-converter --model /home/raghu/work/indic-whisper/hindi_models/whisper-large-hi-noldcil --output_dir /home/raghu/work/indic-whisper/hindi_models/whisper-large-hi-noldcil-faster --quantization float16

I am not sure whether a task='translate' option (as in plain Whisper inference) or any other option might help here.
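One way to quantify the truncation described above is to compare the last segment's end time against the true clip duration. A small helper sketch (the function name and the (start, end, text) tuple shape are my own, not the faster-whisper API, whose segments expose start/end/text as attributes):

```python
def dropped_tail_seconds(segments, audio_duration: float) -> float:
    """Seconds of audio remaining after the last transcribed segment.

    `segments` is an iterable of (start, end, text) tuples; an empty
    iterable means the whole clip was dropped.
    """
    last_end = max((end for _start, end, _text in segments), default=0.0)
    return max(audio_duration - last_end, 0.0)
```

For the Armenian example above (last segment ends at 18.04 s), this would report how many seconds of the clip never made it into the transcript.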

Lorenzoncina commented 1 year ago

Any update on this issue? I see the exact same behaviour with Whisper large-v2 fine-tuned on Common Voice FR data.

JoshMorrison99 commented 1 year ago

I was having the same issue. I was able to resolve it by using ffmpeg to re-encode the audio with the LAME MP3 encoder at the highest quality setting:

ffmpeg -i problematic_audio.mp3 -c:a libmp3lame -q:a 0 output.mp3