SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

Short Audio Transcription and Detection of phone ringing. #580

Open pranavbhat12 opened 11 months ago

pranavbhat12 commented 11 months ago

I am facing 2 issues while transcribing the audio files:

  1. I cannot get a usable transcription for shorter audio clips of 20-30 seconds. I tried several clips, but the results are not good.
  2. In some recordings the initial part is a caller tune or just the phone ringing. For these recordings the transcript is really bad, which I think is because the model treats this part as silence and generates repeated words or echoes the text from the prompt parameter. If I trim this initial part, the transcription is correct. Is there any way to auto-detect such ringing tones at the start?

I would really appreciate any help. Thank you.

EtienneAb3d commented 11 months ago

Try extracting the vocals with spleeter. For more processing options (such as silence/noise removal), see: https://github.com/EtienneAb3d/WhisperHallu

blackpolarz commented 11 months ago

Have you tried playing with the VAD parameters built into faster-whisper? As EtienneAb3d mentioned, extracting vocals with spleeter/demucs can help, but it may also hurt your transcription, because both the spleeter and demucs models work at a sample rate of 44.1 kHz while Whisper is trained on 16 kHz audio.
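
For the ringing-tone case, the built-in VAD can also be called directly to find where speech first starts, so the leading part can be trimmed automatically. The following is only a sketch: `leading_non_speech_seconds` and `trim_leading_non_speech` are hypothetical helper names, and it assumes the VAD shipped with faster-whisper classifies the ring tone as non-speech, which may not hold for caller tunes that contain singing.

```python
from typing import Mapping, Optional, Sequence


def leading_non_speech_seconds(
    speech_chunks: Sequence[Mapping[str, int]],
    sampling_rate: int = 16000,
) -> Optional[float]:
    """Return the offset (in seconds) of the first detected speech chunk.

    `speech_chunks` is assumed to look like the output of
    faster_whisper.vad.get_speech_timestamps: dicts with "start" and "end"
    given in samples. Returns None when no speech was detected.
    """
    if not speech_chunks:
        return None
    return speech_chunks[0]["start"] / sampling_rate


def trim_leading_non_speech(path: str, min_silence_duration_ms: int = 500):
    """Decode `path` at 16 kHz and drop everything before the first speech chunk."""
    # Imported lazily so the pure helper above works without faster-whisper installed.
    from faster_whisper.audio import decode_audio
    from faster_whisper.vad import VadOptions, get_speech_timestamps

    audio = decode_audio(path, sampling_rate=16000)
    options = VadOptions(min_silence_duration_ms=min_silence_duration_ms)
    chunks = get_speech_timestamps(audio, options)
    if not chunks:
        return audio  # no speech found: leave the audio untouched
    return audio[chunks[0]["start"]:]
```

The trimmed array can then be passed to `WhisperModel.transcribe` in place of a file path. Whether this actually removes the ringing should be checked on a few real recordings.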

liyaodev commented 9 months ago

I have tried it. After extracting the vocals, the transcription was not as good as when using the original audio directly. Configuring VAD greatly reduces the misrecognition caused by non-human sounds. Here is my config:

vad_filter=True,
vad_parameters=dict(min_silence_duration_ms=500)

For more options, see: https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/vad.py
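
For context, the two settings above are keyword arguments to `WhisperModel.transcribe`. A minimal sketch of the full call follows; the `small` model size and the `call.wav` file name are assumptions for illustration (neither comes from this thread), and `build_vad_kwargs` and `transcribe_with_vad` are hypothetical helper names.

```python
def build_vad_kwargs(min_silence_duration_ms: int = 500) -> dict:
    """Keyword arguments that enable faster-whisper's built-in VAD filter."""
    return {
        "vad_filter": True,
        "vad_parameters": {"min_silence_duration_ms": min_silence_duration_ms},
    }


def transcribe_with_vad(path: str, model_size: str = "small") -> str:
    """Transcribe `path` with the VAD filter on and return the joined text."""
    # Lazy import: only needed when actually transcribing.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)  # model size is an assumption
    segments, _info = model.transcribe(path, **build_vad_kwargs())
    # `segments` is a generator; transcription runs as it is consumed.
    return " ".join(segment.text.strip() for segment in segments)


# Example call (hypothetical file name):
# print(transcribe_with_vad("call.wav"))
```

Keeping the VAD settings in one small helper makes it easier to try different `min_silence_duration_ms` values on the same recordings and compare the transcripts.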