ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Language detection - Trim silence #1104

Open hermify opened 12 months ago

hermify commented 12 months ago

Hi there,

When the language is set to "auto" and the first 40 seconds of the file are silent, whisper.cpp detects the language as "ca" (Catalan). It would be great if the language detector trimmed silence from the audio before running language detection.

Because I need to create VTT subtitles, I can't remove the silence myself: that would shift the timings.

emcodem commented 11 months ago

Detecting silence is not the right approach; we would need at least VAD (voice activity detection) for that. A simple workaround: extract a 30-second portion from somewhere else in the file and use that for language detection. That step costs almost no time. E.g. using ffmpeg (start at 60 seconds): ffmpeg -ss 60 -i YOURFILE -t 30 DETECTIONFILE.wav
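A minimal Python sketch of the clip-extraction step, in case it has to be scripted. The function names are made up for illustration. The resampling flags (16 kHz, mono, 16-bit PCM) are an addition on top of the command above, on the assumption that the clip is fed straight to whisper.cpp, which expects that WAV format.

```python
import subprocess


def build_clip_command(src, dst, start_s=60, length_s=30):
    """Return an ffmpeg argument list that cuts a short clip out of `src`.

    Hypothetical helper: the defaults mirror the command in the comment
    (start at 60 s, keep 30 s).
    """
    return [
        "ffmpeg", "-y",            # overwrite the clip if it already exists
        "-ss", str(start_s),       # seek before -i, so ffmpeg skips ahead quickly
        "-i", src,
        "-t", str(length_s),       # keep only this many seconds
        "-ar", "16000",            # assumption: resample to 16 kHz for whisper.cpp
        "-ac", "1",                # assumption: downmix to mono
        "-c:a", "pcm_s16le",       # assumption: 16-bit PCM WAV
        dst,
    ]


def extract_clip(src, dst, start_s=60, length_s=30):
    """Run ffmpeg; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_clip_command(src, dst, start_s, length_s), check=True)
```

Calling extract_clip("YOURFILE", "DETECTIONFILE.wav") would then produce the detection clip without touching the original file, so the subtitle timings stay valid.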

hermify commented 11 months ago

A simple and interesting workaround! So I will run the model on the detection file with language = auto and grep the result from the output, e.g. "language: it (p = 0.9)".

I am also wondering: would a smaller model be enough and make the process faster? The "large" model takes about 1x real time. The medium or small model may be enough for language detection?
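As an alternative to grep, the language line could be parsed in a script. This is a sketch that assumes the output contains a fragment shaped like the "language: it (p = 0.9)" example above; the exact log prefix may differ between whisper.cpp versions, and the helper name is hypothetical.

```python
import re

# Matches e.g. "language: it (p = 0.9)"; the code is 2-3 lowercase letters.
LANG_RE = re.compile(r"language:\s*([a-z]{2,3})\s*\(p\s*=\s*([0-9]*\.?[0-9]+)\)")


def parse_detected_language(output):
    """Return (language_code, probability) or None if no match is found."""
    match = LANG_RE.search(output)
    if match is None:
        return None
    return match.group(1), float(match.group(2))
```

Returning the probability as well makes it possible to fall back to a different clip when the detection confidence is low.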

emcodem commented 11 months ago

> The medium or small model may be enough for language detection?

That's a fantastic idea! I assume even the tiny model would serve this purpose very well.
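The two-pass idea (a small model for detection, then the large model for the actual transcription) could be sketched like this. The binary name "./main" and the model paths are assumptions about a local build; -m, -f, -l and -ovtt are the whisper.cpp command-line options for model, input file, language and VTT output.

```python
def detect_command(clip, binary="./main", model="models/ggml-tiny.bin"):
    """Pass 1: run a small model on the short clip with language auto-detection.

    `binary` and `model` are assumed paths; adjust them to the local setup.
    """
    return [binary, "-m", model, "-f", clip, "-l", "auto"]


def transcribe_command(audio, language, binary="./main",
                       model="models/ggml-large.bin"):
    """Pass 2: transcribe the full file with the language fixed, writing VTT."""
    return [binary, "-m", model, "-f", audio, "-l", language, "-ovtt"]
```

Because the second pass runs on the untouched original file, the subtitle timings are unaffected by whatever clip was used for detection.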

hermify commented 11 months ago

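I would add a highpass and a lowpass filter and trim the silence for the detection file.

Something like this (silenceremove actually cuts the silence, whereas silencedetect only reports it, and anullsrc is a source that cannot sit in an -af chain): ffmpeg -i file.wav -af "highpass=f=200,lowpass=f=4500,silenceremove=start_periods=1:start_threshold=-50dB:stop_periods=-1:stop_duration=2:stop_threshold=-50dB" DETECTIONFILE.wav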
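The silencedetect filter does not cut anything; it only logs the silent intervals to stderr, e.g. when run as: ffmpeg -i file.wav -af silencedetect=n=-50dB:d=2 -f null -. A rough sketch of using that log to pick the -ss offset for the detection clip, assuming the usual "silence_start: X" / "silence_end: Y" log lines; the helper name and the tolerance value are made up for illustration.

```python
import re

SILENCE_START_RE = re.compile(r"silence_start:\s*(-?[0-9.]+)")
SILENCE_END_RE = re.compile(r"silence_end:\s*(-?[0-9.]+)")


def first_speech_offset(ffmpeg_log, tolerance_s=0.5):
    """Return the time (seconds) at which the leading silence ends.

    Returns 0.0 when the file does not start with silence, or when the
    log reports no end for the first silent interval.
    """
    starts = [float(x) for x in SILENCE_START_RE.findall(ffmpeg_log)]
    ends = [float(x) for x in SILENCE_END_RE.findall(ffmpeg_log)]
    # Only treat the first interval as leading silence if it begins at ~0 s.
    if starts and ends and starts[0] <= tolerance_s:
        return ends[0]
    return 0.0
```

The returned value can be passed as the -ss argument when cutting the 30-second detection clip, which keeps the original file and its timings unchanged.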

pannous commented 7 months ago

Also see https://github.com/ggerganov/whisper.cpp/issues/1507