SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

Transcription gap after songs #673

Open EricVee68 opened 8 months ago

EricVee68 commented 8 months ago

Situation:
Using the large-v2 model with beam_size=5, CUDA, and fp16. Transcribing and translating a pre-recorded song competition. All dialog before, and typically during, a performance is captured and translated with no issues. Toward the very end of a song, or immediately after it, there is often a 2-4 minute section that goes completely "ignored" and is not transcribed/translated.

I completely understand the likelihood of missing some portions of a song, but I need help making the translation recover quickly afterward so that the commentary is captured.
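One way to quantify how long the recovery takes before tuning anything is to scan the segment timestamps that faster-whisper yields and flag any stretch of audio with no transcript longer than some threshold. A minimal sketch; the helper name and the 90-second threshold are illustrative, not part of faster-whisper:

```python
def find_gaps(segments, min_gap=90.0):
    """Return (start, end) spans of audio where no segment was produced.

    `segments` is an iterable of (start, end) times in seconds, e.g.
    taken from the Segment objects yielded by WhisperModel.transcribe().
    """
    gaps = []
    prev_end = 0.0
    for start, end in segments:
        if start - prev_end >= min_gap:
            gaps.append((prev_end, start))
        prev_end = max(prev_end, end)
    return gaps

# Example: a 3-minute hole between 120 s and 300 s
print(find_gaps([(0.0, 120.0), (300.0, 320.0)]))  # → [(120.0, 300.0)]
```

Running this over each test configuration makes it easy to compare how quickly each one recovers after a song.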

Thoughts?

Purfview commented 8 months ago

It's hard to tell anything without the audio.

Purfview commented 8 months ago

Yesterday a similar issue was posted on my repo -> https://github.com/Purfview/whisper-standalone-win/issues/188

Sometimes changing the compute type or beam size triggers the model to transcribe those missing lines. Sometimes nothing helps and only the small models transcribe them (I observed this with lines that are "ad-like").

Here is the example of the "ad like" issue -> https://github.com/openai/whisper/discussions/1937

EricVee68 commented 8 months ago

> Yesterday a similar issue was posted on my repo -> Purfview/whisper-standalone-win#188
>
> Sometimes changing compute type or beam size triggers a model to transcribe those missing lines. Sometimes nothing helps and only the small models transcribe those missing lines (this I observed with the lines which are "ad like").
>
> Here is the example of the "ad like" issue -> openai/whisper#1937

Thanks much. You've given me things to ponder. For the sake of fully testing, I'm going to run it through with float16 at beam sizes 1 and 5, float32 at beam sizes 1 and 5, and probably with VAD on and off too. Depending on where that takes me, I'll try the slowing-down-the-audio theories. Stay tuned!
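That test matrix can be generated mechanically. A sketch of the sweep, assuming the standard faster-whisper API (`WhisperModel`, `transcribe` with `beam_size`, `task`, and `vad_filter`); the `run_sweep` helper and the audio path are hypothetical:

```python
from itertools import product

# The eight combinations described above
compute_types = ["float16", "float32"]
beam_sizes = [1, 5]
vad_settings = [True, False]

configs = list(product(compute_types, beam_sizes, vad_settings))
print(len(configs))  # 8 combinations

def run_sweep(audio_path):
    # Requires: pip install faster-whisper, a CUDA GPU, and the audio file.
    from faster_whisper import WhisperModel
    for compute_type, beam_size, vad in configs:
        model = WhisperModel("large-v2", device="cuda",
                             compute_type=compute_type)
        segments, info = model.transcribe(
            audio_path,
            task="translate",
            beam_size=beam_size,
            vad_filter=vad,
        )
        # transcribe() returns a lazy generator; join forces full decoding
        text = " ".join(s.text for s in segments)
        print(compute_type, beam_size, vad, len(text))
```

Comparing the output (or the gap lengths) per configuration should show which knob, if any, shortens the dead zone after a song.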