Open EricVee68 opened 8 months ago
It's hard to tell anything without the audio.
Yesterday a similar issue was posted on my repo -> https://github.com/Purfview/whisper-standalone-win/issues/188
Sometimes changing the compute type or beam size gets a model to transcribe those missing lines. Sometimes nothing helps and only the small models transcribe them (I've observed this with lines that are "ad-like").
Here is the example of the "ad like" issue -> https://github.com/openai/whisper/discussions/1937
Thanks much - you've given me things to ponder. For the sake of fully testing, I'm going to run it through with float16 at beam size 1 and 5, float32 at beam size 1 and 5, and probably with VAD on and off too. Depending on where that takes me, I'll try the slowing-down-the-audio theories. Stay tuned!
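For what it's worth, that test matrix is easy to script. Here is a minimal sketch that just enumerates the combinations; the actual transcription call is a hypothetical faster-whisper invocation (model name and filename are placeholders) and is left commented out:

```python
# Enumerate the sweep: compute type x beam size x VAD on/off.
from itertools import product

COMPUTE_TYPES = ["float16", "float32"]
BEAM_SIZES = [1, 5]
VAD_SETTINGS = [True, False]

def build_matrix():
    """Return one config dict per combination to test (2 x 2 x 2 = 8)."""
    return [
        {"compute_type": ct, "beam_size": bs, "vad_filter": vad}
        for ct, bs, vad in product(COMPUTE_TYPES, BEAM_SIZES, VAD_SETTINGS)
    ]

for cfg in build_matrix():
    print(cfg)
    # Hypothetical run (requires faster-whisper and a CUDA GPU):
    # model = WhisperModel("large-v2", device="cuda",
    #                      compute_type=cfg["compute_type"])
    # segments, info = model.transcribe("episode.mp3",
    #                                   beam_size=cfg["beam_size"],
    #                                   vad_filter=cfg["vad_filter"])
```

Logging which configs do and don't recover the missing lines should make it obvious whether compute type, beam size, or VAD is the deciding factor.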
Situation:
Using the large-v2 model, beam_size=5, CUDA, fp16. Transcribing and translating a pre-recorded song competition. All dialog before, and typically during, a performance is captured and translated with no issues. Toward the very end of a song, or immediately after it, there is often a 2-4 minute section that goes completely "ignored" and is not transcribed/translated.
I completely understand the likelihood of missing some portions of a song, but I need help making the translation recover quickly afterward so that the commentary can be captured.
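One knob that may be relevant here (an assumption on my part, not something confirmed in this thread): with `condition_on_previous_text` at its default of True, a window the model fails on (e.g. a song) can poison the decoding context for the minutes that follow, which matches the "ignored 2-4 minute section" symptom. A minimal sketch of the kwargs, with the hypothetical openai-whisper call (placeholder filename) commented out:

```python
# Hypothetical mitigation sketch: disable cross-window conditioning so one
# bad window (a song) cannot carry a broken context into the commentary
# that follows it.
transcribe_kwargs = {
    "task": "translate",
    "beam_size": 5,
    "fp16": True,
    "condition_on_previous_text": False,  # reset context for each window
}

# Hypothetical call (requires openai-whisper and a downloaded model):
# import whisper
# model = whisper.load_model("large-v2")
# result = model.transcribe("competition.mp3", **transcribe_kwargs)
# print(result["text"])
```

The trade-off is slightly less consistent phrasing across windows, in exchange for each window being decoded independently of whatever happened during the song.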
Thoughts?