I've learned that to get accurate transcriptions from OpenWhisper, you need to separate the voice from the rest of the audio and remove silence with a pre-trained, enterprise-grade Voice Activity Detector (VAD).
Which model separates voice best for OpenWhisper to understand could be a nice test case in the main table. Maybe someone could share their favorite?
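To make such a test case comparable across separation models, one option would be to score each model's cleaned audio by the word error rate (WER) of the resulting transcript against a hand-made reference. Below is a minimal sketch in plain Python. The function name `word_error_rate` is my own, and the lowercase-and-split normalization is an assumption (a real comparison would probably also strip punctuation).

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase + whitespace split is enough normalization here.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, only to show the call shape.
    reference = "the quick brown fox"
    transcripts = {
        "model_a": "the quick brown fox",
        "model_b": "the quick brown box thanks for watching",
    }
    for name, text in transcripts.items():
        print(name, word_error_rate(reference, text))
```

A lower score means the transcript is closer to the reference. Hallucinated trailing text counts as insertions, so it pushes the score up and can even take it above 1.0.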
I see https://mvsep.com uses OpenWhisper. Which models does the site use to clean the audio? Without cleaning, Whisper spits out mostly garbage.
Edit: https://mvsep.com uses MDX23C. But how does one clean the audio well enough without a VAD that Whisper does not hallucinate?
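For the no-VAD case, one crude fallback is a plain energy gate applied after vocal separation: drop every stretch whose RMS level stays under a threshold before handing the audio to Whisper. To be clear, this is not a pre-trained VAD and I have no idea whether mvsep.com does anything like it. The function names, the 30 ms frame size, the 0.02 threshold and the padding are all assumptions that would need tuning per recording. Samples are assumed to be mono floats in the range -1.0 to 1.0.

```python
import math


def frame_rms(samples, frame_len):
    """RMS energy of each non-overlapping frame (last frame may be shorter)."""
    rms = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return rms


def speech_segments(samples, sample_rate, frame_ms=30, threshold=0.02, pad_frames=2):
    """Return merged (start, end) sample ranges whose frame energy >= threshold."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    segments = []
    for i, energy in enumerate(frame_rms(samples, frame_len)):
        if energy < threshold:
            continue  # treated as silence
        # Keep a little context around each loud frame so word edges survive.
        start = max(0, i - pad_frames) * frame_len
        end = min(len(samples), (i + 1 + pad_frames) * frame_len)
        if segments and start <= segments[-1][1]:
            segments[-1] = (segments[-1][0], end)  # overlaps previous: merge
        else:
            segments.append((start, end))
    return segments


def drop_silence(samples, sample_rate, **kwargs):
    """Concatenate only the ranges that speech_segments() keeps."""
    kept = []
    for start, end in speech_segments(samples, sample_rate, **kwargs):
        kept.extend(samples[start:end])
    return kept


if __name__ == "__main__":
    rate = 16000
    silence = [0.0] * rate
    # A 440 Hz tone stands in for the separated vocal track.
    tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
    audio = silence + tone + silence
    print(speech_segments(audio, rate))
    print(len(audio), "->", len(drop_silence(audio, rate)))
```

An energy gate only helps when the separated vocal stem is already clean. Leftover music or noise above the threshold passes straight through, which is exactly where a trained VAD would do better.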