Const-me / Whisper

High-performance GPGPU inference of OpenAI's Whisper automatic speech recognition (ASR) model
Mozilla Public License 2.0

Medium multilingual model not working #80

Open vivadavid opened 1 year ago

vivadavid commented 1 year ago

Hi,

I've used the medium multilingual model (set to the GPU implementation) to transcribe a Spanish audio file, and I keep getting the same transcription every time:

[00:00:00.000 --> 00:00:03.000] [Música]
[00:00:04.000 --> 00:00:07.000] [Música]
[00:00:30.000 --> 00:00:33.000] [Música]
[00:00:33.000 --> 00:00:36.000] [Música]
[00:00:36.000 --> 00:00:39.000] [Música]
[00:00:39.000 --> 00:00:42.000] [Música]
[00:00:42.000 --> 00:00:45.000] [Música]
...

This is the audio: https://drive.google.com/file/d/1lfbAQWiA2JyKp6IAuqDvMkKOj4EIM9bh/view?usp=sharing

This is the transcription: https://drive.google.com/file/d/1jeNJPUNMUNvD5i_ONKTQW5lwtsbuhSxV/view?usp=sharing

I've tried a different audio file from the same podcast, but I got a similar result. It's as if the model thinks the audio contains only music. It does start with a stretch of music, but then there are two people talking.

It worked with the large model.
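For what it's worth, one way to narrow down whether this is specific to the GPGPU port or to the medium model itself is to run the same file through the reference OpenAI Python implementation. A minimal sketch, assuming the openai-whisper package is installed and the podcast episode is saved locally as audio.mp3 (both are assumptions, not part of the original report):

```python
# Cross-check with the reference OpenAI implementation (pip install openai-whisper),
# not the Const-me GPGPU port. If "medium" also emits only [Música] here,
# the issue is with the model/audio; if not, it points at the port.
import whisper

model = whisper.load_model("medium")            # multilingual medium model
result = model.transcribe("audio.mp3", language="es")  # hypothetical local path

# Print segments in the same [start --> end] style as the report above
for segment in result["segments"]:
    print(f"[{segment['start']:.3f} --> {segment['end']:.3f}]{segment['text']}")
```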

Thank you for your help.

vivadavid commented 1 year ago

Hi, I've tried a different audio file and a different model (the small one). Although it worked for the first 5 minutes, after that it's all "[música]".