Hi, I've tried a different audio and a different model (the small one); although it worked for the first 5 minutes, after that it's all "[música]".
Hi,
I've used the medium multilingual model (set to GPU implementation) to transcribe a Spanish audio and I keep getting the same transcription all the time:
[00:00:00.000 --> 00:00:03.000] [Música]
[00:00:04.000 --> 00:00:07.000] [Música]
[00:00:30.000 --> 00:00:33.000] [Música]
[00:00:33.000 --> 00:00:36.000] [Música]
[00:00:36.000 --> 00:00:39.000] [Música]
[00:00:39.000 --> 00:00:42.000] [Música]
[00:00:42.000 --> 00:00:45.000] [Música]
...
This is the audio: https://drive.google.com/file/d/1lfbAQWiA2JyKp6IAuqDvMkKOj4EIM9bh/view?usp=sharing
This is the transcription: https://drive.google.com/file/d/1jeNJPUNMUNvD5i_ONKTQW5lwtsbuhSxV/view?usp=sharing
I've tried a different audio from the same podcast, but got a similar result. It is as if the model thought the audio contained only music. The audio does start with a portion of music, but then there are two people talking.
It worked with the large model.
Thank you for your help.