Hi, I've tried a different audio and a different model (the small one); although it worked for the first 5 minutes, after that it's all "[música]".
Hi,
I've used the medium multilingual model (set to GPU implementation) to transcribe a Spanish audio and I keep getting the same transcription all the time:
[00:00:00.000 --> 00:00:03.000] [Música]
[00:00:04.000 --> 00:00:07.000] [Música]
[00:00:30.000 --> 00:00:33.000] [Música]
[00:00:33.000 --> 00:00:36.000] [Música]
[00:00:36.000 --> 00:00:39.000] [Música]
[00:00:39.000 --> 00:00:42.000] [Música]
[00:00:42.000 --> 00:00:45.000] [Música]
...
This is the audio: https://drive.google.com/file/d/1lfbAQWiA2JyKp6IAuqDvMkKOj4EIM9bh/view?usp=sharing
This is the transcription: https://drive.google.com/file/d/1jeNJPUNMUNvD5i_ONKTQW5lwtsbuhSxV/view?usp=sharing
I've tried a different audio from the same podcast, but got a similar result. It is as if the model thought the audio contained only music. The audio does start with a portion of music, but then there are two people talking.
It worked with the large model.
Thank you for your help.