[Model A] Support Gemini API

johnklee commented 1 month ago

The Gemini API supports audio file input (details), and we want to evaluate it in our Model A.

johnklee commented 1 month ago

Usage demo:

>>> from speech_recognition_wrapper import GeminiWrapper
>>> import wrapper
>>> sr_gemini = GeminiWrapper(wrapper.LangEnum.en)
>>> sr_gemini.audio_2_text('/tmp/en_20240108_johnlee.wav')
Audio2TextData(text='Hello this is for testing in English We will use this to evaluate model SST and see how it performs thanks ', spent_time_sec=2.483170509338379)

johnklee commented 1 month ago

The evaluation result:

+----------------------------+--------------------------------+-------+----------+
|           Input            |              Name              | Score | Time (s) |
+----------------------------+--------------------------------+-------+----------+
| en_20240305_louischang.wav |     SpeechRecognition/GCP      |  0.32 |   6.41   |
| en_20240305_louischang.wav |  SpeechRecognition/WhisperAPI  |  0.14 |   1.44   |
| en_20240305_louischang.wav | Gemini/models/gemini-1.5-flash |  0.36 |   2.46   |
|  en_20240121_johnlee.wav   |     SpeechRecognition/GCP      |  0.0  |   1.04   |
|  en_20240121_johnlee.wav   |  SpeechRecognition/WhisperAPI  |  0.0  |   1.25   |
|  en_20240121_johnlee.wav   | Gemini/models/gemini-1.5-flash |  0.0  |   1.23   |
|   en_20240121_chung.wav    |     SpeechRecognition/GCP      |  0.2  |  10.71   |
|   en_20240121_chung.wav    |  SpeechRecognition/WhisperAPI  |  0.16 |   2.45   |
|   en_20240121_chung.wav    | Gemini/models/gemini-1.5-flash |  0.13 |   1.93   |
|  en_20240108_johnlee.wav   |     SpeechRecognition/GCP      |  0.05 |   2.86   |
|  en_20240108_johnlee.wav   |  SpeechRecognition/WhisperAPI  |  0.0  |   2.21   |
|  en_20240108_johnlee.wav   | Gemini/models/gemini-1.5-flash |  0.0  |   1.74   |
+----------------------------+--------------------------------+-------+----------+

Ranking Table (metric=<class 'metrics.WordErrorRateMetric'>)
+--------------------------------+------------+--------------+
|              Name              | Avg. score | Avg. Time(s) |
+--------------------------------+------------+--------------+
|  SpeechRecognition/WhisperAPI  |    0.08    |     1.84     |
| Gemini/models/gemini-1.5-flash |    0.12    |     1.84     |
|     SpeechRecognition/GCP      |    0.14    |     5.26     |
+--------------------------------+------------+--------------+
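The averages in the ranking table can be reproduced from the per-file rows. Below is a minimal sketch, with the scores and times copied by hand from the first table; the `rows` list and the `rank` helper are illustrative names and not part of the repository.

```python
from collections import defaultdict
from statistics import mean

# (engine name, WER score, time in seconds), copied from the per-file table.
rows = [
    ("SpeechRecognition/GCP", 0.32, 6.41),
    ("SpeechRecognition/WhisperAPI", 0.14, 1.44),
    ("Gemini/models/gemini-1.5-flash", 0.36, 2.46),
    ("SpeechRecognition/GCP", 0.0, 1.04),
    ("SpeechRecognition/WhisperAPI", 0.0, 1.25),
    ("Gemini/models/gemini-1.5-flash", 0.0, 1.23),
    ("SpeechRecognition/GCP", 0.2, 10.71),
    ("SpeechRecognition/WhisperAPI", 0.16, 2.45),
    ("Gemini/models/gemini-1.5-flash", 0.13, 1.93),
    ("SpeechRecognition/GCP", 0.05, 2.86),
    ("SpeechRecognition/WhisperAPI", 0.0, 2.21),
    ("Gemini/models/gemini-1.5-flash", 0.0, 1.74),
]


def rank(rows):
    """Average score and time per engine, sorted by average score (lowest first)."""
    by_name = defaultdict(list)
    for name, score, seconds in rows:
        by_name[name].append((score, seconds))
    table = [
        (name, mean(s for s, _ in values), mean(t for _, t in values))
        for name, values in by_name.items()
    ]
    # For word error rate a lower score is better, so ascending order is best first.
    return sorted(table, key=lambda entry: entry[1])


for name, avg_score, avg_time in rank(rows):
    print(f"{name:32s} {avg_score:.2f} {avg_time:.2f}")
```

Sorting in ascending order puts the engine with the lowest average error rate first, which matches the row order of the ranking table.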

In terms of metrics.WordErrorRateMetric (lower is better), Gemini (Gemini/models/gemini-1.5-flash) performs slightly worse than SpeechRecognition/WhisperAPI and better than SpeechRecognition/GCP. Its average time matches WhisperAPI (1.84 s) and is well below GCP (5.26 s).
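For reference when reading the scores: word error rate is the word-level edit distance between the reference transcript and the recognized text, divided by the number of reference words, so 0.0 means a perfect transcription. The sketch below assumes this standard definition with simple lowercasing and whitespace tokenization; the repository's metrics.WordErrorRateMetric may normalize text differently, and `word_error_rate` is a hypothetical name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = cur[j - 1] + 1  # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


# One of five reference words is missing from the hypothesis.
print(word_error_rate("hello this is a test", "hello this is test"))
```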

johnklee commented 1 month ago

Done in https://github.com/Cockatoo-AI-Org/Cockatoo.AI/pull/51
