Closed johnklee closed 1 month ago
Usage demo:
>>> from speech_recognition_wrapper import GeminiWrapper
>>> import wrapper
>>> sr_gemini = GeminiWrapper(wrapper.LangEnum.en)
>>> sr_gemini.audio_2_text('/tmp/en_20240108_johnlee.wav')
Audio2TextData(text='Hello this is for testing in English We will use this to evaluate model SST and see how it performs thanks ', spent_time_sec=2.483170509338379)
The evaluation result:
+----------------------------+--------------------------------+-------+----------+
| Input | Name | Score | Time (s) |
+----------------------------+--------------------------------+-------+----------+
| en_20240305_louischang.wav | SpeechRecognition/GCP | 0.32 | 6.41 |
| en_20240305_louischang.wav | SpeechRecognition/WhisperAPI | 0.14 | 1.44 |
| en_20240305_louischang.wav | Gemini/models/gemini-1.5-flash | 0.36 | 2.46 |
| en_20240121_johnlee.wav | SpeechRecognition/GCP | 0.0 | 1.04 |
| en_20240121_johnlee.wav | SpeechRecognition/WhisperAPI | 0.0 | 1.25 |
| en_20240121_johnlee.wav | Gemini/models/gemini-1.5-flash | 0.0 | 1.23 |
| en_20240121_chung.wav | SpeechRecognition/GCP | 0.2 | 10.71 |
| en_20240121_chung.wav | SpeechRecognition/WhisperAPI | 0.16 | 2.45 |
| en_20240121_chung.wav | Gemini/models/gemini-1.5-flash | 0.13 | 1.93 |
| en_20240108_johnlee.wav | SpeechRecognition/GCP | 0.05 | 2.86 |
| en_20240108_johnlee.wav | SpeechRecognition/WhisperAPI | 0.0 | 2.21 |
| en_20240108_johnlee.wav | Gemini/models/gemini-1.5-flash | 0.0 | 1.74 |
+----------------------------+--------------------------------+-------+----------+
Ranking Table (metric=<class 'metrics.WordErrorRateMetric'>)
+--------------------------------+------------+--------------+
| Name | Avg. score | Avg. Time(s) |
+--------------------------------+------------+--------------+
| SpeechRecognition/WhisperAPI | 0.08 | 1.84 |
| Gemini/models/gemini-1.5-flash | 0.12 | 1.84 |
| SpeechRecognition/GCP | 0.14 | 5.26 |
+--------------------------------+------------+--------------+
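For reference, the averages in the ranking table can be reproduced from the per-file rows with a few lines of Python. This is only a sketch: the `RESULTS` list and the `rank` helper are hypothetical names introduced here (not part of the project), the numbers are copied from the evaluation table above, and sorting ascending by average score assumes that a lower score is better, as is the case for word error rate.

```python
from collections import defaultdict

# (input file, engine name, score, time in seconds), copied from the evaluation table.
RESULTS = [
    ("en_20240305_louischang.wav", "SpeechRecognition/GCP", 0.32, 6.41),
    ("en_20240305_louischang.wav", "SpeechRecognition/WhisperAPI", 0.14, 1.44),
    ("en_20240305_louischang.wav", "Gemini/models/gemini-1.5-flash", 0.36, 2.46),
    ("en_20240121_johnlee.wav", "SpeechRecognition/GCP", 0.0, 1.04),
    ("en_20240121_johnlee.wav", "SpeechRecognition/WhisperAPI", 0.0, 1.25),
    ("en_20240121_johnlee.wav", "Gemini/models/gemini-1.5-flash", 0.0, 1.23),
    ("en_20240121_chung.wav", "SpeechRecognition/GCP", 0.2, 10.71),
    ("en_20240121_chung.wav", "SpeechRecognition/WhisperAPI", 0.16, 2.45),
    ("en_20240121_chung.wav", "Gemini/models/gemini-1.5-flash", 0.13, 1.93),
    ("en_20240108_johnlee.wav", "SpeechRecognition/GCP", 0.05, 2.86),
    ("en_20240108_johnlee.wav", "SpeechRecognition/WhisperAPI", 0.0, 2.21),
    ("en_20240108_johnlee.wav", "Gemini/models/gemini-1.5-flash", 0.0, 1.74),
]


def rank(results):
    """Return (name, avg_score, avg_time) rows, best (lowest) average score first."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # name -> [score sum, time sum, count]
    for _input_file, name, score, seconds in results:
        entry = totals[name]
        entry[0] += score
        entry[1] += seconds
        entry[2] += 1
    rows = [(name, s / n, t / n) for name, (s, t, n) in totals.items()]
    # Lower score is better; average time breaks ties.
    return sorted(rows, key=lambda row: (row[1], row[2]))


if __name__ == "__main__":
    for name, avg_score, avg_time in rank(RESULTS):
        print(f"{name:32s} {avg_score:.4f} {avg_time:.4f}")
```

The resulting order is WhisperAPI, then Gemini, then GCP, which matches the ranking table.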
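In terms of metrics.WordErrorRateMetric (lower is better), Gemini (Gemini/models/gemini-1.5-flash) performs slightly worse than SpeechRecognition/WhisperAPI (average score 0.12 vs. 0.08) and better than SpeechRecognition/GCP (0.14). Its average time matches WhisperAPI's (1.84 s) and is well below GCP's (5.26 s).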
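For readers unfamiliar with the metric, word error rate is conventionally defined as the word-level edit distance (substitutions + deletions + insertions) between the reference transcript and the recognized text, divided by the number of reference words, so 0.0 means a perfect transcription. The sketch below illustrates that standard definition only; it is not the code behind metrics.WordErrorRateMetric, which is not shown in this issue. The function name `word_error_rate` and the lowercase/whitespace normalization are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("hello this is a test", "hello this is a test"))
    print(word_error_rate("a b c d", "a b d"))
```

Because the denominator is the reference length, a hypothesis with many inserted words can score above 1.0.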
The Gemini API supports audio file input (details), and we want to evaluate it in our model a.