johnklee closed this issue 7 months ago
For Chinese audio, our evaluation results so far are as follows:
```shell
# ./eval.py --lang=cn
=== Processing cn_20240108_johnlee.wav ===
Ground truth text:
中文測試. 這的檔案是使用來檢視模型 SST 的轉換效果. 謝謝.
--------------------
Transformed text (SpeechRecognition/GCP):
中文廁所這個檔案是使用來顯示模型s s t這種怪獸謝謝
Transformed text (SpeechRecognition/WhisperAPI):
中文測試,這個檔案是使用來檢視模型SSD的轉換效果 謝謝
=== Processing cn_20240121_chung.wav ===
Ground truth text:
如果想要將聲音轉換成數位類比訊號的話, 有兩個重要參數影響其品質, 分別是取樣率以及位元深度. 取樣率決定了頻率也就是每秒樣本數的多寡, 越多當然就越接近真實世界的聲音, 而位元深度則決定了記憶體大小來儲存每個樣本.
--------------------
Transformed text (SpeechRecognition/GCP):
如果想要將聲音轉換成數位類比訊號的話有兩個重要參數影響起品質分別是取樣率以及位元深度取樣率決定的頻率也就是每秒樣本數的多寡越多當然就越接近真實世界的聲音而為人深度則決定的記憶體大小來儲存每個樣本
Transformed text (SpeechRecognition/WhisperAPI):
如果想要將聲音轉換成數位類比訊號的話 有兩個重要參數影響其品質 分別是取樣率以及外源深度 取樣率決定了頻率,也就是每秒樣本數的多寡 越多當然就越接近真實世界的聲音 而外源深度則決定了記憶體大小來儲存每個樣本
Raw data (LangEnum.cn)
+-------------------------+------------------------------+-------+----------+
| Input                   | Name                         | Score | Time (s) |
+-------------------------+------------------------------+-------+----------+
| cn_20240108_johnlee.wav | SpeechRecognition/GCP        | 0.36  | 6.27     |
| cn_20240108_johnlee.wav | SpeechRecognition/WhisperAPI | 0.88  | 2.42     |
| cn_20240121_chung.wav   | SpeechRecognition/GCP        | 0.93  | 15.1     |
| cn_20240121_chung.wav   | SpeechRecognition/WhisperAPI | 0.96  | 2.54     |
+-------------------------+------------------------------+-------+----------+
Ranking Table
+------------------------------+------------+--------------+
| Name                         | Avg. score | Avg. Time(s) |
+------------------------------+------------+--------------+
| SpeechRecognition/WhisperAPI | 0.92       | 2.48         |
| SpeechRecognition/GCP        | 0.65       | 10.68        |
+------------------------------+------------+--------------+
```
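The default score above rewards similarity to the ground truth (higher is better). The repo's actual metric implementation is not shown in this thread; as a rough sketch of what such a similarity score could look like, a character-level sequence ratio from Python's standard `difflib` gives comparable behavior (the function name `similarity_score` is an assumption, not the repo's API):

```python
import difflib

def similarity_score(reference: str, hypothesis: str) -> float:
    """Character-level similarity in [0.0, 1.0]; 1.0 means identical.

    This is an illustrative stand-in using difflib.SequenceMatcher,
    not the metric actually used by eval.py.
    """
    return difflib.SequenceMatcher(None, reference, hypothesis).ratio()
```

Character-level (rather than word-level) comparison matters for Chinese, where whitespace does not delimit words.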
For the new metric `wer` (Word Error Rate), the lower the score, the better. The argument `-m`/`--metric` switches metrics, and I made `wer` the default option. For our current dataset, it looks like:
- English corpus
```shell
Raw data (LangEnum.en)
+-------------------------+------------------------------+-------+----------+
| Input                   | Name                         | Score | Time (s) |
+-------------------------+------------------------------+-------+----------+
| en_20240108_johnlee.wav | SpeechRecognition/GCP        | 0.05  | 3.41     |
| en_20240108_johnlee.wav | SpeechRecognition/WhisperAPI | 0.0   | 2.54     |
| en_20240121_johnlee.wav | SpeechRecognition/GCP        | 0.0   | 1.13     |
| en_20240121_johnlee.wav | SpeechRecognition/WhisperAPI | 0.0   | 1.51     |
| en_20240121_chung.wav   | SpeechRecognition/GCP        | 0.2   | 7.87     |
| en_20240121_chung.wav   | SpeechRecognition/WhisperAPI | 0.16  | 3.54     |
+-------------------------+------------------------------+-------+----------+
Ranking Table (metric=<class 'metrics.WordErrorRateMetric'>)
+------------------------------+------------+--------------+
| Name                         | Avg. score | Avg. Time(s) |
+------------------------------+------------+--------------+
| SpeechRecognition/WhisperAPI | 0.05       | 2.53         |
| SpeechRecognition/GCP        | 0.08       | 4.14         |
+------------------------------+------------+--------------+
```
Summary: `WhisperAPI` performs better than `GCP` on both score and speed.
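For reference, WER is the word-level edit distance between reference and hypothesis, normalized by the reference length: `WER = (S + D + I) / N`. The internals of `metrics.WordErrorRateMetric` are not shown in this thread; a minimal, self-contained sketch of the standard computation looks like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER via word-level Levenshtein distance.

    Illustrative sketch only; not necessarily identical to
    metrics.WordErrorRateMetric in the repo.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # dp[i][j] = edit distance between first i ref words and first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    # (S + D + I) / N, where N is the number of reference words
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that for Chinese, whitespace tokenization is a poor fit, so a character-level variant (CER) is often used instead.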
- Chinese corpus
```shell
Raw data (LangEnum.cn)
+-----------------------+------------------------------+-------+----------+
| Input | Name | Score | Time (s) |
+-----------------------+------------------------------+-------+----------+
| cn_20240121_chung.wav | SpeechRecognition/GCP | 0.16 | 8.33 |
| cn_20240121_chung.wav | SpeechRecognition/WhisperAPI | 0.14 | 2.93 |
+-----------------------+------------------------------+-------+----------+
Ranking Table (metric=<class 'metrics.WordErrorRateMetric'>)
+------------------------------+------------+--------------+
| Name | Avg. score | Avg. Time(s) |
+------------------------------+------------+--------------+
| SpeechRecognition/WhisperAPI | 0.14 | 2.93 |
| SpeechRecognition/GCP | 0.16 | 8.33 |
+------------------------------+------------+--------------+
Summary: `WhisperAPI` performs better than `GCP` on both score and speed.
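The `-m`/`--metric` switch described above could be wired up with `argparse` roughly as follows. The flag names and the `wer` default come from this thread; the metric choices and everything else are assumptions for illustration:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of eval.py's CLI; option wiring beyond -m/--metric is assumed."""
    parser = argparse.ArgumentParser(description='Evaluate ASR engines.')
    parser.add_argument('--lang', default='en', choices=['en', 'cn'],
                        help='Language of the test corpus.')
    parser.add_argument('-m', '--metric', default='wer',
                        choices=['wer', 'similarity'],
                        help='Scoring metric; wer is the default.')
    return parser

# Example: equivalent of `./eval.py --lang=cn -m wer`
args = build_parser().parse_args(['--lang=cn', '-m', 'wer'])
```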
Hi @wkCircle,
I am about to close this issue and move on to https://github.com/Cockatoo-AI-Org/Cockatoo.AI/issues/40 if there are no further tasks or suggestions for this issue. Thanks!
In #6, we researched promising Model A candidates. In this issue, we will develop a utility to evaluate those approaches with the performance metrics we are interested in.