StephanAkkerman closed this issue 1 month ago
The current eng_latn_us_broad.tsv has 77k rows, while en_US has 125k rows. Switching could improve evaluation results but decrease lookup performance.
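For reference, a minimal sketch of loading a WikiPron-style two-column TSV (`word<TAB>IPA`) into a lookup dict — the format and the function name here are assumptions, not the repo's actual loader:

```python
import csv

def load_wikipron_tsv(path):
    """Load a two-column TSV (word<TAB>IPA) into a dict.

    Keeps the first pronunciation seen for each word and
    lowercases keys for case-insensitive lookup.
    """
    ipa = {}
    with open(path, encoding="utf-8") as f:
        for word, pron in csv.reader(f, delimiter="\t"):
            ipa.setdefault(word.lower(), pron)
    return ipa
```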
This will only improve search quality for the top x words.
With the new dataset & fallback:

| method | pearson_corr | spearman_corr |
| --- | --- | --- |
| panphon | 0.418678 | 0.474388 |
| clts | 0.353249 | 0.357871 |

Without fallback:

| method | pearson_corr | spearman_corr |
| --- | --- | --- |
| panphon | 0.599606 | 0.604186 |
| clts | 0.515782 | 0.491983 |

When only using the G2P model:

| method | pearson_corr | spearman_corr |
| --- | --- | --- |
| panphon | 0.376618 | 0.444363 |
| clts | 0.277380 | 0.355129 |
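The dictionary-plus-fallback setup being compared above can be sketched as follows (names are hypothetical; `g2p` stands in for whatever G2P callable is actually in use):

```python
def lookup_ipa(word, ipa_dict, g2p):
    """Return the IPA transcription for a word.

    Tries the pronunciation dictionary first; only falls back to
    the (slower, noisier) G2P model for out-of-vocabulary words.
    """
    pron = ipa_dict.get(word.lower())
    if pron is not None:
        return pron
    return g2p(word)
```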
After scaling with a min-max scaler:

eng_latn_us_broad:

| method | pearson_corr | spearman_corr |
| --- | --- | --- |
| panphon | 0.359480 | 0.382383 |
| clts | 0.397452 | 0.417966 |

| method | pearson_corr | spearman_corr |
| --- | --- | --- |
| panphon | 0.418678 | 0.474388 |
| clts | 0.353249 | 0.357871 |
en_US has better evaluation performance, but lower speed.
A second option would be to use https://github.com/open-dict-data/ipa-dict/blob/master/data/en_US.txt, which is also used by the G2P model. We should see if this improves eval performance:
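The ipa-dict files pair each headword with slash-delimited transcriptions, comma-separated when there are variants (e.g. `word<TAB>/ipa1/, /ipa2/`); a parsing sketch under that assumption:

```python
def parse_ipa_dict_line(line):
    """Parse one ipa-dict line into (word, [transcriptions]).

    Assumes the 'word<TAB>/ipa1/, /ipa2/' layout and strips the
    surrounding slashes from each variant.
    """
    word, prons = line.rstrip("\n").split("\t")
    return word, [p.strip().strip("/") for p in prons.split(",")]
```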
With fallback:

| method | pearson_corr | spearman_corr |
| --- | --- | --- |
| panphon | 0.623008 | 0.618722 |
| clts | 0.400138 | 0.400633 |

Without fallback:

| method | pearson_corr | spearman_corr |
| --- | --- | --- |
| panphon | 0.542018 | 0.537695 |
| clts | 0.610213 | 0.605076 |
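For completeness, the two correlation metrics reported in these tables can be computed without SciPy — a plain sketch (Spearman here does no tie correction, so it matches `scipy.stats.spearmanr` only when all values are distinct):

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def spearman(x, y):
    """Spearman correlation: Pearson on the rank-transformed data."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0] * len(v)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r
    return pearson(ranks(x), ranks(y))
```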